Closed: eric-czech closed this issue 4 years ago
RE bim/fam parsing
Summary:
This email summarizes why I think one can always keep metadata in memory, except when you can’t 😊.
It details three times when metadata-in-memory didn’t work for me and what I did instead.
Details:
ASIDE about my vocabulary: In my Computer Science brain: “SNP”==”Variant” and “Individual”==”Sample”
ASIDE 2: It’s very fun for me to discuss these design decisions.
@Eric Czech <eric@related.vc> asks:
Do the current loaders run as pure python? Yes, here is the code: https://github.com/fastlmm/PySnpTools/blob/master/pysnptools/snpreader/snpreader.py
```python
@staticmethod
def _read_fam(basefilename, remove_suffix, add_suffix='fam'):
    famfile = SnpReader._name_of_other_file(basefilename, remove_suffix, add_suffix)
    logging.info("Loading {0} file {1}".format(add_suffix, famfile))
    if os.path.getsize(famfile) > 0:
        iid = np.loadtxt(famfile, dtype='str', usecols=(0, 1), comments=None)
    else:
        iid = np.empty((0, 2), dtype='str')
    if len(iid.shape) == 1:  # When empty or just one item, make sure the result is (x,2)
        iid = iid.reshape((len(iid)//2, 2))
    return iid
```
```python
@staticmethod
def _read_map_or_bim(basefilename, remove_suffix, add_suffix):
    mapfile = SnpReader._name_of_other_file(basefilename, remove_suffix, add_suffix)
    logging.info("Loading {0} file {1}".format(add_suffix, mapfile))
    if os.path.getsize(mapfile) == 0:  # If the map/bim file is empty, return empty arrays
        sid = np.array([], dtype='str')
        pos = np.array([[]], dtype=int).reshape(0, 3)
        return sid, pos
    else:
        fields = pd.read_csv(mapfile, delimiter='\t', usecols=(0, 1, 2, 3),
                             header=None, index_col=False, comment=None)
        sid = np.array(fields[1].tolist(), dtype='str')
        pos = fields[[0, 2, 3]].values
        return sid, pos
```
Would you say they're likely to be faster than simply loading those files into memory as a pandas dataframe? I use the pandas parser and then generate the numpy arrays that PySnpTools requires. It seems fast enough, except … [see answer to next question]
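As a minimal, self-contained sketch of that pandas-then-numpy flow (illustrative only, not the PySnpTools code; the function name and in-memory file are mine), the first two whitespace-delimited columns of a .fam file become the (n, 2) `iid` string array:

```python
import io

import numpy as np
import pandas as pd

def read_fam_with_pandas(fam_source):
    # Hypothetical sketch: parse PLINK .fam text into the (n, 2) array of
    # (family_id, individual_id) string pairs that PySnpTools calls `iid`.
    fields = pd.read_csv(
        fam_source,
        sep=r"\s+",      # .fam files are whitespace-delimited
        header=None,
        usecols=[0, 1],  # family id, individual id
        dtype=str,
    )
    return fields.to_numpy(dtype=str)

# Usage with an in-memory example instead of a real file:
fam_text = "fam0 iid0 0 0 1 -9\nfam0 iid1 0 0 2 -9\n"
iid = read_fam_with_pandas(io.StringIO(fam_text))
```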
Can they fetch chunks of arrays? No, but … A premise of PySnpTools is that the memory and load time needed by linear things, such as sample and variant metadata, doesn’t matter. The reason? Because we are also dealing with a quadratic thing, namely the array of SNP (or SNP-like) values. So, we do a one-time loading of the linear things into memory. After that, fetching chunks is easy. But I can think of at least three times when this premise failed. Here is what I did in each case:

DistributedBed (https://github.com/fastlmm/PySnpTools/blob/master/pysnptools/snpreader/distributedbed.py)
The idea of DistributedBed is to read from multiple Bed files as if they were one. Data is split across variants. To access and index into 100’s of Bed files, metadata is kept in memory as above. However, reading from 100’s of bim files was too slow (minutes), so the metadata is saved to and loaded from a *.npz file (around a second). (Another premise of PySnpTools is that load times should seem instantaneous.)

Epistasis (https://github.com/fastlmm/PySnpTools/blob/pairs/pysnptools/snpreader/pairs.py — an unreleased feature branch called “pairs”)
For our purposes, Epistasis means running an analysis (for example, GWAS) on pairs of SNPs instead of on single SNPs. The experimental Pairs reader takes any SnpReader as input and then presents as output all pairs of SNPs. The value of a pair for an individual is defined as the product of the value at each SNP.
The good: This allows any of the FaST-LMM programs to do Epistasis analysis without any change to their code.
The bad: The number of pairs is quadratic, not linear, and may not fit in memory.
The good: Every time you slice a PySnpTools reader, you get a new subset reader which could have a reasonable number of pairs.
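To make the pair-value definition above concrete, here is a tiny sketch (the array and function names are mine, not the Pairs API): the value of pair (i, j) for each individual is the element-wise product of SNP columns i and j.

```python
import numpy as np

# (individuals x SNPs) values, as a SnpReader's .read().val would give.
val = np.array([[0.0, 1.0, 2.0],
                [2.0, 1.0, 0.0]])

def pair_value(val, i, j):
    # Epistasis pair value: product of the two SNP columns, per individual.
    return val[:, i] * val[:, j]
```

For example, pair (1, 2) has values [2.0, 0.0] for these two individuals.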
(Yet another premise of PySnpTools is that readers should [appear] stateless. So, when we add indexing to a reader, e.g. `bed[:,1000:1010]`, we don’t modify the reader; instead we get a new “subset” reader.) For example, here we create a Bed reader, a Pairs reader, and a subset of the Pairs reader:

```python
from pysnptools.snpreader import Bed

old_dir = os.getcwd()
os.chdir(r'D:\OneDrive\programs\epireml')  #!!!cmk make this work without this

bed = Bed(r'syndata.bed', count_A1=False)
print(f'{bed} with variant count {bed.sid_count:,}')
# Bed('syndata.bed',count_A1=False) with variant count 27,000

pairs = Pairs(bed)
print(f'{pairs} with variant count {pairs.sid_count:,}')
# Pairs(Bed('syndata.bed',count_A1=False)) with variant count 364,486,500

subset = pairs[:10, 10000:10010]
print(f'{subset} with variant count ???')
# Pairs(Bed('syndata.bed',count_A1=False))[:10,10000:10010] with variant count ???

print(subset.read().val.shape)  # Doesn't work, yet
print(subset.sid)               # Doesn't work, yet
```

The bad: My user decided our old special-purpose Epistasis code (https://fastlmm.github.io/FaST-LMM/#epistasis) was all she needed, so I paused work on the Pairs feature; it doesn’t actually work yet. What it needs is for readers that wish to (for example, Pairs) to be able to override indexing into the rows or columns. The _subset class would call these (perhaps overridden) indexing methods and avoid creating an array of, say, 364 million items.

Bgen
At first I couldn’t believe how big the UK Biobank files are (Chrom #1: about 487,409 individuals x 5,765,294 variants!). A user at the Broad was running out of memory and patience, so I switched the metadata to multiple memory-mapped arrays in one file. This seems to be working great. (https://github.com/limix/bgen-reader-py/blob/bgen2memmap/bgen_reader/_multimemmap.py)
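The “avoid materializing 364 million pairs” idea can be sketched independently of PySnpTools: given n SNPs and a flat index k into the n*(n-1)/2 ordered pairs (i, j) with i < j, the pair can be recovered arithmetically, so a subset reader could translate column indices to SNP pairs on demand. The function names here are hypothetical, not part of the Pairs branch:

```python
import math

def pair_count(n):
    # Number of ordered pairs (i, j) with i < j over n SNPs.
    return n * (n - 1) // 2

def pair_from_index(k, n):
    # Invert the row-major enumeration (0,1),(0,2),...,(0,n-1),(1,2),...
    # without building the full pair list. Pairs before row i: i*(2n-i-1)/2.
    i = int((2 * n - 1 - math.sqrt((2 * n - 1) ** 2 - 8 * k)) / 2)
    # Guard against floating-point rounding at large n.
    while i * (2 * n - i - 1) // 2 > k:
        i -= 1
    while (i + 1) * (2 * n - i - 2) // 2 <= k:
        i += 1
    j = i + 1 + (k - i * (2 * n - i - 1) // 2)
    return i, j
```

With n = 27,000 this indexes into the 364,486,500 pairs mentioned above without ever allocating them.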
Carl
Eric,
Do you want to make me a contributor to https://github.com/pystatgen/sgkit-plink/tree/master/sgkit_plink? I’ll do everything on a branch.
> Do you want to make me a contributor to https://github.com/pystatgen/sgkit-plink/tree/master/sgkit_plink? I’ll do everything on a branch.
Sure thing, invite sent.
I apologize for not doing a good job of surfacing these things in this repo yet, but the guidelines/process we've all been following for adding new stuff is summarized to an extent in https://github.com/pystatgen/sgkit#developer-documentation. If you're not familiar with the pre-commit hooks, the idea is that a variety of code checkers run before a commit is made and will either try to fix stuff (code style, import order, etc.) or fail and tell you about some things that should be changed. It will block the commit until all those checkers clear.
The only other part of the process we've been sticking to is to have an issue associated with each PR that comes in. I made some assumptions and put this one up https://github.com/pystatgen/sgkit-plink/issues/23 for you. Let me know if any of that gives you trouble.
Eric,
Sounds good.
I’ve been enjoying learning (for example, from Danilo) about modern Python tools such as Black and travis.com. Now, I can learn about pre-commit hooks.
Question: So, if I want to, for safety, check in non-working code to a remote repository, should I just create my own remote repository? (The commit will be on its own branch.)
> I’ve been enjoying learning (for example, from Danilo) about modern Python tools such as Black and travis.com. Now, I can learn about pre-commit hooks.
Nice! Don't want to overburden you with them but they are certainly useful.
> So, if I want to, for safety, check in non-working code to a remote repository, should I just create my own remote repository? (The commit will be on its own branch.)
I do that by making my own fork to work on instead, mostly to prevent an accidental push to master in the main repo. Most of us do that, though there are some good reasons to work on branches in the main repo directly (we do that sometimes too). It's up to you. Sometimes it's difficult to get non-working code past the pre-commit hooks, and `git commit --no-verify` will skip them if you need it.
Hey @CarlKCarlK, thanks for those details! A few thoughts on them:
> Do the current loaders run as pure python? Yes, here is the code: https://github.com/fastlmm/PySnpTools/blob/master/pysnptools/snpreader/snpreader.py
👍 I see, makes sense.
> Yet another premise of PySnpTools is reader should [appear] stateless
Speaking of stateless, what's the correct way to free any resources a Bed instance might be using? I had seen and used https://github.com/fastlmm/PySnpTools/blob/master/pysnptools/snpreader/bed.py#L109 (`_close_bed`) in the past. I only did that because it seemed like something was leaking in an experiment where I was reading from a large number of small PLINK datasets in the same process. It wasn't clear to me what gc implications freeing that pointer might have. It could have been something else entirely, but I was wondering if you would recommend any kind of exit operation when a Bed file is done being used.
> A user at the Broad was running out of memory and patience, so I switched the metadata to multiple memory-mapped arrays
Gotcha. Ramping up to support 90M variants in that metadata was the use case I had in mind for starting with Dask rather than pandas dataframes (Dask just uses chunks that are themselves lazily loaded pandas dataframes). It has its quirks, though, so perhaps there are some substantial advantages to memory-mapping the files. It would probably depend a good bit on the access pattern.
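As a rough, self-contained illustration of the chunked idea (using pandas' own chunked reader rather than dask, so it runs anywhere; the function name and column names are mine, assumed from the PLINK .bim layout), variant metadata can be scanned without holding the whole file in memory:

```python
import io

import pandas as pd

def iter_bim_chunks(bim_source, chunksize=100_000):
    # Stream a PLINK .bim file in fixed-size chunks so even ~90M variants
    # never have to sit in memory at once. Assumed column layout:
    # chrom, sid, cM, bp, allele1, allele2.
    return pd.read_csv(
        bim_source,
        sep="\t",
        header=None,
        names=["chrom", "sid", "cm", "bp", "a1", "a2"],
        chunksize=chunksize,
    )

# Usage with a tiny in-memory example instead of a real file:
bim_text = "1\tsnp0\t0\t100\tA\tC\n1\tsnp1\t0\t200\tG\tT\n2\tsnp2\t0\t50\tA\tG\n"
total = sum(len(chunk) for chunk in iter_bim_chunks(io.StringIO(bim_text), chunksize=2))
```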
close_bed() should be enough. The only non-GC resource is the file pointer.
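A try/finally sketch of that cleanup, using a stand-in class so the pattern is runnable here without PLINK data (in real use, `bed` would be a PySnpTools Bed reader and the call its close method, per Carl's note):

```python
class FakeBed:
    # Stand-in for a Bed reader, only to demonstrate the cleanup pattern.
    def __init__(self):
        self.closed = False

    def close_bed(self):
        # In the real reader, this releases the underlying file pointer.
        self.closed = True

bed = FakeBed()
try:
    pass  # ... read values, slice, etc. ...
finally:
    bed.close_bed()  # runs even if reading raises, so the file pointer is freed
```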
Yikes, that is big.
A design goal of PySnpTools is to make sure that “the big” doesn’t hurt the experience of “the small”. (For example, versions of FaST-LMM run on clusters, but the same versions run efficiently on a local machine with local disks.) Ideally, I find a single solution that works for both “the big” and “the small”. If that is not possible, I tend to create an abstract class that can let “the small” run fast while offering special accommodations to “the big”.
Greetings!
What examples or advice is there for having Cython and C++ libraries in the project setup?
Here is a piece of what I’m trying to port over: From https://github.com/fastlmm/PySnpTools/blob/master/setup.py
```python
if use_cython:
    ext_modules = [
        Extension(
            name="pysnptools.snpreader.wrap_plink_parser",
            language="c++",
            sources=[
                "pysnptools/snpreader/wrap_plink_parser.pyx",
                "pysnptools/snpreader/CPlinkBedFile.cpp",
            ],
            include_dirs=[numpy.get_include()],
            define_macros=macros,
        ),
```
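One common pattern for this (a sketch only, not the actual PySnpTools or sgkit-plink setup; the module names and paths are illustrative) is to build from the .pyx when Cython is installed and fall back to a pre-generated, checked-in .cpp otherwise, which also bears on whether a cython flag is needed at all:

```python
# Hypothetical setup.py fragment: use Cython if available, else the
# pre-generated C++ file checked into the repository.
import numpy
from setuptools import Extension, setup

try:
    from Cython.Build import cythonize
    use_cython = True
except ImportError:
    use_cython = False

suffix = ".pyx" if use_cython else ".cpp"
ext_modules = [
    Extension(
        name="sgkit_plink.wrap_plink_parser",  # illustrative name
        language="c++",
        sources=[
            "sgkit_plink/wrap_plink_parser" + suffix,
            "sgkit_plink/CPlinkBedFile.cpp",
        ],
        include_dirs=[numpy.get_include()],
    )
]
if use_cython:
    ext_modules = cythonize(ext_modules)

setup(ext_modules=ext_modules)
```

With this arrangement the flag is derived from whether Cython imports, rather than hard-coded, and users without Cython still get a buildable extension.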
Porting more or less as-is to setup.py is a good start. I think you're safe to get something working and then we'll get some more eyes on packaging for compiled code in the PR (I don't know how to do that well).
Do we need the cython flag if the c++ parser wrapper is always available?
Closing this out now since the api usage was reflected in the bed-reader integration (https://github.com/pystatgen/sgkit-plink/pull/30).
Hey @CarlKCarlK, here's a summary of my Bed class usage: I hadn't been using the bim/fam parsers, but only because I wanted to start off building for a scenario (via dask.dataframe) where the variant and sample metadata is larger than memory. I'd like to have a more performant mode for the reader that can load the various fields as in-memory arrays. I was curious how your bim/fam parsers work -- a few questions on that:
Thanks for collaborating with us!