sgkit-dev / sgkit-plink

NOW ARCHIVED! PLINK IO implementations for sgkit. Per https://github.com/pystatgen/sgkit/issues/65, this is now maintained and developed in the sgkit repo.
Apache License 2.0

Pysnptools Bed API Usage #21

Closed · eric-czech closed 4 years ago

eric-czech commented 4 years ago

Hey @CarlKCarlK, here's a summary of my Bed class usage:

import numpy as np
from pysnptools.snpreader import Bed

bed = Bed('data.bed', count_A1=count_A1)  # count_A1 supplied by the caller

# In callback invoked by Dask for fetching a chunk:
def load_chunk(row_slice, col_slice):
    chunk = bed[row_slice, col_slice].read(dtype=np.int8, view_ok=True, _require_float32_64=False).val
    return chunk
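For context, here's roughly how a callback like that gets wired into a lazy dask array on our side (a simplified sketch, not the actual sgkit-plink code; the array shape and chunk size are placeholders):

import dask
import dask.array as da

n_samples, n_variants, chunk = 1_000, 50_000, 100  # placeholder dimensions

blocks = [
    [
        da.from_delayed(
            dask.delayed(load_chunk)(slice(i, min(i + chunk, n_samples)),
                                     slice(j, min(j + chunk, n_variants))),
            shape=(min(i + chunk, n_samples) - i, min(j + chunk, n_variants) - j),
            dtype=np.int8,
        )
        for j in range(0, n_variants, chunk)
    ]
    for i in range(0, n_samples, chunk)
]
genotypes = da.block(blocks)  # lazy (n_samples, n_variants) genotype array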

I hadn't been using the bim/fam parsers, but only because I wanted to start off building for a scenario (via dask.dataframe) where the variant and sample metadata are larger than memory. I'd like to have a more performant mode for the reader that can load the various fields as in-memory arrays. I was curious how your bim/fam parsers work -- a few questions on that below.
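To make the dask.dataframe idea concrete, the metadata side looks roughly like this (a simplified sketch; the column names are my own shorthand for the standard .bim fields, and it assumes a tab-delimited file):

import dask.dataframe as dd

# Lazily load .bim variant metadata that may not fit in memory.
bim = dd.read_csv(
    'data.bim', sep='\t', header=None,
    names=['contig', 'variant_id', 'cm_position', 'position', 'a1', 'a2'],
    dtype={'contig': str, 'variant_id': str, 'a1': str, 'a2': str},
)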

Thanks for collaborating with us!

CarlKCarlK commented 4 years ago

RE bim/fam parsing

Summary:

This email summarizes why I think one can always keep metadata in memory, except when you can’t 😊.

It details three times when metadata-in-memory didn’t work for me and what I did instead.

Details:

ASIDE about my vocabulary: in my Computer Science brain, “SNP” == “Variant” and “Individual” == “Sample”.

ASIDE 2: it’s very fun for me to discuss these design decisions.

@Eric Czech asks:

For our purposes, Epistasis means running an analysis (for example, GWAS) on pairs of SNPs instead of on single SNPs. The experimental Pairs reader takes any SnpReader as input and then presents as output all pairs of SNPs. The value of a pair for an individual is defined as the product of the value at each SNP (see the toy sketch after the list below).

The good: This allows any of the FaST-LMM programs to do Epistasis analysis without any change to their code.

The bad: The number of pairs is quadratic, not linear, and may not fit in memory.

The good: Every time you slice a PySnpTools reader, you get a new subset reader, which could have a reasonable number of pairs.
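A toy illustration of that pair definition (just numpy, not the actual Pairs reader):

import numpy as np

# Values for 3 individuals at 2 SNPs.
snp_a = np.array([0, 1, 2], dtype=np.int8)
snp_b = np.array([1, 1, 0], dtype=np.int8)

# The value of the (A, B) pair for each individual is the elementwise product.
pair_ab = snp_a * snp_b  # -> array([0, 1, 0], dtype=int8)

With n SNPs there are n * (n - 1) / 2 such pairs, hence the quadratic blow-up.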

(Yet another premise of PySnpTools is that a reader should [appear] stateless. So, when we add indexing to a reader (e.g. bed[:,1000:1010]), we don't modify the reader; instead we get a new “subset” reader.) For example, here we create a Bed reader, a Pair reader, and a subset of the Pair reader:

    import os
    from pysnptools.snpreader import Bed

    old_dir = os.getcwd()
    os.chdir(r'D:\OneDrive\programs\epireml')  #!!!cmk make this work without this

    bed = Bed(r'syndata.bed', count_A1=False)
    print(f'{bed} with variant count {bed.sid_count:,}')
CarlKCarlK commented 4 years ago

Eric,

Do you want to make me a contributor to https://github.com/pystatgen/sgkit-plink/tree/master/sgkit_plink? I’ll do everything on a branch.

eric-czech commented 4 years ago

> Do you want to make me a contributor to https://github.com/pystatgen/sgkit-plink/tree/master/sgkit_plink? I’ll do everything on a branch.

Sure thing, invite sent.

I apologize for not doing a good job of surfacing these things in this repo yet, but the guidelines and process we've all been following for adding new stuff are summarized to an extent in https://github.com/pystatgen/sgkit#developer-documentation. If you're not familiar with the pre-commit hooks, the idea is that a variety of code checkers run before a commit is made and will either try to fix things (code style, import order, etc.) or fail and tell you about some things that should be changed. It will block the commit until all those checkers pass.

The only other part of the process we've been sticking to is to have an issue associated with each PR that comes in. I made some assumptions and put this one up https://github.com/pystatgen/sgkit-plink/issues/23 for you. Let me know if any of that gives you trouble.

CarlKCarlK commented 4 years ago

Eric,

Sounds good.

I’ve been enjoying learning (for example, from Danilo) about modern Python tools such as Black and Travis CI. Now, I can learn about pre-commit hooks.

Question: So, if I want to, for safety, check in non-working code to a remote repository, should I just create my own remote repository? (The commit will be on its own branch.)

eric-czech commented 4 years ago

> I’ve been enjoying learning (for example, from Danilo) about modern Python tools such as Black and Travis CI. Now, I can learn about pre-commit hooks.

Nice! Don't want to overburden you with them but they are certainly useful.

> So, if I want to, for safety, check in non-working code to a remote repository, should I just create my own remote repository? (The commit will be on its own branch.)

I do that by making my own fork to work on instead, mostly to prevent an accidental push to master in the main repo. Most of us do that, though there are some good reasons to work on branches in the main repo directly (we do that sometimes too). It's up to you. Sometimes it's difficult to get non-working code past the pre-commit hooks, and git commit --no-verify will skip them if you need it.

eric-czech commented 4 years ago

Hey @CarlKCarlK, thanks for those details! A few thoughts on them:

> Do the current loaders run as pure python?

> Yes, here is the code: https://github.com/fastlmm/PySnpTools/blob/master/pysnptools/snpreader/snpreader.py

👍 I see, makes sense.

> Yet another premise of PySnpTools is that a reader should [appear] stateless

Speaking of stateless, what's the correct way to free any resources a Bed instance might be using? I had seen and used https://github.com/fastlmm/PySnpTools/blob/master/pysnptools/snpreader/bed.py#L109 (_close_bed) in the past. I only did that because it seemed like something was leaking in an experiment where I was reading from a large number of small PLINK datasets in the same process. It wasn't clear to me what GC implications freeing that pointer might have. It could have been something else entirely, but I was wondering if you would recommend any kind of exit operation when a Bed file is done being used.
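For reference, the defensive pattern I had been using looks roughly like this (a sketch; _close_bed is the private method linked above, so no guarantees it's the supported way):

import numpy as np
from pysnptools.snpreader import Bed

bed = Bed('data.bed', count_A1=False)
try:
    chunk = bed[:, :100].read(dtype=np.int8, view_ok=True).val
finally:
    bed._close_bed()  # frees the underlying file handle (private API)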

> A user at the Broad was running out of memory and patience, so I switched the metadata to multiple memory-mapped arrays

Gotcha, ramping up to support 90M variants in that metadata was the use case I had in mind for starting with dask.dataframe rather than plain pandas data frames (dask.dataframe just uses chunks that are themselves lazily-loaded pandas dataframes). It has its quirks though, so perhaps there are some substantial advantages to memory-mapping the files. It would probably depend a good bit on the access pattern.
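For comparison, my understanding of the memory-mapped alternative amounts to something like this (my own sketch, not PySnpTools' actual implementation; the file name and dtype are invented):

import numpy as np

# Hypothetical: ~90M variant IDs stored as fixed-width bytes on disk;
# np.memmap gives array-style access without loading everything into RAM.
sid = np.memmap('variant_ids.dat', dtype='S32', mode='r', shape=(90_000_000,))
print(sid[12_345_678])  # touches only the pages needed for this element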

CarlKCarlK commented 4 years ago

close_bed() should be enough. The only non-GC resource is the file pointer.

Yikes, that is big.

A design goal of PySnpTools is to make sure that “the big” doesn’t hurt the experience of “the small”. (For example, versions of FaST-LMM run on clusters, but the same versions run efficiently on a local machine and local disks.) Ideally, I find a single solution that works for both “the big” and “the small”. If that is not possible, I tend to create an abstract class that can let “the small” run fast while offering special accommodations to “the big”.

CarlKCarlK commented 4 years ago

Greetings!

What examples or advice is there for having Cython and C++ libraries in the project setup?

Here is a piece of what I’m trying to port over, from https://github.com/fastlmm/PySnpTools/blob/master/setup.py:

if use_cython:
    ext_modules = [
        Extension(
            name="pysnptools.snpreader.wrap_plink_parser",
            language="c++",
            sources=[
                "pysnptools/snpreader/wrap_plink_parser.pyx",
                "pysnptools/snpreader/CPlinkBedFile.cpp",
            ],
            include_dirs=[numpy.get_include()],
            define_macros=macros,
        ),

eric-czech commented 4 years ago

Porting more or less as-is to setup.py is a good start. I think you're safe to get something working and then we'll get some more eyes on packaging for compiled code in the PR (I don't know how to do that well).

Do we need the Cython flag if the C++ parser wrapper is always available?
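For what it's worth, the usual reason for a flag like that is falling back to a pre-generated .cpp when Cython isn't installed at build time; a minimal sketch of that pattern (my assumption, not this project's actual setup.py):

import numpy
from setuptools import setup, Extension

try:
    from Cython.Build import cythonize
    have_cython = True
except ImportError:
    have_cython = False  # build from the shipped, pre-generated C++ instead

suffix = '.pyx' if have_cython else '.cpp'
ext = Extension(
    name='pysnptools.snpreader.wrap_plink_parser',
    language='c++',
    sources=['pysnptools/snpreader/wrap_plink_parser' + suffix,
             'pysnptools/snpreader/CPlinkBedFile.cpp'],
    include_dirs=[numpy.get_include()],
)

setup(ext_modules=cythonize([ext]) if have_cython else [ext])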

eric-czech commented 4 years ago

Closing this out now since the API usage was reflected in the bed-reader integration (https://github.com/pystatgen/sgkit-plink/pull/30).