eric-czech opened this issue 4 years ago
I saw this issue linked from https://github.com/dask/dask-blog/pull/38. Some small comments:
> We need to know how to use Dask at scale
FYI I'm making a company around this question. Let me know if you want to chat or be beta testers for cloud deployment products.
> Figuring out what is going on with it in https://h2oai.github.io/db-benchmark/ would be a good exercise
They first load the entire dataset in RAM. Pandas doesn't store string data efficiently. As a result Dask is often spilling to disk during those benchmarks, which is why it's slow. We encouraged them to just include the time to read data from disk rather than starting from memory, but the maintainers of the benchmark said that that would be unfair.
Benchmarks are hard to do honestly.
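For context, here is a minimal sketch (not the benchmark's actual code) contrasting the two timing approaches being discussed: starting from data already persisted in cluster memory, as the benchmark does, versus including the read from disk in the measurement. The file path and column names are placeholders.

```python
# Sketch only: compare timing a query against data persisted in cluster
# memory (benchmark-style) with timing that includes the read from disk.
# If the persisted data exceeds worker memory, Dask spills partitions to
# disk and the "in memory" timing degrades.
import time

import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client()  # local cluster; prints a link to the dashboard

ddf = dd.read_csv("groupby_data_*.csv")  # placeholder path

# Benchmark-style: materialize the whole dataset in cluster memory first.
ddf_mem = ddf.persist()
wait(ddf_mem)

start = time.time()
ddf_mem.groupby("id1")["v1"].sum().compute()
print("groupby starting from memory:", time.time() - start)

# Alternative: count the disk read as part of the measured query.
start = time.time()
ddf.groupby("id1")["v1"].sum().compute()
print("groupby including read from disk:", time.time() - start)
```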
Hey Matt,
> Let me know if you want to chat or be beta testers for cloud deployment products.
Will do, but deployment isn't a big concern quite yet.
> They first load the entire dataset in RAM. Pandas doesn't store string data efficiently. As a result Dask is often spilling to disk during those benchmarks, which is why it's slow. We encouraged them to just include the time to read data from disk rather than starting from memory, but the maintainers of the benchmark said that that would be unfair.
Good to know! It will definitely be helpful to see how we could get to that conclusion with task stream monitoring. Performance with .persist() (I assume that's what they're doing based on your description) isn't particularly interesting for us, so I'm not worried about the actual times so much as being a better user. Do you happen to know if there is a Dask performance report for what they did somewhere?
Not to my knowledge. The "loading from disk" behavior would be evident in the dashboard by orange/red memory bars in the memory plot as well as lots of orange bars showing up in the task stream (red is network transfer, orange is disk transfer).
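For reference, a report like that can be captured with the `performance_report` context manager from `dask.distributed`; the sketch below uses a placeholder file path and groupby, not the benchmark's actual workload.

```python
# Rough sketch: wrap a workload in performance_report to save a standalone
# HTML page that includes the task stream, so disk (orange) and network
# (red) activity can be inspected after the fact.
import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client()
ddf = dd.read_csv("groupby_data_*.csv").persist()  # placeholder data

with performance_report(filename="dask-report.html"):
    ddf.groupby("id1")["v1"].sum().compute()
```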
This issue tracks several more specific issues related to working towards a usable prototype.
Some things we should tackle for this are: