sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Create a public GWAS workflow for collaborative optimization #438

Open eric-czech opened 3 years ago

eric-czech commented 3 years ago

Demonstrating GPU cost efficiency and progressing on the issues raised in https://github.com/pystatgen/sgkit/issues/390 may both be aided by creating a representative public workflow for UKB GWAS. A single notebook that executes a simplified version of this code (with an initial rechunking) should be sufficient, provided it is run on simulated data (stored as zarr in Google Storage) with block compressibility reasonably similar to that of real data.
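As a starting point, here is a minimal sketch of that setup, assuming dask and gcsfs are available; the bucket path is a placeholder, not a real store. It simulates alternate allele counts with a per-variant allele frequency spectrum (so blocks compress somewhat realistically) and writes them out as chunked zarr:

```python
import numpy as np
import dask.array as da

n, m = 10_000, 100_000  # variants x individuals (demo sizes)
p = np.random.default_rng(0).beta(0.5, 0.5, size=n)  # per-variant allele frequencies

def sim_block(block, block_info=None):
    # Simulate Binomial(2, p) alternate allele counts for this block's variants
    (r0, r1), (c0, c1) = block_info[0]["array-location"]
    rng = np.random.default_rng((r0, c0))  # distinct stream per block
    return rng.binomial(2, p[r0:r1, None], size=block.shape).astype("i1")

gt = da.zeros((n, m), chunks=(1_000, 10_000), dtype="i1").map_blocks(sim_block, dtype="i1")
gt.to_zarr("gs://my-bucket/sim.zarr", overwrite=True)  # placeholder bucket path
```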

Hey @quasiben, we'll update this issue with any progress towards a shared workflow for us all to look at.

jeromekelleher commented 3 years ago

Simulated genotypes from msprime should be a good dataset perf-wise; let me know if I can help here. We could possibly also reproduce some of the sub-optimal behaviour on smaller datasets, to make things a bit more nimble.

quasiben commented 3 years ago

Would calculating LD scores be a reasonable goal?

eric-czech commented 3 years ago

Would calculating LD scores be a reasonable goal?

I think it's probably best to avoid that one because its calculation often involves domain-specific shortcuts. It will be more compelling, and easier for all of us, if we start with a workflow that uses only general-purpose operators.

Here is a GWAS Simulation (gist) example that writes somewhat representative data to cloud storage and reads it out to do linear regressions (w/ several algebraic shortcuts). This is effectively all a GWAS is.
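For reference, a rough sketch of that core trick (not the gist's exact code, and the names here are illustrative): residualize the genotypes and phenotype against the covariates, then solve every single-variant OLS regression at once with matrix products.

```python
import numpy as np

def gwas_ols(G, X, y):
    """G: (variants, samples) dosages; X: (samples, covariates) incl. intercept; y: (samples,)."""
    # Residualize genotypes and phenotype against the covariates
    Q, _ = np.linalg.qr(X)
    Gp = G - (G @ Q) @ Q.T
    yp = y - Q @ (Q.T @ y)
    # All single-variant OLS slopes and t statistics in one shot
    gg = np.einsum("ij,ij->i", Gp, Gp)  # per-variant sum of squares
    beta = (Gp @ yp) / gg
    dof = X.shape[0] - X.shape[1] - 1   # samples - covariates - variant
    rss = yp @ yp - beta**2 * gg        # per-variant residual sum of squares
    t = beta * np.sqrt(gg * dof / rss)
    return beta, t
```

Residualizing via QR and then taking dot products is why the whole sweep reduces to a handful of large matrix multiplies, which is exactly the kind of work GPUs are good at.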

How does the gist look at a glance @quasiben? As far as I know, all of the operations in use there are supposed to be implemented in CuPy. You could also cut out the part that uses dask.array.stats.

Simulated genotypes from msprime should be a good dataset perf wise; let me know if I can help here.

It would be terrific if we could generate allele dosages with a more accurate distribution. In the case of UKB, the uncertainty in allele count comes from imputation, so I'm not sure how to simulate that. It may not be important either -- simulating alternate allele counts alone, as in https://github.com/pystatgen/sgkit/issues/23#issue-653340614, might give this data block compressibility similar enough to the real thing. I think that's likely to be the most impactful attribute of the genetic data as far as performance is concerned, at least for this example.
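Purely illustrative, since none of us has a real dosage simulator: one cheap way to fake imputation-style uncertainty would be to blend hard Binomial(2, p) calls toward their expectation with a per-variant quality factor, loosely mimicking an imputation INFO score (all names and parameter choices below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1_000, 5_000
p = rng.beta(0.5, 0.5, size=(n, 1))      # allele frequencies
r2 = rng.uniform(0.6, 1.0, size=(n, 1))  # hypothetical per-variant imputation quality

hard = rng.binomial(2, p, size=(n, m)).astype("f4")
dosage = r2 * hard + (1 - r2) * 2 * p    # shrink low-quality variants toward 2p
```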

Let me know if you have any immediate ideas, otherwise I think we could probably start with this simpler example and add more nuance if it ends up being easy to optimize.

jeromekelleher commented 3 years ago

Ah, good points. I don't have code to simulate dosages @eric-czech. I think what you're doing in the link above is going to be a pretty good proxy for perf purposes.

quasiben commented 3 years ago

I poked at this a bit recently. @beckernick reminded me that there is a cuml/dask linear regression, but that probably won't work here. With the broken-out least squares we need to fix a scipy/cupy issue within dask: https://github.com/dask/dask/issues/7537

I don't think this should be too hard to resolve; once that's done we can come back to testing.

eric-czech commented 3 years ago

there is a cuml/dask linear-regression but that probably won't work here

Probably -- the code is solving a large number of individual OLS regressions at once, so my assumption was that packing it all into linear algebra operators would be faster than iterating over a dimension of the input matrix and doing the regressions individually (i.e. by calling cuml/LinearRegression.fit thousands or millions of times).
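A toy illustration of the difference, with shrunken sizes and illustrative names; both strategies produce the same slope estimates, but the second replaces millions of tiny solves with two matrix products:

```python
import numpy as np

n_variants, n_samples = 2_000, 1_000
G = np.random.rand(n_variants, n_samples)
y = np.random.rand(n_samples)

# Strategy A: one fit per variant (what calling .fit per variant amounts to)
betas_loop = np.array([np.polyfit(g, y, 1)[0] for g in G])

# Strategy B: all variants at once via dot products
Gc = G - G.mean(axis=1, keepdims=True)
yc = y - y.mean()
betas_vec = (Gc @ yc) / np.einsum("ij,ij->i", Gc, Gc)

assert np.allclose(betas_loop, betas_vec)
```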

With the broken-out least squares we need to fix a scipy/cupy issue within dask: dask/dask#7537

That's a bummer!

quasiben commented 3 years ago

It's not too bad. I think most of the parts we care about are working. However, the next issue might take a bit more work: I don't think there is much support in CuPy for distribution functions (cdf, pdf, sf). This is related to the last call in `def gwas()`:

stats.distributions.t.sf
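One possible stopgap, sketched under the assumption that the t statistics live in a CuPy-backed dask array (`t_stats` and `dof` are placeholder names): round-trip each block through host memory just for the survival function.

```python
import cupy as cp
from scipy import stats

def t_sf_on_host(block, dof):
    # CuPy lacks t.sf, so evaluate it on the CPU and move the result back
    return cp.asarray(stats.distributions.t.sf(cp.asnumpy(block), dof))

# p_values = t_stats.map_blocks(t_sf_on_host, dof, dtype="f8")
```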

quasiben commented 3 years ago

I also benchmarked what was currently supported and saw a modest improvement: https://gist.github.com/quasiben/f7f3099b3b250101d15dd4b3a2fda26e

n = 10000 # Number of variants (i.e. genomic locations)
m = 100000 # Number of individuals (i.e. people)
c = 3 # Number of covariates (i.e. confounders)

Data was pre-generated before running the example, which was executed on a single GPU:

(sgkit) bzaitlen@dgx13:/datasets/bzaitlen/GitRepos/sgkit-example$ time python gwas-gpu.py
Results saved to file://sim_res-optimization-gpu_10000_100000_3.zarr

real    0m7.526s
user    0m14.832s
sys     0m14.059s
(sgkit) bzaitlen@dgx13:/datasets/bzaitlen/GitRepos/sgkit-example$ time python gwas.py
Results saved to file://sim_res-optimization_10000_100000_3.zarr

real    0m11.200s
user    3m41.249s
sys     5m42.028s
quasiben commented 3 years ago

I ran with the Biobank settings:

n = 141910 # Number of variants (i.e. genomic locations)
m = 365941 # Number of individuals (i.e. people)
c = 25 # Number of covariates (i.e. confounders)

And the computation took ~100 seconds. This was done on a DGX2 with 16x 32GB V100s.

eric-czech commented 3 years ago

And the computation took ~100 seconds. This was done on a DGX2 with 16x 32GB V100s.

Very cool @quasiben! While this isn't a completely like-for-like comparison, the times I saw for this on real data were ~150 seconds using 60 n1-highmem-16 GCE nodes. That is fairly downwardly biased though, since the number of individuals used varies by phenotype and, for those settings in the simulation, I chose a number on the high side.

Remind me again what the storage medium is? Is it possible to run the benchmark while reading from cloud storage?
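If it helps, a minimal sketch of pointing the read at cloud storage, with a placeholder path and assuming gcsfs is installed; dask opens the zarr store directly via fsspec, so the benchmark body itself would stay unchanged:

```python
import dask.array as da

# "gs://my-bucket/sim.zarr" is a placeholder, not a real store
gt = da.from_zarr("gs://my-bucket/sim.zarr")
print(gt.chunks)  # confirm the on-disk chunking is picked up
```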