Open hammer opened 9 months ago
@jdstamp I managed to gather the HAPNEST example data PLINK files and convert them into a single Zarr store on GCS using the code at https://github.com/hammer/sgkitpub/issues/1. I can share this bucket with you if you email me a Google account you'd like to use.
Hi Jeff,
I was traveling so I had limited access to internet. I am gonna spend the next week focused on moving the stat gen section forward.
My brown university gmail account is the best for this: @. @.>.
Thank you!
Best,
Julian
On Jan 6, 2024, at 5:58 PM, Jeff Hammerbacher @.***> wrote:
@jdstamp https://github.com/jdstamp I managed to gather the HAPNEST example data PLINK files and put them into a single Zarr store on GCS using the code at hammer/sgkitpub#1 https://github.com/hammer/sgkitpub/issues/1. I can share this bucket with you if you email me a Google account you'd like to use.
— Reply to this email directly, view it on GitHub https://github.com/pystatgen/sgkit-publication/issues/78#issuecomment-1879754218, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH4PUEQ5JFEUAKFWQ2FO5KTYNF7CLAVCNFSM6AAAAABBLQ76JOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZZG42TIMRRHA. You are receiving this because you were mentioned.
Hey @jdstamp,
Sounds great!
GitHub has obfuscated your email address in this comment, please email it to me directly.
Note that in our call yesterday we decided a better way to proceed here is to download chr20 for the full HAPNEST dataset and see if we can get these associations to run at biobank scale.
Yep - the key question we want to answer is "can the techologies sgkit is built on do the kind of operations we need for GWAS at the million sample scale?" We want to illustrate the kind of thing sgkit can do to demonstrate potential, rather than try to exhaustively demonstrate its current set of features.
@jdstamp
A few updates from my initial comment:
gwas_linear_regression
on chr20 should be sufficient.Also given this focus on scalability I would like to warm up the cache of @tomwhite and @eric-czech. Over the next week or two if y'all have any time and are able to remind yourselves of the work done at https://github.com/pystatgen/sgkit/issues/390 and https://github.com/pystatgen/sgkit/issues/448#issuecomment-780655217, it would be really valuable to have a sanity check on whatever we end up writing up here. Understanding scalability bottlenecks can be hard and you expert input will be much appreciated!
Tom's summary at https://github.com/pystatgen/sgkit/issues/390#issuecomment-775822748 seems particularly useful for understanding how we got to scale for this workload
Forgot about Rafal's experiments at https://github.com/pystatgen/sgkit/issues/437
Update on working with chr20
from the full HAPNEST data:
synthetic_v1_chr-20.bed
file is 36 GBbed_reader::read_plink
tasks, which seems high?UserWarning: Sending large graph of size 9.55 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures.
read_plink
tasks have completed, so if it continues at this pace I guess it will take 3 hours ish? I regret not running this as a background task...WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS
Dask really doesn't seem to like our large file x-to-zarr conversion patterns - I wonder if it would be simpler to make our own concurrent.futures based approach? You can get a long way on a single node. File conversion really needs to be robust...
Alternatively, we could try running the GWAS without converting to Zarr - in principle we should be able to work directly from the plink files.
Yes but I fear we'd run out of resources trying to do the GWAS on the instance types I can access. Also, I'd like to do the conversion so that I can scale out. If I can't get access to the high memory instances on GCP or AWS, I'm a bit stuck. I did not realize they would just deny quota requests! I fear I'm going to have to talk to a salesperson somehow.
I do have a suspicion that we could hard-wire a saner task graph with distributed primitives than the one Dask is producing right now. One challenge for me is the implicit nature of how the Dask task graph is constructed. It looks nice in code but it's all a bit magical and hard to debug. I don't really understand the BED format well enough to understand what our bed_reader
code is doing to generate 2,356 separate tasks. If I have time I may start trying to page in the bed file format and Dask task graph construction but it's not really what I care to be doing to be honest...
Yeah, Dask is awesome when it works - pretty mysterious when it doesn't though.
Well after all that I managed to get an r7i.8xlarge
instance on EC2 with 32 vCPUs and 256 GB RAM without needing a quota upgrade and it did the conversion from PLINK to Zarr for chr20 in 12 minutes! At an hourly rate of $2.1168/hr, that's roughly $0.42.
The only wrinkle is that my Dask dashboard didn't load as our installation didn't pick up the bokeh
package which is apparently necessary for the dashboard.
I'll proceed with the association test now and see what happens.
Working through the full GWAS tutorial:
xr.plot.hist
runs for a long time and throws some timeouts so I killed it and kept moving.hardy_weinberg_test
died with KilledWorker: Attempted to run task ('arange-genotype_as_bytes-index_as_genotype-astype-0d9ea21cc0c15c028899cea1fa9b05c4', 0) on 4 different workers, but all those workers died while running it.
gwas_linear_regression
issued some warnings (PerformanceWarning: Increasing number of chunks
) and returned right away, so I think it's not getting computed until the manhattan_plot
call later? That call gave the familiar warning UserWarning: Sending large graph of size 22.80 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures.
I need to hit the gym now but will hopefully have some time this afternoon to keep digging.
Okay trying full GWAS tutorial again on a Coiled cluster and using compute()
to force some work to happen:
ds = sg.sample_stats(ds)
--> ds.compute()
gives the usual warning UserWarning: Sending large graph of size 13.13 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures.
but then when it runs I get RuntimeError: cannot cache function 'count_hom': no locator available for file '/workspaces/codespaces-jupyter/sgkit/sgkit/stats/aggregation_numba_fns.py'
which leads to RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments.
. NUMBA_CACHE_DIR
is not writable, so I tried setting that without any luck. At least I'm learning how to run code on Dask cluster machines?
import os
os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1"
os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"
def set_env(): import os os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1" os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"
client.run(set_env)
client.run_on_scheduler(set_env)
ds.compute()
- I'm starting to think `.compute()` is not the right way to force these computations? I need to wind down for the day tho. Hopefully AWS approves a high(er) memory instance for me soon and I can just scale up as scaling out does not seem to be going well...
- Got some help from the Coiled team on Slack, now launching a cluster with
```python
import coiled
cluster = coiled.Cluster(
n_workers=20,
worker_memory="64 GiB",
show_widget=False,
environ={"SGKIT_DISABLE_NUMBA_CACHE": "1", "NUMBA_CACHE_DIR": "/tmp"},
)
client = cluster.get_client()
SGKIT_DISABLE_NUMBA_CACHE
is set...Back at it today! I managed to get my quotas up to launch a real monster of a machine, a c3d-highmem-360
, with over 2 TB of RAM.
I was able to get to the sample_stats(ds)
call in the GWAS Tutorial without incident. Unfortunately when I tried to examine a data variable computed by this method such as ds.sample_call_rate
I got worker communication errors as well as errors from Dask about CPUs spending too much time in garbage collection.
A few thoughts:
Just knowing whether the association test is remotely feasible would be a great data point right now.
I went back to the r7i.8xlarge instance on EC2 with 32 vCPUs and 256 GB RAM and tried to run a minimal path to get to the linear regression and use the results to make a Manhattan plot (which uses the variant_linreg_p_value
output of the regression). Remarkably given my prior failures this ran to completion in like 20 minutes, so I think the answer is yes? Memory usage peaked around 140 GiB or so, from what I saw. I have no idea how I avoided the UserWarning: Sending large graph of size
I got last time though!
After this completed I tried to examine another output of the linear regression, variant_linreg_beta
. This is not a large array, as it only has 153,988 values in it. When I called ds_lr.variant_linreg_beta.values
, I was surprised that this too took about 20 minutes to complete. I'm not quite sure what is happening with Dask: does it need to re-run gwas_linear_regression
again for some reason? Further, I can repeatedly call ds_lr.variant_linreg_beta.values
and it takes ~20 minutes each time; something is happening that I do not understand.
So, it seems the answer is that it is feasible, but given Dask's lazy execution model I need to spend some time to understand how to isolate and profile the work done just for the linear regression.
I think a good goal here would be to try and reproduce the Manhattan plots in Fig s11 of the HAPNEST paper:
They did this with only 50K samples, though.
It's not totally obvious that they are using the same phenotypes here as they have in the released dataset, but we should see qualitatively similar patterns based on the heritability and polygenicity parameters.
I don't think there's much to learn from running QC steps here, as it's synthetic data. What we want to see/show is that we can get Manhattan plots out. Other QC steps are hopefully being illustrated by other parts of the paper.
If merging the datasets across the chromosomes is a pain, I think it's fine working chromosome by chromosome.
@jeromekelleher do you know anyone from the HAPNEST paper who we can ask to sanity check our findings given their generative model?
I know four of the authors, so yes, happy to ping them
it would be really valuable to have a sanity check on whatever we end up writing up here.
As far as the UKB replication goes, here is my recollection of what happened that would be worth keeping in mind:
ukb-analysis
repo) that I regularly used to speed up that process -- it's a much simpler demonstration of how get something somewhat close to the Neale Lab results on a single chromosome and sounds a lot like what you're doing @hammer Overall, I left this work with the impression that chunking in the samples dimension might be a nonstarter for really scaling GWAS to a UKB-sized dataset. I would recommend trying that @hammer.
I'm not sure what to say on the cluster issues you're hitting though .. I don't recognize any of those. I hit my fair share of similar problems with https://github.com/dask/dask-cloudprovider.
To some of your other issues @hammer:
gwas_linear_regression issued some warnings (PerformanceWarning: Increasing number of chunks) and returned right away, so I think it's not getting computed until the manhattan_plot call later?
That makes sense. It shouldn't compute anything until you reference one of the resulting variables.
I can repeatedly call ds_lr.variant_linreg_beta.values and it takes ~20 minutes each time; something is happening that I do not understand.
It's rerunning the whole regression. The best way to start working with the results, IIRC, is ds[['variant_linreg_beta', 'variant_linreg_p_value']] = ds[['variant_linreg_beta', 'variant_linreg_p_value']].compute()
. That will compute both arrays simultaneously and put the materialized versions back in your dataset. You could also just save them in a new dataset like gwas_results = ds[['variant_linreg_beta', 'variant_linreg_p_value']].compute()
.
There are more examples of this in that UKB notebook I mentioned like this:
That progress bar is quite helpful too. You might find some other useful examples like that in there.
Here's an overview of the work I'd like to do with @jdstamp to get this section together. I started writing it up as an email but figured I'd post it here for visibility.
I think a first milestone could be getting the HAPNEST notebook working locally. Then you can try to run the GWAS regression with all phenotypes and examine its output to be sure it's sensible. Once that's sorted, you can run and validate REGENIE and GENE-ε in whatever order seems best to you.
If all of the above is accomplished and you still want to keep going, we can consider several additional directions of work.