Statistical genetics section

hammer commented 9 months ago

Here's an overview of the work I'd like to do with @jdstamp to get this section together. I started writing it up as an email but figured I'd post it here for visibility.

High-level goal: enumerate and evaluate the correctness and performance of methods in sgkit useful for statistical genetics researchers, then write up a few paragraphs in the paper describing these methods.
The paper is on GitHub at https://github.com/pystatgen/sgkit-publication. I've put it up on Overleaf as well if you'd like to have a look. It's in a bit of a rough state right now but it's worth a skim to see the how the statistical genetics section fits into the larger narrative.
We essentially have 3 statistical genetics methods, listed under the Genetic Association and Regression section of the docs: gwas_linear_regression (tests), regenie (tests), and genee (tests). The first of these is simple, so I'd start there.
To evaluate these methods, we'd like to use the HAPNEST data, as discussed at https://github.com/pystatgen/sgkit-publication/issues/43. You can download this data from BioStudies study S-BSST936. Start with the example data which is restricted to 600 individuals.
I put a Jupyter notebook on GitHub with some Python code to download the HAPNEST data and run a GWAS on it. I modified the code from the sgkit GWAS Tutorial.
We also have some Docker environments and code at https://github.com/pystatgen/sgkit/tree/main/validation/gwas to compare REGENIE results to other implementations. I don't know much about what's in there but @eric-czech might be able to answer questions if you look it over and think it could be useful.
If you need compute or storage resources to do this work, please let me know and I can add you to the Google Cloud project I've been using.

I think a first milestone could be getting the HAPNEST notebook working locally. Then you can try to run the GWAS regression with all phenotypes and examine its output to be sure it's sensible. Once that's sorted, you can run and validate REGENIE and GENE-ε in whatever order seems best to you.

If all of the above is accomplished and you still want to keep going, we can consider several additional directions of work.

Analyze the performance of each method as the data scales along the sample and variant dimensions.
Add additional tests for these methods if necessary.
Improve our documentation of these methods. One idea is to add a worked example over real data as we started but did not finish at https://github.com/pystatgen/sgkit/issues/688.
Add capabilities to export our data to the input formats expected by popular standalone statistical genetics tools like SAIGE and REGENIE.
Improve our implemenations, e.g. https://github.com/pystatgen/sgkit/issues/220, https://github.com/pystatgen/sgkit/issues/597.
Try to scale these methods to the full HAPNEST data. Cf. https://github.com/pystatgen/sgkit/issues/390, https://github.com/pystatgen/sgkit/issues/461.
Do a more complete GWAS pipeline over sequencing-derived biobank-scale variation data. Cf. https://github.com/pystatgen/sgkit/discussions/696, https://github.com/related-sciences/ukb-gwas-pipeline-nealelab
Compare the complexity of implementation of these methods with sgkit abstractions to other implementations, including from other language ecosystems. Cf. https://github.com/pystatgen/sgkit/discussions/701, https://github.com/pystatgen/sgkit/issues/448 (Hail). https://github.com/privefl/bigsnpr (R) and https://github.com/OpenMendel (Julia) also look interesting.

hammer commented 9 months ago

@jdstamp I managed to gather the HAPNEST example data PLINK files and convert them into a single Zarr store on GCS using the code at https://github.com/hammer/sgkitpub/issues/1. I can share this bucket with you if you email me a Google account you'd like to use.

jdstamp commented 9 months ago

Hi Jeff,

I was traveling so I had limited access to internet. I am gonna spend the next week focused on moving the stat gen section forward.

My brown university gmail account is the best for this: @. @.>.

Thank you!

Best,

Julian

On Jan 6, 2024, at 5:58 PM, Jeff Hammerbacher @.***> wrote:

@jdstamp https://github.com/jdstamp I managed to gather the HAPNEST example data PLINK files and put them into a single Zarr store on GCS using the code at hammer/sgkitpub#1 https://github.com/hammer/sgkitpub/issues/1. I can share this bucket with you if you email me a Google account you'd like to use.

— Reply to this email directly, view it on GitHub https://github.com/pystatgen/sgkit-publication/issues/78#issuecomment-1879754218, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH4PUEQ5JFEUAKFWQ2FO5KTYNF7CLAVCNFSM6AAAAABBLQ76JOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZZG42TIMRRHA. You are receiving this because you were mentioned.

hammer commented 9 months ago

Hey @jdstamp,

Sounds great!

GitHub has obfuscated your email address in this comment, please email it to me directly.

Note that in our call yesterday we decided a better way to proceed here is to download chr20 for the full HAPNEST dataset and see if we can get these associations to run at biobank scale.

jeromekelleher commented 9 months ago

Yep - the key question we want to answer is "can the techologies sgkit is built on do the kind of operations we need for GWAS at the million sample scale?" We want to illustrate the kind of thing sgkit can do to demonstrate potential, rather than try to exhaustively demonstrate its current set of features.

hammer commented 9 months ago

@jdstamp

A few updates from my initial comment:

Jerome has put the manuscript on Overleaf at https://www.overleaf.com/project/659c19afc265913df38df1ab
The starting point for your work remains the same: I think it would be valuable to ensure you can run all 3 association tests on the example data and that they produce the expected output. This will ensure we're shipping working code to users.
The primary change is that for writing up the section we'd like to focus on scaling a single method. So, the bullet for "Try to scale these methods to the full HAPNEST data" is no longer a nice to have, it's a must. Scaling up gwas_linear_regression on chr20 should be sufficient.

hammer commented 9 months ago

Also given this focus on scalability I would like to warm up the cache of @tomwhite and @eric-czech. Over the next week or two if y'all have any time and are able to remind yourselves of the work done at https://github.com/pystatgen/sgkit/issues/390 and https://github.com/pystatgen/sgkit/issues/448#issuecomment-780655217, it would be really valuable to have a sanity check on whatever we end up writing up here. Understanding scalability bottlenecks can be hard and you expert input will be much appreciated!

hammer commented 9 months ago

Tom's summary at https://github.com/pystatgen/sgkit/issues/390#issuecomment-775822748 seems particularly useful for understanding how we got to scale for this workload

hammer commented 9 months ago

Forgot about Rafal's experiments at https://github.com/pystatgen/sgkit/issues/437

hammer commented 8 months ago

Update on working with chr20 from the full HAPNEST data:

The synthetic_v1_chr-20.bed file is 36 GB
It took about 30 minutes to get out of FTP onto GitHub Codespaces VM
It took only 3 minutes to write from the VM to GCS
I am using the largest Codespaces VM: 16-core, 64GB RAM, 128GB storage
I kicked off the PLINK to Zarr conversion with 8 workers. This generated 2,356 bed_reader::read_plink tasks, which seems high?
Dask immediately hit me with this warning: UserWarning: Sending large graph of size 9.55 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures.
I'm now 15 minutes into the job and 188 of the read_plink tasks have completed, so if it continues at this pace I guess it will take 3 hours ish? I regret not running this as a background task...
About 20 minutes into the job I lost 2 workers? RIP guys
A few minutes later 1 more worker has died and I have a new warning: WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS
Killed it after 30 minutes and only 4 workers remaining. Moving backwards on tasks completed was not fun to watch.
Trying again with a 10 node Coiled cluster...
Of course we can't read from the cloud so need to find some way to mount a filesystem on the workers blerg
Decided to just grab a high memory instance from GCP
Quota increase request denied on GCP for M2 and M3 instances...
AWS also won't let me launch high memory instances calling it a night.

jeromekelleher commented 8 months ago

Dask really doesn't seem to like our large file x-to-zarr conversion patterns - I wonder if it would be simpler to make our own concurrent.futures based approach? You can get a long way on a single node. File conversion really needs to be robust...

jeromekelleher commented 8 months ago

Alternatively, we could try running the GWAS without converting to Zarr - in principle we should be able to work directly from the plink files.

hammer commented 8 months ago

Yes but I fear we'd run out of resources trying to do the GWAS on the instance types I can access. Also, I'd like to do the conversion so that I can scale out. If I can't get access to the high memory instances on GCP or AWS, I'm a bit stuck. I did not realize they would just deny quota requests! I fear I'm going to have to talk to a salesperson somehow.

I do have a suspicion that we could hard-wire a saner task graph with distributed primitives than the one Dask is producing right now. One challenge for me is the implicit nature of how the Dask task graph is constructed. It looks nice in code but it's all a bit magical and hard to debug. I don't really understand the BED format well enough to understand what our bed_reader code is doing to generate 2,356 separate tasks. If I have time I may start trying to page in the bed file format and Dask task graph construction but it's not really what I care to be doing to be honest...

jeromekelleher commented 8 months ago

Yeah, Dask is awesome when it works - pretty mysterious when it doesn't though.

hammer commented 8 months ago

Well after all that I managed to get an r7i.8xlarge instance on EC2 with 32 vCPUs and 256 GB RAM without needing a quota upgrade and it did the conversion from PLINK to Zarr for chr20 in 12 minutes! At an hourly rate of $2.1168/hr, that's roughly $0.42.

The only wrinkle is that my Dask dashboard didn't load as our installation didn't pick up the bokeh package which is apparently necessary for the dashboard.

I'll proceed with the association test now and see what happens.

hammer commented 8 months ago

Working through the full GWAS tutorial:

xr.plot.hist runs for a long time and throws some timeouts so I killed it and kept moving.
hardy_weinberg_test died with KilledWorker: Attempted to run task ('arange-genotype_as_bytes-index_as_genotype-astype-0d9ea21cc0c15c028899cea1fa9b05c4', 0) on 4 different workers, but all those workers died while running it.
gwas_linear_regression issued some warnings (PerformanceWarning: Increasing number of chunks) and returned right away, so I think it's not getting computed until the manhattan_plot call later? That call gave the familiar warning UserWarning: Sending large graph of size 22.80 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures.

I need to hit the gym now but will hopefully have some time this afternoon to keep digging.

hammer commented 8 months ago

Okay trying full GWAS tutorial again on a Coiled cluster and using compute() to force some work to happen:

ds = sg.sample_stats(ds) --> ds.compute() gives the usual warning UserWarning: Sending large graph of size 13.13 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures. but then when it runs I get RuntimeError: cannot cache function 'count_hom': no locator available for file '/workspaces/codespaces-jupyter/sgkit/sgkit/stats/aggregation_numba_fns.py' which leads to RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments..
I thought the error above might be due to the Numba cache as I've heard y'all complain about it previously. I tried disabling the Numba cache on the launching machine, the scheduler, and the workers. This didn't help. I then saw that others have gotten this error when the NUMBA_CACHE_DIR is not writable, so I tried setting that without any luck. At least I'm learning how to run code on Dask cluster machines?
```
import os
os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1"
os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"
```

def set_env(): import os os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1" os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"

client.run(set_env)

client.run_on_scheduler(set_env)

ds.compute()

- I'm starting to think `.compute()` is not the right way to force these computations? I need to wind down for the day tho. Hopefully AWS approves a high(er) memory instance for me soon and I can just scale up as scaling out does not seem to be going well...
- Got some help from the Coiled team on Slack, now launching a cluster with
```python
import coiled
cluster = coiled.Cluster(
    n_workers=20,
    worker_memory="64 GiB",
    show_widget=False,
    environ={"SGKIT_DISABLE_NUMBA_CACHE": "1", "NUMBA_CACHE_DIR": "/tmp"},
)
client = cluster.get_client()

The above did not work. The Numba issue at https://github.com/numba/numba/issues/4908 maybe indicates the issues is running from an egg/zip rather than permissions to write? I still don't understand why Numba wants to write to disk when SGKIT_DISABLE_NUMBA_CACHE is set...

hammer commented 8 months ago

Back at it today! I managed to get my quotas up to launch a real monster of a machine, a c3d-highmem-360, with over 2 TB of RAM.

I was able to get to the sample_stats(ds) call in the GWAS Tutorial without incident. Unfortunately when I tried to examine a data variable computed by this method such as ds.sample_call_rate I got worker communication errors as well as errors from Dask about CPUs spending too much time in garbage collection.

A few thoughts:

The communication errors may be due to bad firewall rules? I tried opening up a lot of things but could not resolve them.
I'm not sure what could be causing the CPU to spend too much time in garbage collection. The instance clearly had tons of RAM, and each worker had 16 GB. There were a lot of transpose tasks on the Dask dashboard, and I know that's an expensive operation, but I'm not sure what to do next to debug.
I realize I'm getting stuck on QC steps rather than just the association test. I can move on to just looking at the association test but it feels a bit concerning that we can't get through the GWAS Tutorial workload by either scaling up or out. It could be user error, of course!

jeromekelleher commented 8 months ago

Just knowing whether the association test is remotely feasible would be a great data point right now.

hammer commented 8 months ago

I went back to the r7i.8xlarge instance on EC2 with 32 vCPUs and 256 GB RAM and tried to run a minimal path to get to the linear regression and use the results to make a Manhattan plot (which uses the variant_linreg_p_value output of the regression). Remarkably given my prior failures this ran to completion in like 20 minutes, so I think the answer is yes? Memory usage peaked around 140 GiB or so, from what I saw. I have no idea how I avoided the UserWarning: Sending large graph of size I got last time though!

After this completed I tried to examine another output of the linear regression, variant_linreg_beta. This is not a large array, as it only has 153,988 values in it. When I called ds_lr.variant_linreg_beta.values, I was surprised that this too took about 20 minutes to complete. I'm not quite sure what is happening with Dask: does it need to re-run gwas_linear_regression again for some reason? Further, I can repeatedly call ds_lr.variant_linreg_beta.values and it takes ~20 minutes each time; something is happening that I do not understand.

So, it seems the answer is that it is feasible, but given Dask's lazy execution model I need to spend some time to understand how to isolate and profile the work done just for the linear regression.

jeromekelleher commented 8 months ago

I think a good goal here would be to try and reproduce the Manhattan plots in Fig s11 of the HAPNEST paper:

Screenshot from 2024-01-15 09-53-07

They did this with only 50K samples, though.

It's not totally obvious that they are using the same phenotypes here as they have in the released dataset, but we should see qualitatively similar patterns based on the heritability and polygenicity parameters.

I don't think there's much to learn from running QC steps here, as it's synthetic data. What we want to see/show is that we can get Manhattan plots out. Other QC steps are hopefully being illustrated by other parts of the paper.

If merging the datasets across the chromosomes is a pain, I think it's fine working chromosome by chromosome.

hammer commented 8 months ago

@jeromekelleher do you know anyone from the HAPNEST paper who we can ask to sanity check our findings given their generative model?

jeromekelleher commented 8 months ago

I know four of the authors, so yes, happy to ping them

eric-czech commented 8 months ago

it would be really valuable to have a sanity check on whatever we end up writing up here.

As far as the UKB replication goes, here is my recollection of what happened that would be worth keeping in mind:

I was struggling to get a single chromosome GWAS to run ^1 and then had a small breakthrough after realizing how much faster the workflow was with short-fat chunks ^2
I also regularly hit a number of intermittent issues ^3 that forced me to implement some retries in my own code like this as well as have to rebuild clusters somewhat regularly (maybe 10% of the time)
Then I at least worked out that matmul in dask was the problem ^4 since it was scaling linearly (or worse) in memory usage with matrix size while using constant size chunks
@ravwojdyla fixed this upstream in dask ^5
When @tomwhite and I were working on this together, we noted that not chunking in the samples dimension was pretty key to keeping runtimes sane and getting in the ballpark of cost parity with Hail ^6
@tomwhite summarized most things we had to do to keep moving forward with UKB nicely in ^7 and ^8
I originally converted the UKB bgen data to Zarr with chunking in both dimensions and we never rechunked it to remove chunking in samples ^9
Nevertheless, I was still able to get a few chromosomes to run successfully through the full workflow
The main reproduction repo pipeline is quite difficult to setup and run, but there was this notebook (from our internal ukb-analysis repo) that I regularly used to speed up that process -- it's a much simpler demonstration of how get something somewhat close to the Neale Lab results on a single chromosome and sounds a lot like what you're doing @hammer
The p-values I got weren't way off from p-values I was pulling from Open Targets sumstats, but they weren't strikingly similar either and you can see an example of that in the plot at the very end of the notebook
I believe this issue contains at least a couple ideas I had on where differences might come from

Overall, I left this work with the impression that chunking in the samples dimension might be a nonstarter for really scaling GWAS to a UKB-sized dataset. I would recommend trying that @hammer.

I'm not sure what to say on the cluster issues you're hitting though .. I don't recognize any of those. I hit my fair share of similar problems with https://github.com/dask/dask-cloudprovider.

To some of your other issues @hammer:

gwas_linear_regression issued some warnings (PerformanceWarning: Increasing number of chunks) and returned right away, so I think it's not getting computed until the manhattan_plot call later?

That makes sense. It shouldn't compute anything until you reference one of the resulting variables.

I can repeatedly call ds_lr.variant_linreg_beta.values and it takes ~20 minutes each time; something is happening that I do not understand.

It's rerunning the whole regression. The best way to start working with the results, IIRC, is ds[['variant_linreg_beta', 'variant_linreg_p_value']] = ds[['variant_linreg_beta', 'variant_linreg_p_value']].compute(). That will compute both arrays simultaneously and put the materialized versions back in your dataset. You could also just save them in a new dataset like gwas_results = ds[['variant_linreg_beta', 'variant_linreg_p_value']].compute().

There are more examples of this in that UKB notebook I mentioned like this:

Screenshot 2024-01-25 at 4 49 29 PM

That progress bar is quite helpful too. You might find some other useful examples like that in there.

sgkit-dev / sgkit-publication

Statistical genetics section #78