sgkit-dev / sgkit-publication

Sgkit publication repository
5 stars 5 forks source link

Statistical genetics section #78

Open hammer opened 9 months ago

hammer commented 9 months ago

Here's an overview of the work I'd like to do with @jdstamp to get this section together. I started writing it up as an email but figured I'd post it here for visibility.

I think a first milestone could be getting the HAPNEST notebook working locally. Then you can try to run the GWAS regression with all phenotypes and examine its output to be sure it's sensible. Once that's sorted, you can run and validate REGENIE and GENE-ε in whatever order seems best to you.

If all of the above is accomplished and you still want to keep going, we can consider several additional directions of work.

hammer commented 9 months ago

@jdstamp I managed to gather the HAPNEST example data PLINK files and convert them into a single Zarr store on GCS using the code at https://github.com/hammer/sgkitpub/issues/1. I can share this bucket with you if you email me a Google account you'd like to use.

jdstamp commented 9 months ago

Hi Jeff,

I was traveling so I had limited access to internet. I am gonna spend the next week focused on moving the stat gen section forward.

My brown university gmail account is the best for this: @. @.>.

Thank you!

Best,

Julian

On Jan 6, 2024, at 5:58 PM, Jeff Hammerbacher @.***> wrote:

@jdstamp https://github.com/jdstamp I managed to gather the HAPNEST example data PLINK files and put them into a single Zarr store on GCS using the code at hammer/sgkitpub#1 https://github.com/hammer/sgkitpub/issues/1. I can share this bucket with you if you email me a Google account you'd like to use.

— Reply to this email directly, view it on GitHub https://github.com/pystatgen/sgkit-publication/issues/78#issuecomment-1879754218, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH4PUEQ5JFEUAKFWQ2FO5KTYNF7CLAVCNFSM6AAAAABBLQ76JOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZZG42TIMRRHA. You are receiving this because you were mentioned.

hammer commented 9 months ago

Hey @jdstamp,

Sounds great!

GitHub has obfuscated your email address in this comment, please email it to me directly.

Note that in our call yesterday we decided a better way to proceed here is to download chr20 for the full HAPNEST dataset and see if we can get these associations to run at biobank scale.

jeromekelleher commented 9 months ago

Yep - the key question we want to answer is "can the techologies sgkit is built on do the kind of operations we need for GWAS at the million sample scale?" We want to illustrate the kind of thing sgkit can do to demonstrate potential, rather than try to exhaustively demonstrate its current set of features.

hammer commented 9 months ago

@jdstamp

A few updates from my initial comment:

hammer commented 9 months ago

Also given this focus on scalability I would like to warm up the cache of @tomwhite and @eric-czech. Over the next week or two if y'all have any time and are able to remind yourselves of the work done at https://github.com/pystatgen/sgkit/issues/390 and https://github.com/pystatgen/sgkit/issues/448#issuecomment-780655217, it would be really valuable to have a sanity check on whatever we end up writing up here. Understanding scalability bottlenecks can be hard and you expert input will be much appreciated!

hammer commented 9 months ago

Tom's summary at https://github.com/pystatgen/sgkit/issues/390#issuecomment-775822748 seems particularly useful for understanding how we got to scale for this workload

hammer commented 9 months ago

Forgot about Rafal's experiments at https://github.com/pystatgen/sgkit/issues/437

hammer commented 8 months ago

Update on working with chr20 from the full HAPNEST data:

jeromekelleher commented 8 months ago

Dask really doesn't seem to like our large file x-to-zarr conversion patterns - I wonder if it would be simpler to make our own concurrent.futures based approach? You can get a long way on a single node. File conversion really needs to be robust...

jeromekelleher commented 8 months ago

Alternatively, we could try running the GWAS without converting to Zarr - in principle we should be able to work directly from the plink files.

hammer commented 8 months ago

Yes but I fear we'd run out of resources trying to do the GWAS on the instance types I can access. Also, I'd like to do the conversion so that I can scale out. If I can't get access to the high memory instances on GCP or AWS, I'm a bit stuck. I did not realize they would just deny quota requests! I fear I'm going to have to talk to a salesperson somehow.

I do have a suspicion that we could hard-wire a saner task graph with distributed primitives than the one Dask is producing right now. One challenge for me is the implicit nature of how the Dask task graph is constructed. It looks nice in code but it's all a bit magical and hard to debug. I don't really understand the BED format well enough to understand what our bed_reader code is doing to generate 2,356 separate tasks. If I have time I may start trying to page in the bed file format and Dask task graph construction but it's not really what I care to be doing to be honest...

jeromekelleher commented 8 months ago

Yeah, Dask is awesome when it works - pretty mysterious when it doesn't though.

hammer commented 8 months ago

Well after all that I managed to get an r7i.8xlarge instance on EC2 with 32 vCPUs and 256 GB RAM without needing a quota upgrade and it did the conversion from PLINK to Zarr for chr20 in 12 minutes! At an hourly rate of $2.1168/hr, that's roughly $0.42.

The only wrinkle is that my Dask dashboard didn't load as our installation didn't pick up the bokeh package which is apparently necessary for the dashboard.

I'll proceed with the association test now and see what happens.

hammer commented 8 months ago

Working through the full GWAS tutorial:

I need to hit the gym now but will hopefully have some time this afternoon to keep digging.

hammer commented 8 months ago

Okay trying full GWAS tutorial again on a Coiled cluster and using compute() to force some work to happen:

def set_env(): import os os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1" os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"

client.run(set_env)

client.run_on_scheduler(set_env)

ds.compute()

- I'm starting to think `.compute()` is not the right way to force these computations? I need to wind down for the day tho. Hopefully AWS approves a high(er) memory instance for me soon and I can just scale up as scaling out does not seem to be going well...
- Got some help from the Coiled team on Slack, now launching a cluster with
```python
import coiled
cluster = coiled.Cluster(
    n_workers=20,
    worker_memory="64 GiB",
    show_widget=False,
    environ={"SGKIT_DISABLE_NUMBA_CACHE": "1", "NUMBA_CACHE_DIR": "/tmp"},
)
client = cluster.get_client()
hammer commented 8 months ago

Back at it today! I managed to get my quotas up to launch a real monster of a machine, a c3d-highmem-360, with over 2 TB of RAM.

I was able to get to the sample_stats(ds) call in the GWAS Tutorial without incident. Unfortunately when I tried to examine a data variable computed by this method such as ds.sample_call_rate I got worker communication errors as well as errors from Dask about CPUs spending too much time in garbage collection.

A few thoughts:

jeromekelleher commented 8 months ago

Just knowing whether the association test is remotely feasible would be a great data point right now.

hammer commented 8 months ago

I went back to the r7i.8xlarge instance on EC2 with 32 vCPUs and 256 GB RAM and tried to run a minimal path to get to the linear regression and use the results to make a Manhattan plot (which uses the variant_linreg_p_value output of the regression). Remarkably given my prior failures this ran to completion in like 20 minutes, so I think the answer is yes? Memory usage peaked around 140 GiB or so, from what I saw. I have no idea how I avoided the UserWarning: Sending large graph of size I got last time though!

After this completed I tried to examine another output of the linear regression, variant_linreg_beta. This is not a large array, as it only has 153,988 values in it. When I called ds_lr.variant_linreg_beta.values, I was surprised that this too took about 20 minutes to complete. I'm not quite sure what is happening with Dask: does it need to re-run gwas_linear_regression again for some reason? Further, I can repeatedly call ds_lr.variant_linreg_beta.values and it takes ~20 minutes each time; something is happening that I do not understand.

So, it seems the answer is that it is feasible, but given Dask's lazy execution model I need to spend some time to understand how to isolate and profile the work done just for the linear regression.

jeromekelleher commented 8 months ago

I think a good goal here would be to try and reproduce the Manhattan plots in Fig s11 of the HAPNEST paper:

Screenshot from 2024-01-15 09-53-07

They did this with only 50K samples, though.

It's not totally obvious that they are using the same phenotypes here as they have in the released dataset, but we should see qualitatively similar patterns based on the heritability and polygenicity parameters.

I don't think there's much to learn from running QC steps here, as it's synthetic data. What we want to see/show is that we can get Manhattan plots out. Other QC steps are hopefully being illustrated by other parts of the paper.

If merging the datasets across the chromosomes is a pain, I think it's fine working chromosome by chromosome.

hammer commented 8 months ago

@jeromekelleher do you know anyone from the HAPNEST paper who we can ask to sanity check our findings given their generative model?

jeromekelleher commented 8 months ago

I know four of the authors, so yes, happy to ping them

eric-czech commented 8 months ago

it would be really valuable to have a sanity check on whatever we end up writing up here.

As far as the UKB replication goes, here is my recollection of what happened that would be worth keeping in mind:

  1. I was struggling to get a single chromosome GWAS to run ^1 and then had a small breakthrough after realizing how much faster the workflow was with short-fat chunks ^2
  2. I also regularly hit a number of intermittent issues ^3 that forced me to implement some retries in my own code like this as well as have to rebuild clusters somewhat regularly (maybe 10% of the time)
  3. Then I at least worked out that matmul in dask was the problem ^4 since it was scaling linearly (or worse) in memory usage with matrix size while using constant size chunks
  4. @ravwojdyla fixed this upstream in dask ^5
  5. When @tomwhite and I were working on this together, we noted that not chunking in the samples dimension was pretty key to keeping runtimes sane and getting in the ballpark of cost parity with Hail ^6
  6. @tomwhite summarized most things we had to do to keep moving forward with UKB nicely in ^7 and ^8
  7. I originally converted the UKB bgen data to Zarr with chunking in both dimensions and we never rechunked it to remove chunking in samples ^9
  8. Nevertheless, I was still able to get a few chromosomes to run successfully through the full workflow
  9. The main reproduction repo pipeline is quite difficult to setup and run, but there was this notebook (from our internal ukb-analysis repo) that I regularly used to speed up that process -- it's a much simpler demonstration of how get something somewhat close to the Neale Lab results on a single chromosome and sounds a lot like what you're doing @hammer
  10. The p-values I got weren't way off from p-values I was pulling from Open Targets sumstats, but they weren't strikingly similar either and you can see an example of that in the plot at the very end of the notebook
  11. I believe this issue contains at least a couple ideas I had on where differences might come from

Overall, I left this work with the impression that chunking in the samples dimension might be a nonstarter for really scaling GWAS to a UKB-sized dataset. I would recommend trying that @hammer.

I'm not sure what to say on the cluster issues you're hitting though .. I don't recognize any of those. I hit my fair share of similar problems with https://github.com/dask/dask-cloudprovider.

To some of your other issues @hammer:

gwas_linear_regression issued some warnings (PerformanceWarning: Increasing number of chunks) and returned right away, so I think it's not getting computed until the manhattan_plot call later?

That makes sense. It shouldn't compute anything until you reference one of the resulting variables.

I can repeatedly call ds_lr.variant_linreg_beta.values and it takes ~20 minutes each time; something is happening that I do not understand.

It's rerunning the whole regression. The best way to start working with the results, IIRC, is ds[['variant_linreg_beta', 'variant_linreg_p_value']] = ds[['variant_linreg_beta', 'variant_linreg_p_value']].compute(). That will compute both arrays simultaneously and put the materialized versions back in your dataset. You could also just save them in a new dataset like gwas_results = ds[['variant_linreg_beta', 'variant_linreg_p_value']].compute().

There are more examples of this in that UKB notebook I mentioned like this:

Screenshot 2024-01-25 at 4 49 29 PM

That progress bar is quite helpful too. You might find some other useful examples like that in there.