related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0

PyData prototype LD prune implementation #26

Closed eric-czech closed 4 years ago

eric-czech commented 4 years ago

To make this prototype more compelling, we should add implementations of some of the more challenging algorithms on the critical path. LD pruning is a great example of that.

eric-czech commented 4 years ago

I added an implementation using cuda.jit in tsgpu_backend.py. Some notes on this:

Specifics on the implementation:

hammer commented 4 years ago

It does not support bp windows

I'm a little confused by what this means. I see window = 50 passed as a parameter in your notebook. Is that parameter ignored right now?

eric-czech commented 4 years ago

Is that parameter ignored right now?

Nope, it's used but it corresponds to a window size in number of variants. This isn't an uncommon thing to do -- it's what Marees does in the tutorial paper and what scikit-allel does -- but UKBB and Hail use bp windows (PLINK provides either option). I'd rather support both though, and now we know it might actually be worth the effort to figure out how.
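For concreteness, here's a rough sketch of the two window semantics (function names and shapes are mine, not from tsgpu_backend.py): a variant-count window compares each variant to a fixed number of downstream neighbors, while a bp window compares it to every variant within a physical distance.

```python
def window_by_count(n_variants, window):
    """Pairs (i, j), j > i, compared under a fixed variant-count window."""
    return [(i, j) for i in range(n_variants)
            for j in range(i + 1, min(i + window + 1, n_variants))]

def window_by_bp(positions, window_bp):
    """Pairs (i, j), j > i, compared when base-pair positions (sorted,
    within one contig) are no more than window_bp apart."""
    pairs = []
    for i in range(len(positions)):
        j = i + 1
        while j < len(positions) and positions[j] - positions[i] <= window_bp:
            pairs.append((i, j))
            j += 1
    return pairs
```

The count-based window only needs the genotype array; the bp window additionally needs the position vector, which is part of why supporting both takes more plumbing.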

hammer commented 4 years ago

Ah okay, thanks for the clarification.

From the notebook:

A single non-chunked array is used until a good solution to dask#2403 makes it possible to do the block overlap required here.

What do we gain once dask#2403 is implemented? Why is array chunking useful here? Will we get faster runtimes due to on-GPU parallelism, or can we use GPUs with less on-GPU memory because we can page blocks in and out, or something else?

Sorry for the dumb questions, I should probably make more time to understand these algorithms better given how much you've written on them already.

eric-czech commented 4 years ago

What do we gain once dask#2403 is implemented?

That one is just about the correctness of the algorithm, not any kind of performance thing. If you imagine a chunked array that is only chunked horizontally (i.e. tall and skinny), then the rows near the bottom of any one block need to be compared to the rows in the top of the block below it since the window spans blocks. The map_overlap function is perfect for sharing some information between blocks near the boundaries, but it only does it for a single array. In this case we need the genotype calls as well as contigs, position, and MAF (or some other score) column vectors to be overlapped too since they're all part of the comparison logic.
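To illustrate the boundary-sharing idea (this is a toy NumPy sketch of the concept, not dask's actual implementation): each block is padded with `depth` rows from its neighbors before the function runs, and the padding is trimmed from the result.

```python
import numpy as np

def map_overlap_1d(blocks, depth, func):
    """Apply func to each block padded with `depth` rows from its
    neighbors, then trim the padding from each result."""
    out = []
    for k, block in enumerate(blocks):
        lo = blocks[k - 1][-depth:] if k > 0 else block[:0]
        hi = blocks[k + 1][:depth] if k < len(blocks) - 1 else block[:0]
        padded = np.concatenate([lo, block, hi])
        result = func(padded)
        start = len(lo)  # drop the rows that belong to the neighbor
        out.append(result[start:start + len(block)])
    return np.concatenate(out)

# a windowed op now sees across the block boundary:
x = np.arange(8, dtype=float)
blocks = [x[:4], x[4:]]
f = lambda a: np.convolve(a, np.ones(3), mode="same")
assert np.allclose(map_overlap_1d(blocks, 1, f), f(x))
```

dask's `map_overlap` does this padding and trimming per block automatically, but only for a single array, which is why the extra contig/position/MAF vectors are the sticking point tracked in dask#2403.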

I should probably make more time to understand these algorithms

No worries, the two minute explanation probably saves a lot of time and it's good to have these little blurbs somewhere IMO

eric-czech commented 4 years ago

Oh but it does definitely mean that if the arrays are chunked, then we can process big datasets by passing individual blocks to a GPU (so we can size the blocks to fit GPU memory). Sorry, I think I breezed past your point there -- this is definitely crucial for making this algo work for any data bigger than GPU memory.
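As a back-of-the-envelope sketch (the helper and the numbers here are hypothetical, just to show the arithmetic), sizing blocks against a GPU memory budget is straightforward:

```python
def max_rows_per_block(gpu_mem_bytes, n_cols, itemsize=1, overhead=0.5):
    """Rows of a (rows, n_cols) genotype block that fit in a GPU memory
    budget, reserving an `overhead` fraction for intermediates."""
    usable = gpu_mem_bytes * (1 - overhead)
    return int(usable // (n_cols * itemsize))

# e.g. a 16 GiB card, 500k samples as int8, half reserved for scratch
print(max_rows_per_block(16 * 1024**3, 500_000))  # → 17179
```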

hammer commented 4 years ago

the window spans blocks

Got it. We had a similar issue when a single MapReduce record was split across 2 HDFS blocks, e.g. https://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries.

eric-czech commented 4 years ago

Some updates on this:

eric-czech commented 4 years ago

I also added a notebook running LD prune on simulated UKBB data. It was easy to get a representative sampling in this case by concatenating row vectors from 1KG, since LD is preserved that way. It also runs only on chr 8 variants near the 5' telomere, which is more dense than any other chromosome. Altogether, it looks like it would take at most ~22 hours on my workstation (equating to about 352 single-CPU hours).

eric-czech commented 4 years ago

This is mostly done now, though there are several areas for optimization that should be pursued:

  1. Pre-compute squared vectors for LD calcs in loops where it's easy to do so
  2. Find a way to use some BLAS/LAPACK implementation to do the LD calcs in batches
    • Right now, I'm using the scikit-allel definition of R2, which has some special considerations for missing values but is, afaict, equivalent to Pearson correlation if the rows were normalized first
  3. Try out numba optimizations like fastmath and parallel
    • fastmath activates special Intel SVML optimizations if the icc_rt package is installed
    • fastmath enables instruction pipelining, so presumably this is an alternative to converting the calculations to something that uses a native BLAS/LAPACK implementation

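
As an illustration of that equivalence point (this is a sketch of the idea, not the actual scikit-allel code), an R2 that drops entries where either genotype is missing reduces to squared Pearson correlation on the complete entries:

```python
import numpy as np

def r2_with_missing(x, y, missing=-1):
    """R2 between two genotype vectors, ignoring entries where either
    value is missing (coded `missing`); this is just squared Pearson
    correlation over the jointly non-missing entries."""
    m = (x != missing) & (y != missing)
    r = np.corrcoef(x[m], y[m])[0, 1]
    return r * r
```

Because the missing-value mask differs per pair, a naive batched BLAS formulation isn't directly applicable, which is what makes item 2 above non-trivial.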
eric-czech commented 4 years ago

Well I tried the fastmath option at least and it makes a huge difference on our current R2 function (it's a great fit for instruction pipelining). Here is a notebook trying this as well as the parallel option. It's about 7x faster with that flag turned on.
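For reference, turning those flags on is just decorator arguments. This is a generic sketch rather than the actual R2 kernel, and the pure-Python fallback exists only so the example runs without numba installed:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch runs without numba
    def njit(**kwargs):
        return lambda f: f
    prange = range

# fastmath relaxes strict IEEE semantics so reductions can pipeline;
# parallel distributes the prange loop across threads
@njit(fastmath=True, parallel=True)
def row_sums(x):
    out = np.empty(x.shape[0])
    for i in prange(x.shape[0]):
        s = 0.0
        for j in range(x.shape[1]):
            s += x[i, j]
        out[i] = s
    return out
```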

Unfortunately, that 7x improvement only translates to a 2x speedup on UKBB/1KG LD prune/matrix benchmarks, but that does mean I can safely say it should only take around 11 hours to prune UKBB on one machine. There must be a good number of other things going on in the ld_matrix function that are relatively slow compared to the R2 calculation part.
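Those two numbers are consistent with Amdahl's law if the R2 calculation accounted for roughly 7/12 (~58%) of the original runtime (my inference, not a measurement):

```python
def overall_speedup(p, s):
    """Amdahl's law: overall speedup when a fraction p of the
    runtime is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# a 7x kernel speedup on ~58% of the runtime yields ~2x overall
print(overall_speedup(7 / 12, 7))  # → 2.0
```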

eric-czech commented 4 years ago

Closing this now as it and the upstream parts are correct and fairly well documented/tested. Any more comprehensive testing should follow from https://github.com/related-sciences/gwas-analysis/issues/31, and I'll leave off any further optimization until it seems necessary. To reiterate, the biggest remaining optimization by far would be to simply skip R2 calculations in a window once a single variant pair above the threshold is found (as Hail does), at the cost of results that differ based on the partitioning and are therefore somewhat questionable.
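For the record, that early-exit strategy could be sketched like this (hypothetical names; `np.corrcoef` stands in for the real R2 with missing-value handling):

```python
import numpy as np

def prune_window_early_exit(gt, r2_threshold):
    """Greedy within-window prune with early exit: as soon as variant j
    exceeds the threshold against any kept variant, it is dropped and
    no further pairs involving j are scored."""
    n = gt.shape[0]
    keep = []
    for j in range(n):
        drop = False
        for i in keep:
            r = np.corrcoef(gt[i], gt[j])[0, 1]
            if r * r > r2_threshold:
                drop = True
                break  # the early exit: skip remaining comparisons for j
        if not drop:
            keep.append(j)
    return keep
```

The skipped comparisons are what make the result depend on how variants land in partitions, which is the "questionable" part noted above.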

eric-czech commented 4 years ago

For reference, downstream of some moderate level of variant QC, we can expect variant counts for window sizes like this:

[Screenshot from 2020-05-25: table of expected variant counts by window size]