related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0

PyData prototype LD prune implementation #26

Closed eric-czech closed 4 years ago

eric-czech commented 4 years ago

To make this prototype more compelling, we should add implementations of some of the more challenging algorithms on the critical path. LD pruning is a great example of that.

eric-czech commented 4 years ago

I added an implementation using cuda.jit in tsgpu_backend.py. Some notes on this:

Specifics on the implementation:

hammer commented 4 years ago

It does not support bp windows

I'm a little confused by what this means. I see window = 50 passed as a parameter in your notebook. Is that parameter ignored right now?

eric-czech commented 4 years ago

Is that parameter ignored right now?

Nope, it's used but it corresponds to a window size in number of variants. This isn't an uncommon thing to do -- it's what Marees does in the tutorial paper and what scikit-allel does -- but UKBB and Hail use bp windows (PLINK provides either option). I'd rather support both though, and now we know it might actually be worth the effort to figure out how.
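For concreteness, here's a rough sketch of the two window semantics (function names and shapes are mine, not from tsgpu_backend.py): a variant-count window compares each variant to a fixed number of downstream neighbors, while a bp window compares it to every variant within a physical distance.

```python
def window_by_count(n_variants, window):
    """Pairs (i, j), j > i, compared under a fixed variant-count window."""
    return [(i, j) for i in range(n_variants)
            for j in range(i + 1, min(i + window + 1, n_variants))]

def window_by_bp(positions, window_bp):
    """Pairs (i, j), j > i, compared when base-pair positions (sorted,
    within one contig) are no more than window_bp apart."""
    pairs = []
    for i in range(len(positions)):
        j = i + 1
        while j < len(positions) and positions[j] - positions[i] <= window_bp:
            pairs.append((i, j))
            j += 1
    return pairs
```

The count-based window only needs the genotype array; the bp window additionally needs the position vector, which is part of why supporting both takes more plumbing.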

hammer commented 4 years ago

Ah okay, thanks for the clarification.

From the notebook:

A single non-chunked array is used until a good solution to dask#2403 makes it possible to do the block overlap required here.

What do we gain once dask#2403 is implemented? Why is array chunking useful here? Will we get faster runtimes due to on-GPU parallelism, or can we use GPUs with less on-GPU memory because we can page blocks in and out, or something else?

Sorry for the dumb questions, I should probably make more time to understand these algorithms better given how much you've written on them already.

eric-czech commented 4 years ago

What do we gain once dask#2403 is implemented?

That one is just about the correctness of the algorithm, not any kind of performance thing. If you imagine a chunked array that is only chunked horizontally (i.e. tall and skinny), then the rows near the bottom of any one block need to be compared to the rows in the top of the block below it since the window spans blocks. The map_overlap function is perfect for sharing some information between blocks near the boundaries, but it only does it for a single array. In this case we need the genotype calls as well as contigs, position, and MAF (or some other score) column vectors to be overlapped too since they're all part of the comparison logic.
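To illustrate the boundary-sharing idea (this is a toy NumPy sketch of the concept, not dask's actual implementation): each block is padded with `depth` rows from its neighbors before the function runs, and the padding is trimmed from the result.

```python
import numpy as np

def map_overlap_1d(blocks, depth, func):
    """Apply func to each block padded with `depth` rows from its
    neighbors, then trim the padding from each result."""
    out = []
    for k, block in enumerate(blocks):
        lo = blocks[k - 1][-depth:] if k > 0 else block[:0]
        hi = blocks[k + 1][:depth] if k < len(blocks) - 1 else block[:0]
        padded = np.concatenate([lo, block, hi])
        result = func(padded)
        start = len(lo)  # drop the rows that belong to the neighbor
        out.append(result[start:start + len(block)])
    return np.concatenate(out)

# a windowed op now sees across the block boundary:
x = np.arange(8, dtype=float)
blocks = [x[:4], x[4:]]
f = lambda a: np.convolve(a, np.ones(3), mode="same")
assert np.allclose(map_overlap_1d(blocks, 1, f), f(x))
```

dask's `map_overlap` does this padding and trimming per block automatically, but only for a single array, which is why the extra contig/position/MAF vectors are the sticking point tracked in dask#2403.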

I should probably make more time to understand these algorithms

No worries, the two minute explanation probably saves a lot of time and it's good to have these little blurbs somewhere IMO

eric-czech commented 4 years ago

Oh but it does definitely mean that if the arrays are chunked, then we can process big datasets by passing individual blocks to a GPU (so we can size the blocks to fit GPU memory). Sorry, I think I breezed past your point there -- this is definitely crucial for making this algo work for any data bigger than GPU memory.
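As a back-of-the-envelope sketch (the helper and the numbers here are hypothetical, just to show the arithmetic), sizing blocks against a GPU memory budget is straightforward:

```python
def max_rows_per_block(gpu_mem_bytes, n_cols, itemsize=1, overhead=0.5):
    """Rows of a (rows, n_cols) genotype block that fit in a GPU memory
    budget, reserving an `overhead` fraction for intermediates."""
    usable = gpu_mem_bytes * (1 - overhead)
    return int(usable // (n_cols * itemsize))

# e.g. a 16 GiB card, 500k samples as int8, half reserved for scratch
print(max_rows_per_block(16 * 1024**3, 500_000))  # → 17179
```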

hammer commented 4 years ago

the window spans blocks

Got it. We had a similar issue when a single MapReduce record was split across 2 HDFS blocks, e.g. https://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries.

eric-czech commented 4 years ago

Some updates on this:

eric-czech commented 4 years ago

I also added a notebook running LD prune on simulated UKBB data. It was easy to get a representative sampling in this case by concatenating row vectors from 1KG, since LD is preserved that way. It also runs only on chr 8 variants near the 5' telomere, which is more dense than any other chromosome. Altogether, it looks like it would take at most ~22 hours on my workstation (equating to about 352 single-CPU hours).

eric-czech commented 4 years ago

This is mostly done now, though there are several areas for optimization that should be pursued:

  1. Pre-compute squared vectors for LD calcs in loops where it's easy to do so
  2. Find a way to use some BLAS/LAPACK implementation to do the LD calcs in batches
    • Right now, I'm using the scikit-allel definition of R2, which has some special considerations for missing values but is, afaict, equivalent to Pearson correlation if the rows were normalized first
  3. Try out numba optimizations like fastmath and parallel
    • fastmath activates special Intel SVML optimizations if the icc_rt package is installed
    • fastmath enables instruction pipelining, so presumably this is an alternative to converting the calculations to something that uses a native BLAS/LAPACK implementation

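
As an illustration of that equivalence point (this is a sketch of the idea, not the actual scikit-allel code), an R2 that drops entries where either genotype is missing reduces to squared Pearson correlation on the complete entries:

```python
import numpy as np

def r2_with_missing(x, y, missing=-1):
    """R2 between two genotype vectors, ignoring entries where either
    value is missing (coded `missing`); this is just squared Pearson
    correlation over the jointly non-missing entries."""
    m = (x != missing) & (y != missing)
    r = np.corrcoef(x[m], y[m])[0, 1]
    return r * r
```

Because the missing-value mask differs per pair, a naive batched BLAS formulation isn't directly applicable, which is what makes item 2 above non-trivial.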
eric-czech commented 4 years ago

Well I tried the fastmath option at least and it makes a huge difference on our current R2 function (it's a great fit for instruction pipelining). Here is a notebook trying this as well as the parallel option. It's about 7x faster with that flag turned on.
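For reference, turning those flags on is just decorator arguments. This is a generic sketch rather than the actual R2 kernel, and the pure-Python fallback exists only so the example runs without numba installed:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch runs without numba
    def njit(**kwargs):
        return lambda f: f
    prange = range

# fastmath relaxes strict IEEE semantics so reductions can pipeline;
# parallel distributes the prange loop across threads
@njit(fastmath=True, parallel=True)
def row_sums(x):
    out = np.empty(x.shape[0])
    for i in prange(x.shape[0]):
        s = 0.0
        for j in range(x.shape[1]):
            s += x[i, j]
        out[i] = s
    return out
```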

Unfortunately, that 7x improvement only translates to a 2x speedup on UKBB/1KG LD prune/matrix benchmarks, but that does mean I can safely say it should only take around 11 hours to prune UKBB on one machine. There must be a good number of other things going on in the ld_matrix function that are relatively slow compared to the R2 calculation part.
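Those two numbers are consistent with Amdahl's law if the R2 calculation accounted for roughly 7/12 (~58%) of the original runtime (my inference, not a measurement):

```python
def overall_speedup(p, s):
    """Amdahl's law: overall speedup when a fraction p of the
    runtime is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# a 7x kernel speedup on ~58% of the runtime yields ~2x overall
print(overall_speedup(7 / 12, 7))  # → 2.0
```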

eric-czech commented 4 years ago

Closing this now as it and the upstream parts are correct and fairly well documented/tested. Any more comprehensive testing should follow from https://github.com/related-sciences/gwas-analysis/issues/31, and I'll leave off any further optimization until it seems necessary. To reiterate, the biggest remaining optimization by far would be to simply skip R2 calculations in a window once a single variant pair above the threshold is found (as Hail does), at the cost of results that differ based on the partitioning and are therefore somewhat questionable.
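For the record, that early-exit strategy could be sketched like this (hypothetical names; `np.corrcoef` stands in for the real R2 with missing-value handling):

```python
import numpy as np

def prune_window_early_exit(gt, r2_threshold):
    """Greedy within-window prune with early exit: as soon as variant j
    exceeds the threshold against any kept variant, it is dropped and
    no further pairs involving j are scored."""
    n = gt.shape[0]
    keep = []
    for j in range(n):
        drop = False
        for i in keep:
            r = np.corrcoef(gt[i], gt[j])[0, 1]
            if r * r > r2_threshold:
                drop = True
                break  # the early exit: skip remaining comparisons for j
        if not drop:
            keep.append(j)
    return keep
```

The skipped comparisons are what make the result depend on how variants land in partitions, which is the "questionable" part noted above.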

eric-czech commented 4 years ago

For reference, downstream of some moderate level of variant QC, we can expect variant counts for window sizes like this:

[Screenshot from 2020-05-25: table of expected variant counts by window size]