sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Implement gene set association analysis #692

Open eric-czech opened 3 years ago

eric-czech commented 3 years ago

We have methods for associating individual genetic variants with phenotypes (https://github.com/pystatgen/sgkit/pull/16, https://github.com/pystatgen/sgkit/pull/66), but inferring associations with gene sets and pathways would also be very useful.

In principle, these methods are fairly simple and nearly always appeal to a multivariate regression model of a gene set's individual SNPs against a phenotype (or to approximations of this based on gene-level or summary statistic data). Correlation among the SNPs and among the samples, rare occurrences of variants and phenotypes, gene length, and pathway size are all complicating factors, though, which has led to a fairly long history of ever-improving methods to correct for these things.

Methods in this space support computing gene set statistics from individual phenotypes, variant-level sumstats, gene-level sumstats or some combination of those. For the sumstats cases, some reference LD panel is usually required.

Here are some links from @hammer on approaches to this problem:

Old Methods

New Methods

Application

Being a little familiar with some of these, I think the easiest to implement is gene-ε (https://github.com/ramachandran-lab/genee). This requires variant-level sumstats and an LD matrix, but it only works for quantitative traits and produces gene-level sumstats. It could be a useful building block though, since some of the gene set methods that become available once you have gene-level sumstats look like they might be manageable (e.g. https://github.com/diptavo/GAUSS).

eric-czech commented 2 years ago

For aggregating variant sumstats to the gene level, one option is PASCAL (2016). This requires an LD matrix and variant sumstats, but the calculation using it looks pretty straightforward.

eric-czech commented 2 years ago

FYI dropped a line to Lorin at https://github.com/ramachandran-lab/genee/issues/2

tomwhite commented 2 years ago

I started looking at implementing gene-ε in Python, beginning with a simple port of the R code to use Pandas and scikit-learn. I used GaussianMixture as the Python equivalent to R's mclust, and chiscore's liu_sf function as an alternative to the Imhof function in R. (I did find this implementation of Imhof in Python; however, it is not available under an open source license. I was pleased to find the alternative in chiscore, and was pleasantly surprised to find it was written by @horta!)
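
For anyone following along, here is a minimal sketch of how those two substitutions could fit together. The placeholder data, the choice of the null mixture component, and the gene-level statistic are illustrative assumptions rather than the actual port:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from chiscore import liu_sf

# Placeholder genome-wide shrunken effect sizes.
betas = np.random.default_rng(0).normal(0.0, 0.01, size=5000)

# mclust equivalent: fit a 1-D Gaussian mixture and take the variance of one
# component as the null ("epsilon") variance. Which component is selected is a
# detail of the real method that is glossed over here.
gmm = GaussianMixture(n_components=2, random_state=0).fit(betas.reshape(-1, 1))
epsilon = float(gmm.covariances_.min())

# Imhof equivalent: tail probability of a weighted sum of chi-squares, with the
# weights derived from the eigenvalues of the gene's LD submatrix and epsilon.
ld_gene = np.eye(10)                     # placeholder LD submatrix for one gene
weights = epsilon * np.linalg.eigvalsh(ld_gene)
test_q = float(np.sum(betas[:10] ** 2))  # placeholder gene-level statistic
res = liu_sf(test_q, weights, np.ones_like(weights), np.zeros_like(weights))
pval = res[0]  # first element of the returned tuple is the survival-function value
```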

Running on the simulated dataset, I managed to get a decent match with the equivalent R outputs. The simulated data has 100 genes, and there are three outputs: test_q, q_var, and pval.

To migrate this to sgkit, I think the following steps are needed:

I think there may be some overlap with #657 too, since GFF3 would be a natural way to supply gene definitions. The real example in gene-ε doesn't use GFF3 however, so it may not be needed immediately.

@eric-czech, @hammer does this sound like a good plan? I'd be interested to hear your thoughts.

eric-czech commented 2 years ago

Sounds great @tomwhite, it will be awesome having that already for validation/testing!

Before I responded, I wanted to get a better sense of how the method works. Here is an outline based on a deeper read of the code/paper. An important feature of the method is that any mention of "gene" below is interchangeable with an arbitrary grouping of variants (e.g. gene sets, protein domains, etc.):

FWIW the first way I saw to implement this would be using only the DataFrame API (and w/o any kind of windowing functions):

  1. Compute an LD matrix using sgkit (w/o a threshold, but big windows would be fine I imagine), or pull one from elsewhere as a DataFrame
  2. Add a column to that data frame indicating the variant group index (a join on i, noting that the variant could be repeated if it's in multiple groups)
  3. Load some variant-level sumstats and run the regression and gaussian mixture step to deflate them (genome-wide) as well as compute the epsilon_effect (a scalar)
  4. Add a column to the LD data frame with the deflated statistic for each variant (another join on i)
  5. Group the data frame by variant group and apply the remaining test statistic code by group (passing in the scalar epsilon)
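
To make steps 2–5 concrete, here is a rough sketch with Dask DataFrames. The long-format LD column names (i, j, value), the group/effect mappings, and the per-group statistic are assumptions/placeholders rather than the actual implementation:

```python
import dask.dataframe as dd
import pandas as pd

# Step 1: long-format LD matrix with one row per variant pair
# (column names i, j, value assumed here).
ld = dd.from_pandas(
    pd.DataFrame({"i": [0, 0, 1], "j": [0, 1, 1], "value": [1.0, 0.3, 1.0]}),
    npartitions=1,
)

# User-supplied mappings: variant index -> group and variant index -> deflated effect.
groups = dd.from_pandas(pd.DataFrame({"i": [0, 1], "group": [0, 0]}), npartitions=1)
betas = dd.from_pandas(pd.DataFrame({"i": [0, 1], "beta_deflated": [0.01, -0.02]}), npartitions=1)

# Step 3 output: scalar epsilon from the genome-wide regression/mixture step (placeholder value).
epsilon = 1e-4

# Steps 2 and 4: join the group index and the deflated effects onto the LD rows.
df = ld.merge(groups, on="i").merge(betas, on="i")

# Step 5: apply the per-group test statistic. The body below is a placeholder;
# the real code would rebuild the group's LD submatrix from (i, j, value) and
# compute test_q, q_var and pval using epsilon.
def group_stat(pdf: pd.DataFrame, eps: float) -> pd.Series:
    b = pdf.drop_duplicates(subset="i")["beta_deflated"]
    return pd.Series({"test_q": float((b ** 2).sum()), "epsilon": eps})

result = (
    df.groupby("group")
    .apply(group_stat, epsilon, meta={"test_q": "f8", "epsilon": "f8"})
    .compute()
)
```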

Some kind of co-group operator for combining the variant groups, sumstats and LD matrix would definitely be more elegant (maybe via Dask Bags?), but the repetition of a few ints/floats (epsilon, deflated effect, variant group) in the DataFrame likely wouldn't be a big deal.

How does that sound to you? Then all the method needs is a variant index -> group(s) mapping and a variant index -> sumstats effect mapping. Personally, I think it would be fine to assume the user generates those for now without worrying about formalizing them much.


pval is within 5% for 98 out of the 100 genes; the other two had pval < 10^-8. The differences are probably due to the use of Liu not Imhof.

Noting from our call w/ @lorinanthony yesterday that differences in the tails are expected between methods like this and that we would probably be ok to use any python equivalent for the methods in the R CompQuadForm library (or at least I think that's the one).

Gaussian mixture model. I think we can use scikit-learn for this step

100% agreed.

In the context of sgkit, I hope we can use the ridge regression implementation from our regenie implementation.

🤔 I'm not so sure about that one. Fig. 3 in the paper doesn't look very promising for ridge solutions, and regenie would only be an approximate ridge solution (so presumably even worse). The software/paper recommend lasso regression. In this case the features (at least if they come from ld_matrix) would form an upper band matrix, so maybe there's some way to exploit that? I'm not sure where to go on this one other than maybe to convert the whole LD matrix DataFrame to some scipy sparse COO matrix and run sklearn Lasso on it. Have any other ideas?

tomwhite commented 2 years ago

Thanks @eric-czech, great summary!

I agree that it's worth trying out an approach using the DataFrame API and I'll try that first. I'd also be curious to see what a windowing approach looks like (and how it compares to a dataframe approach in terms of performance).

The regression part is still the biggest unknown. The paper talks about using a mix of ridge and lasso, which is ElasticNet in sklearn (glmnet in R). I think we can try using that, with sparse matrices, as you suggest. (ElasticNet converts any sparse matrices to CSC form, so we could convert the LD dataframe to that format.) The paper talks about partitioning by chromosome, so that is worth doing too for larger datasets. Do you have a sense for the size of the LD matrix, and its sparsity for the data you are working with?
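
As a quick sanity check that the sparse route hangs together, here is a small sketch. The triples, the penalty settings (alpha, l1_ratio), and the interpretation of the coefficients as the shrunken effects are illustrative assumptions, not the values or semantics used by genee:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import ElasticNet

# Build the LD matrix as a sparse COO matrix from long-format (i, j, value)
# triples and convert to CSC, the format ElasticNet uses internally.
i = np.array([0, 0, 1, 1, 2])
j = np.array([0, 1, 1, 2, 2])
r = np.array([1.0, 0.3, 1.0, 0.2, 1.0])
ld = sp.coo_matrix((r, (i, j)), shape=(3, 3)).tocsc()

# Marginal (standardized) effect sizes for the same variants; placeholder values.
beta = np.array([0.01, -0.02, 0.005])

# Mixed ridge/lasso penalty, as with glmnet. Note that unlike glmnet, sklearn's
# ElasticNet does not standardize the design matrix for you.
model = ElasticNet(alpha=1e-3, l1_ratio=0.5, fit_intercept=False)
model.fit(ld, beta)
beta_shrunk = model.coef_  # regularized effects (the "deflated" betas in this sketch)
```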

On the topic of sparsity, I noticed that the R implementation of genee standardizes the LD matrix. (The calls to glmnet do not include a standardize parameter, and it is TRUE by default.) What's more, the biglasso package doesn't have a way to not standardize the data.

However, I wonder if the LD matrix needs to be standardized, since it is a correlation matrix with values between -1 and 1, which means that the values are all on the same scale and mutually comparable. Of course, if we did standardize it, we would turn a sparse matrix into a dense one, which would cause memory issues.

(The betas do need to be standardized, but that is easy to do and has no memory issues.)

Looking at alternative implementations, Dask has a GLM package which includes an implementation of ElasticNet, but the page says "This library is not ready for use", and it appears to be unmaintained. So let's see how far we can get with sklearn.

eric-czech commented 2 years ago

Do you have a sense for the size of the LD matrix, and its sparsity for the data you are working with?

At the very upper end of that spectrum you'd have something like the full UKBB LD matrices from massive windows like 10 Mbp reaching several (or >10) TB in size [1].

As far as sparsity goes, some estimates I had from ld_prune_ukbb_sim.ipynb showed that there are about 300 variants in a 1 Mbp window in 1KG data. A lower bound estimate would be more like 50 variants in a 100 Kbp window (from HapMap in ld_matrix.ipynb).

The paper talks about partitioning by chromosome, so that is worth doing too for larger datasets

I wonder if there's a way to do the deflation step using LD scores (the sum of r2 for a variant across all other variants)? Those would be manageable to create on our own, or to use pre-computed (as the Broad team does). If that's possible, then we may want to consider the deflation step as external to our gene-ε implementation. We'd still need the LD matrix, but operating on it only in gene- or gene-set-specific groups would certainly be less expensive.
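
For reference, computing LD scores from a long-format LD DataFrame is only a couple of lines (column names i, j, value assumed, as in the sketch above):

```python
import pandas as pd

# One row per variant pair; the LD score of variant i is the sum of r^2 over
# all other variants j (the diagonal is excluded in these placeholder rows).
ld = pd.DataFrame({"i": [0, 0, 1, 1], "j": [1, 2, 0, 2], "value": [0.3, 0.1, 0.3, 0.2]})
ld_scores = ld.assign(r2=ld["value"] ** 2).groupby("i")["r2"].sum()
```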

However, I wonder if the LD matrix needs to be standardized, since it is a correlation matrix with values between -1 and 1

Great question -- that would certainly affect the deflated estimates. I'll ping Lorin and Wei in a shorter comment.

So let's see how far we can get with sklearn

👍

eric-czech commented 2 years ago

Hey @lorinanthony or @weichengv, what effect does standardization (of the LD matrix) in the glmnet/lasso step of gene-ε have? That would certainly alter the resulting deflated statistics, so we're wondering whether or not we should do it.

lorinanthony commented 2 years ago

Hey @eric-czech, you're absolutely right -- that would alter the resulting set of shrunken effect sizes. For that reason, I wouldn't standardize the LD matrix. Also, the LD matrix is symmetric, and the traditional way of standardizing might be problematic here.

tomwhite commented 2 years ago

Thanks @lorinanthony for confirming that!

@eric-czech I have implemented your suggestion using Dask dataframes here: https://github.com/tomwhite/sgkit/blob/genee/validation/gwas/method/genee/test_genee_dask.py. It still doesn't include a regression step.

tomwhite commented 2 years ago

So let's see how far we can get with sklearn.

I created this notebook to investigate: https://nbviewer.org/github/tomwhite/sgkit/blob/genee/validation/gwas/method/genee/genee-ld-simulation.ipynb

TLDR: memory is the limiting factor, but we can get quite a long way (millions of variants) by using GCP machines with >1TB of memory.

Interested to hear others' thoughts, as I may well have missed something.

eric-czech commented 1 year ago

From @lorinanthony on a minimum number of SNPs required per gene/region for tests:

Genes with only 1 SNP within their boundary were excluded