sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Implement gene set association analysis #692

Open eric-czech opened 3 years ago

eric-czech commented 3 years ago

We have methods for associating individual genetic variants with phenotypes (https://github.com/pystatgen/sgkit/pull/16, https://github.com/pystatgen/sgkit/pull/66), but inferring associations with gene sets and pathways would also be very useful.

In principle, these methods are fairly simple and nearly always appeal to a multivariate regression model of a gene set's individual SNPs against a phenotype (or to approximations of this based on gene-level or summary statistic data). Correlation among the SNPs and among the samples, rare occurrences of variants and phenotypes, gene length, and pathway size are all complicating factors, though, which has led to a fairly long history of ever-improving methods to correct for these things.

Methods in this space support computing gene set statistics from individual phenotypes, variant-level sumstats, gene-level sumstats or some combination of those. For the sumstats cases, some reference LD panel is usually required.

Here are some links from @hammer on approaches to this problem:

Old Methods

New Methods

Application

Being a little familiar with some of these, I think the easiest to implement is gene-ε (https://github.com/ramachandran-lab/genee). This requires variant-level sumstats and an LD matrix, but it only works for quantitative traits and produces gene-level sumstats. It could be a useful building block though, since some of the gene set methods that become available once you have gene-level sumstats look like they might be manageable (e.g. https://github.com/diptavo/GAUSS).

eric-czech commented 2 years ago

For aggregating variant sumstats to the gene level, one option is PASCAL (2016). This requires an LD matrix and variant sumstats, but the calculation using it looks pretty straightforward.

eric-czech commented 2 years ago

FYI dropped a line to Lorin at https://github.com/ramachandran-lab/genee/issues/2

tomwhite commented 2 years ago

I started looking at implementing gene-ε in Python, beginning with a simple port of the R code to use Pandas and scikit-learn. I used GaussianMixture as the Python equivalent to R's mclust, and chiscore's liu_sf function as an alternative to the Imhof function in R. (I did find this implementation of Imhof in Python; however, it is not available under an open source license. I was pleased to find the alternative in chiscore, and was pleasantly surprised to find it was written by @horta!)
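
For anyone following along, here is a minimal sketch of how those two substitutions could fit together. The placeholder data, the choice of the null mixture component, and the gene-level statistic are illustrative assumptions rather than the actual port:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from chiscore import liu_sf

# Placeholder genome-wide shrunken effect sizes.
betas = np.random.default_rng(0).normal(0.0, 0.01, size=5000)

# mclust equivalent: fit a 1-D Gaussian mixture and take the variance of one
# component as the null ("epsilon") variance. Which component is selected is a
# detail of the real method that is glossed over here.
gmm = GaussianMixture(n_components=2, random_state=0).fit(betas.reshape(-1, 1))
epsilon = float(gmm.covariances_.min())

# Imhof equivalent: tail probability of a weighted sum of chi-squares, with the
# weights derived from the eigenvalues of the gene's LD submatrix and epsilon.
ld_gene = np.eye(10)                     # placeholder LD submatrix for one gene
weights = epsilon * np.linalg.eigvalsh(ld_gene)
test_q = float(np.sum(betas[:10] ** 2))  # placeholder gene-level statistic
res = liu_sf(test_q, weights, np.ones_like(weights), np.zeros_like(weights))
pval = res[0]  # first element of the returned tuple is the survival-function value
```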

Running on the simulated dataset, I managed to get a decent match with the equivalent R outputs. The simulated data has 100 genes, and there are three outputs: test_q, q_var, and pval.

To migrate this to sgkit, I think the following steps are needed:

I think there may be some overlap with #657 too, since GFF3 would be a natural way to supply gene definitions. The real example in gene-ε doesn't use GFF3 however, so it may not be needed immediately.

@eric-czech, @hammer does this sound like a good plan? I'd be interested to hear your thoughts.

eric-czech commented 2 years ago

Sounds great @tomwhite, it will be awesome having that already for validation/testing!

Before I responded, I wanted to get a better sense of how the method works. Here is an outline based on a deeper read of the code/paper. An important feature of the method is that any mention of "gene" below is interchangeable with an arbitrary grouping of variants (e.g. gene sets, protein domains, etc.):

FWIW the first way I saw to implement this would be using only the DataFrame API (and w/o any kind of windowing functions):

  1. Compute an LD matrix using sgkit (w/o a threshold, but big windows would be fine I imagine), or pull one from elsewhere as a DataFrame
  2. Add a column to that data frame indicating the variant group index (a join on i, noting that the variant could be repeated if it's in multiple groups)
  3. Load some variant-level sumstats and run the regression and gaussian mixture step to deflate them (genome-wide) as well as compute the epsilon_effect (a scalar)
  4. Add a column to the LD data frame with the deflated statistic for each variant (another join on i)
  5. Group the data frame by variant group and apply the remaining test statistic code by group (passing in the scalar epsilon)
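
To make steps 2–5 concrete, here is a rough sketch with Dask DataFrames. The long-format LD column names (i, j, value), the group/effect mappings, and the per-group statistic are assumptions/placeholders rather than the actual implementation:

```python
import dask.dataframe as dd
import pandas as pd

# Step 1: long-format LD matrix with one row per variant pair
# (column names i, j, value assumed here).
ld = dd.from_pandas(
    pd.DataFrame({"i": [0, 0, 1], "j": [0, 1, 1], "value": [1.0, 0.3, 1.0]}),
    npartitions=1,
)

# User-supplied mappings: variant index -> group and variant index -> deflated effect.
groups = dd.from_pandas(pd.DataFrame({"i": [0, 1], "group": [0, 0]}), npartitions=1)
betas = dd.from_pandas(pd.DataFrame({"i": [0, 1], "beta_deflated": [0.01, -0.02]}), npartitions=1)

# Step 3 output: scalar epsilon from the genome-wide regression/mixture step (placeholder value).
epsilon = 1e-4

# Steps 2 and 4: join the group index and the deflated effects onto the LD rows.
df = ld.merge(groups, on="i").merge(betas, on="i")

# Step 5: apply the per-group test statistic. The body below is a placeholder;
# the real code would rebuild the group's LD submatrix from (i, j, value) and
# compute test_q, q_var and pval using epsilon.
def group_stat(pdf: pd.DataFrame, eps: float) -> pd.Series:
    b = pdf.drop_duplicates(subset="i")["beta_deflated"]
    return pd.Series({"test_q": float((b ** 2).sum()), "epsilon": eps})

result = (
    df.groupby("group")
    .apply(group_stat, epsilon, meta={"test_q": "f8", "epsilon": "f8"})
    .compute()
)
```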

Some kind of co-group operator for combining the variant groups, sumstats and LD matrix would definitely be more elegant (maybe via Dask Bags?), but the repetition of a few ints/floats (epsilon, deflated effect, variant group) in the DataFrame likely wouldn't be a big deal.

How does that sound to you? Then all the method needs is a variant index -> group(s) mapping and a variant index -> sumstats effect mapping. Personally, I think it would be fine to assume the user generates those for now without worrying about formalizing them much.


pval is within 5% for 98 out of the 100 genes; the other two had pval < 10^-8. The differences are probably due to the use of Liu not Imhof.

Noting from our call w/ @lorinanthony yesterday that differences in the tails are expected between methods like this and that we would probably be ok to use any python equivalent for the methods in the R CompQuadForm library (or at least I think that's the one).

Gaussian mixture model. I think we can use scikit-learn for this step

100% agreed.

In the context of sgkit, I hope we can use the ridge regression implementation from our regenie implementation.

🤔 I'm not so sure about that one. Fig. 3 in the paper doesn't look very promising for ridge solutions, and regenie would only be an approximate ridge solution (so presumably even worse). The software/paper recommend lasso regression. In this case the features (at least if they come from ld_matrix) would form an upper band matrix, so maybe there's some way to exploit that? I'm not sure where to go on this one other than maybe to convert the whole LD matrix DataFrame to some scipy sparse COO matrix and run sklearn Lasso on it. Have any other ideas?

tomwhite commented 2 years ago

Thanks @eric-czech, great summary!

I agree that it's worth trying out an approach using the DataFrame API and I'll try that first. I'd also be curious to see what a windowing approach looks like (and how it compares to a dataframe approach in terms of performance).

The regression part is still the biggest unknown. The paper talks about using a mix of ridge and lasso, which is ElasticNet in sklearn (glmnet in R). I think we can try using that, with sparse matrices, as you suggest. (ElasticNet converts any sparse matrices to CSC form, so we could convert the LD dataframe to that format.) The paper talks about partitioning by chromosome, so that is worth doing too for larger datasets. Do you have a sense for the size of the LD matrix, and its sparsity for the data you are working with?
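
As a quick sanity check that the sparse route hangs together, here is a small sketch. The triples, the penalty settings (alpha, l1_ratio), and the interpretation of the coefficients as the shrunken effects are illustrative assumptions, not the values or semantics used by genee:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import ElasticNet

# Build the LD matrix as a sparse COO matrix from long-format (i, j, value)
# triples and convert to CSC, the format ElasticNet uses internally.
i = np.array([0, 0, 1, 1, 2])
j = np.array([0, 1, 1, 2, 2])
r = np.array([1.0, 0.3, 1.0, 0.2, 1.0])
ld = sp.coo_matrix((r, (i, j)), shape=(3, 3)).tocsc()

# Marginal (standardized) effect sizes for the same variants; placeholder values.
beta = np.array([0.01, -0.02, 0.005])

# Mixed ridge/lasso penalty, as with glmnet. Note that unlike glmnet, sklearn's
# ElasticNet does not standardize the design matrix for you.
model = ElasticNet(alpha=1e-3, l1_ratio=0.5, fit_intercept=False)
model.fit(ld, beta)
beta_shrunk = model.coef_  # regularized effects (the "deflated" betas in this sketch)
```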

On the topic of sparsity, I noticed that the R implementation of genee standardizes the LD matrix. (The calls to glmnet do not include a standardize parameter, and it is TRUE by default.) What's more, the biglasso package doesn't have a way to not standardize the data.

However, I wonder if the LD matrix needs to be standardized, since it is a correlation matrix with values between -1 and 1, which means that the values are all on the same scale and mutually comparable. Of course, if we did standardize it, we would turn a sparse matrix into a dense one, which would cause memory issues.

(The betas do need to be standardized, but that is easy to do and has no memory issues.)

Looking at alternative implementations, Dask has a GLM package which includes an implementation of ElasticNet, but the page says "This library is not ready for use", and it appears to be unmaintained. So let's see how far we can get with sklearn.

eric-czech commented 2 years ago

Do you have a sense for the size of the LD matrix, and its sparsity for the data you are working with?

At the very upper end of that spectrum you'd have something like the full UKBB LD matrices from massive windows like 10 Mbp reaching several (or >10) TB in size [1].

As far as sparsity goes, some estimates I had from ld_prune_ukbb_sim.ipynb showed that there are about 300 variants in a 1 Mbp window in 1KG data. A lower bound estimate would be more like 50 variants in a 100 Kbp window (from HapMap in ld_matrix.ipynb).

The paper talks about partitioning by chromosome, so that is worth doing too for larger datasets

I wonder if there's a way to do the deflation step using LD scores (the sum of r2 for a variant across all other variants)? Those would be manageable to create on our own, or to use pre-computed (as the Broad team does). If that's possible, then we may want to consider the deflation step as external to our gene-ε implementation. We'd still need the LD matrix, but operating on it only in gene- or gene-set-specific groups would certainly be less expensive.
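
For reference, computing LD scores from a long-format LD DataFrame is only a couple of lines (column names i, j, value assumed, as in the sketch above):

```python
import pandas as pd

# One row per variant pair; the LD score of variant i is the sum of r^2 over
# all other variants j (the diagonal is excluded in these placeholder rows).
ld = pd.DataFrame({"i": [0, 0, 1, 1], "j": [1, 2, 0, 2], "value": [0.3, 0.1, 0.3, 0.2]})
ld_scores = ld.assign(r2=ld["value"] ** 2).groupby("i")["r2"].sum()
```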

However, I wonder if the LD matrix needs to be standardized, since it is a correlation matrix with values between -1 and 1

Great question -- that would certainly affect the deflated estimates. I'll ping Lorin and Wei in a shorter comment.

So let's see how far we can get with sklearn

👍

eric-czech commented 2 years ago

Hey @lorinanthony or @weichengv, what effect does standardization (of the LD matrix) in the glmnet/lasso step of gene-ε have? That would certainly alter the resulting deflated statistics, so we're wondering whether or not we should do it.

lorinanthony commented 2 years ago

Hey @eric-czech, you're absolutely right -- that would alter the resulting set of shrunken effect sizes. For that reason, I wouldn't standardize the LD matrix. Also, the LD matrix is symmetric, and the traditional way of standardizing might be problematic here.

tomwhite commented 2 years ago

Thanks @lorinanthony for confirming that!

@eric-czech I have implemented your suggestion using Dask dataframes here: https://github.com/tomwhite/sgkit/blob/genee/validation/gwas/method/genee/test_genee_dask.py. It still doesn't include a regression step.

tomwhite commented 2 years ago

So let's see how far we can get with sklearn.

I created this notebook to investigate: https://nbviewer.org/github/tomwhite/sgkit/blob/genee/validation/gwas/method/genee/genee-ld-simulation.ipynb

TLDR: memory is the limiting factor, but we can get quite a long way (millions of variants) by using GCP machines with >1TB of memory.

Interested to hear others' thoughts, as I may well have missed something.

eric-czech commented 1 year ago

From @lorinanthony on a minimum number of SNPs required per gene/region for tests:

Genes with only 1 SNP within their boundary were excluded