stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR
Other
169 stars 42 forks source link

fine-mapping using susieR with summary statistics from QTLtools or with individual genotype data? #186

Open caydenpenn opened 1 year ago

caydenpenn commented 1 year ago

Hello! I am currently working on eQTL fine-mapping analysis with susieR. I have already mapped eQTLs and performed conditional eQTL mapping with QTLtools, which provided me the summary statistics data. I also have individual genotype data. My next step is to perform fine-mapping for these eQTLs. I have several questions regarding this process.

My first question is whether I need to include all variants (such as all variants surrounding 1M bp of genes with an eQTL), or just the variants showing signatures in the conditional eQTL mapping analysis (~200 variants for each gene) with QTLtools. If I need to include all variants, it seems that I did not need to run QTLtools. Can I simply run susieR for all genes and use the results downstream? It is confusing to me whether non-fine-mapping tools are still necessary when individual genotype data is available, given that susieR is both fast and robust at identifying causal variants.

My second question is whether I need to use the summary statistics of the eQTLs (using susier_rss) or just the individual genotype data (using susier) when fine-mapping only the conditional eQTLs from QTLtools.

Lastly, I am wondering if you have suggested cutoffs for extracting causal variants from the susieR results. I plan to use the variants in the 95% credible sets as my causal candidates. However, I am unsure how to narrow down those candidates further. For example, should I set a PIP threshold greater than 0.8? In some credible sets with multiple variants, no single variant exceeds the PIP 0.8 cutoff. How should I interpret such credible sets or variants in those sets? Any guidance would be greatly appreciated. Thank you!

gaow commented 1 year ago

Hi @caydenpenn

conditional eQTL mapping with QTLtools,

Could you clarify what you meant by conditional eQTL?

whether I need to include all variants (such as all variants surrounding 1M bp of genes with an eQTL), or just the variants showing signatures in the conditional eQTL mapping analysis

You need to include all variants (after some MAF filter if you like)

Can I simply run susieR for all genes and use the results downstream?

Unless your sample size is huge, for eQTL studies with a few thousand individals (or when the number of individuals is not more than the number of variants to analyze) I would run SuSiE with individual level data directly, although you need to deal with covariates properly #82 . Alternatively, you can also run susie_rss with in sample LD, effect size, standard error and smapel size which is the sufficient statistics set of input and should give you identical result of individal level data fine-mapping

However, I am unsure how to narrow down those candidates further. For example, should I set a PIP threshold greater than 0.8?

It depends on your goal of the analysis. SuSiE credible set (CS) by definition is meant to capture a single effect ie it should contain with eg 95% probability the non-zero effect. But it is hard to tell which one it is. Of course the one with largest PIP inside the CS would be the most likely, and if this PIP is large enough eg >0.8 you will be somewhat confident to claim it is the non-zero effect SNP according to your data. However imagine you have 10 SNPs in complete LD within a CS and one of them is the causal SNP, then all your PIP should be 1/10 = 0.1 because your variants are identical. In this case, it would be hard to select the one SNP from this CS, although it should by design still captures the non-zero effect variant of interest.

caydenpenn commented 1 year ago

Hi Gao @gaow, thank you for your helpful suggestion which resolved all my questions. I will run susie for all genes with all surrounding common variants (MAF>=0.01) after controlling expression variation from covariates as you suggested.

By saying conditional eQTL mapping, I mean independent eQTL mapping which assumes several independent eQTLs in a gene (similar to the concept of L parameter in susier), while typical QTL mapping in QTLtools only report one eQTL in one gene.

caydenpenn commented 1 year ago

Hi Gao @gaow, thank you for your helpful suggestion which resolved all my questions. I will run susie for all genes with all surrounding common variants (MAF>=0.01) after controlling expression variation from covariates as you suggested.

By saying conditional eQTL mapping, I mean independent eQTL mapping which assumes several independent eQTLs in a gene (similar to the concept of L parameter in susier), while typical QTL mapping in QTLtools only report one eQTL in one gene.

caydenpenn commented 1 year ago

Hi Gao @gaow, I have another question about running susieR.

When using the function susie (i.e. susie(x,y, L = 10)), I noticed that it is very fast and only takes a minute for each gene (~15K variants are included). As far as I understand, in the susie function, the correlation of pairwise variants is automatically calculated very quickly. The correlation matrix is similar to the LD matrix but not the real LD matrix (which takes allele frequency and other factors into consideration) biologically.

However, I want to use the real LD matrix that is calculated from my genotype data. To do this, I need to calculate pairwise LD for ~15K variants surrounding a gene. Then I can use the function (i.e. susie_rss(bhat = sumstats$betahat, shat = sumstats$sebetahat, n = n,R=R, var_y = var(y), L = 10) where R is the LD matrix estimated using genotype data.

The problem is that measuring LD matrix is computationally intensive, and it would take about 16 minutes to calculate LD for ~15K variants surrounding a gene using the PLINK tool. So the bottleneck here is measuring the LD matrix when I want to apply susie_rss on a genome-wide scale.

My question is, do you know if there is any difference in power between using the correlation matrix (e.g. R=cor(X)) and the real LD matrix? If there is no significant difference, I think I could apply the susie function (i.e. susie(x,y, L = 10)) to the whole genome, which would be very fast. Otherwise, I will have to measure the LD matrix, which is a time-consuming process.

Thank you.

stephens999 commented 1 year ago

the correlation matrix and the LD matrix are essentially the same thing here, and both take a long time to compute if you have many markers. Certainly it is intractable to compute either genome-wide.

susie(x,y, L = 10) is fast because it never computes the correlation/LD matrix.

caydenpenn commented 1 year ago

@stephens999, thank you for clarifying that susie(x,y, L = 10) never computes the correlation.

Then my question for now is if there is any difference in the power of fine-mapping between using susie and susie_rss? Since susie_rss needs the input of LD which is computational intensive while susie doesn't, if the two function have similar power, I would prefer to use susie without providing the LD matrix. Thank you.

gaow commented 1 year ago

@caydenpenn I would use individual level data susie() whenever it is possible and feasible

pcarbo commented 1 year ago

@caydenpenn susie will always give results that are at least as good as susie_rss (assuming the data are the same).

caydenpenn commented 1 year ago

Cool, thank you so much for the suggestion and clarification. @gaow @pcarbo

kauralasoo commented 1 year ago

@caydenpenn A shameless plug, but we have developed a Nextflow workflow for eQTL mapping that also runs susieR on individual-level data and should handle covariates appropriately: https://github.com/eQTL-Catalogue/qtlmap.