stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR
Other
184 stars 47 forks source link

Issue with credible sets purity filtering when external LD is involved #122

Open joshchiou opened 3 years ago

joshchiou commented 3 years ago

Hi there, I've been using susie_rss for fine mapping summary statistics using LD references from gnomAD. For most loci (112/134), I have been able to get credible sets, but I have been unable to get credible sets from the remaining loci.

I've noticed that for these loci, SuSiE usually converges within the 2nd iteration

res$converged

TRUE

res$niter

2

But there are no credible sets in the results

res$sets

$cs NULL $coverage NULL $requested_coverage 0.99

Any idea what's going on here? I'm happy to share the data if needed - I'm pulling from public sources.

pcarbo commented 3 years ago

@joshchiou If you could please share with us a reproducible example that results in no CSs, we'd be happy to take a look.

gaow commented 3 years ago

@joshchiou you requested a rather high coverage of 0.99. Do you find any CS if you lower it to say 0.90?

joshchiou commented 3 years ago

Thanks for the fast response. @pcarbo, I uploaded example 1 and example 2.

@gaow I have tried using the default coverage=0.95 and coverage=0.9 as well but have not found any CS.

Here is the relevant code I've been using. res <- susie_rss(data$z$z, as.matrix(data$ld), L=10, coverage=0.99, min_abs_corr=sqrt(0.1)

stephens999 commented 3 years ago

isn't it expected to find no CSs if there are no sufficiently strong signals? Is it possible that these regions do not have sufficiently strong signals?

On Wed, May 5, 2021 at 9:55 AM Josh Chiou @.***> wrote:

Thanks for the fast response. @pcarbo https://github.com/pcarbo, I uploaded an example here https://www.dropbox.com/s/iiphf5ujbzel2of/test_data.rds?dl=0.

@gaow https://github.com/gaow I have tried using the default coverage=0.95 and coverage=0.9 but have not found any CS

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/122#issuecomment-832755647, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRIWDIVH4KC5VCQBSJTTMFL65ANCNFSM44DYECHA .

joshchiou commented 3 years ago

I suppose it's possible. I have been using a list of index variants from the paper describing the original GWAS. I checked the regions that did not output CSs and a few of them are sub-genome-wide significant (P<5e-8) in the published discovery summary statistics.

That being said, most of them are above or near genome-wide significance (P<=5e-8). I did try lowering the coverage to 0.9, and it seemed to produce CSs for some, but not all of them. Is using a higher coverage (e.g. 0.99) not recommended? Many fine mapping papers report 99% credible sets, which is why I typically default to 0.99.

pcarbo commented 3 years ago

@joshchiou Thanks for sharing the example above. Among the examples that did not produce a CS, what is the smaallest p-value?

joshchiou commented 3 years ago

Here is an example 3 of one that did not produce a CS using lower coverages (coverage=0.9 or coverage=0.95). The p-value of the index variant of the region is ~1.5e-8.

Not sure whether this is related, but I am also having issues fine mapping regions with very strong signal (P<1e-300) see example 4. Although it eventually converges after 1000s of iterations, the CSs all contain exactly 1 variant which are in high LD with each other (abs(r)>0.9).

stephens999 commented 3 years ago

can you make sure you are running the latest version of SuSie from github?

On Wed, May 5, 2021 at 6:09 PM Josh Chiou @.***> wrote:

Here is an example 3 https://www.dropbox.com/s/ml45jyiv06zft0k/test_data3.rds?dl=0 of one that did not produce a CS using lower coverages (coverage=0.9 or coverage=0.95). The p-value of the index variant of the region is ~1.5e-8.

Not sure whether this is related, but I am also having issues fine mapping regions with very strong signal (P<1e-300) see example 4 https://www.dropbox.com/s/sijiffpdbj7c6xm/test_data4.rds?dl=0. Although it eventually converges after 1000s of iterations, the CSs all contain exactly 1 variant which are in high LD with each other (abs(r)>0.9).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/122#issuecomment-833106152, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRP7SQFM2GDORVNG7F3TMHF3RANCNFSM44DYECHA .

joshchiou commented 3 years ago

Sure, I'm currently running 0.11.7 (Apr 27) but I can try again with the latest commit.

zouyuxin commented 3 years ago

@joshchiou How did you get the LD matrix? I checked example 4, the SNPs in the identified CSs are highly correlated (based on the LD matrix), but their z scores are quite different. If 2 SNPs are highly correlated, their z scores should be very similar. This suggests that the LD matrix is from a poor reference panel.

stephens999 commented 3 years ago

it seems in your example 3 that the problem is that the CS is being filtered out due to low purity.

Try this: temp = readRDS("~/Downloads/test_data3.rds") temp.fit = susie_rss(temp$z[,1],as.matrix(temp$ld)) temp.cs = susie_get_cs(temp.fit,as.matrix(temp$ld),min_abs_corr = 0.0)

I do think it is a bit surprising at first site that this CS should be low purity... i will have to look a bit more at it. Matthew

On Wed, May 5, 2021 at 6:20 PM Matthew Stephens @.***> wrote:

can you make sure you are running the latest version of SuSie from github?

On Wed, May 5, 2021 at 6:09 PM Josh Chiou @.***> wrote:

Here is an example 3 of one that did not produce a CS using lower coverages (coverage=0.9 or coverage=0.95). The p-value of the index variant of the region is ~1.5e-8.

Not sure whether this is related, but I am also having issues fine mapping regions with very strong signal (P<1e-300) see example 4. Although it eventually converges after 1000s of iterations, the CSs all contain exactly 1 variant which are in high LD with each other (abs(r)>0.9).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

joshchiou commented 3 years ago

@zouyuxin I created the matrix by extracting from pre-computed LD values from gnomAD v2.1.1 NFE (Non-finnish European, N= 7,718). It's not an exact match to the GWAS (of European samples) I'm trying to fine map, but unfortunately I don't have access to the underlying genotypes.

That being said, I have noticed the same thing happening as example 4 for regions with strong signals for different traits in the UK biobank. In those cases, I was using the same samples to calculate the LD matrix.

stephens999 commented 3 years ago

If you can give us an example of this happening with an in-sample ld matrix we can look into it...

On Wed, May 5, 2021, 21:10 Josh Chiou @.***> wrote:

@zouyuxin https://github.com/zouyuxin I created the matrix by extracting from pre-computed LD values from gnomAD v2.1.1 https://gnomad.broadinstitute.org/downloads NFE (Non-finnish European, N= 7,718). It's not an exact match to the GWAS (of European samples) I'm trying to fine map, but unfortunately I don't have access to the underlying genotypes.

That being said, I have noticed the same thing happening as example 4 for regions with strong signals for different traits in the UK biobank. In those cases, I was using the same samples to calculate the LD matrix.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/122#issuecomment-833172189, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRO2TB7IA535GQ225VTTMH277ANCNFSM44DYECHA .

joshchiou commented 3 years ago

Here is example 5, which eventually converges with an in-sample LD matrix (generated by plink using 341k unrelated UK biobank samples).

zouyuxin commented 3 years ago

@joshchiou Could you share more details about how did you compute the z scores? Is it about a quantitative trait or a binary trait?

stephens999 commented 3 years ago

in example 5 the LD matrix is not positive semi-definite. Possibly this is rounding error. But is it possible it is computed with missing genotypes or some other issue?

sort(ld.eigen$values)[1:10] [1] -1.1655572 -0.7359617 -0.6541895 -0.4747877 -0.3862273 -0.3661112 [7] -0.3456588 -0.3045849 -0.2319399 -0.2089683

On Thu, May 6, 2021 at 5:25 PM Yuxin Zou @.***> wrote:

@joshchiou https://github.com/joshchiou Could you share more details about how did you compute the z scores? Is it about a quantitative trait or a binary trait?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/122#issuecomment-833912689, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRI7KGPXZ4BKYY2NFGLTMMJL5ANCNFSM44DYECHA .

joshchiou commented 3 years ago

@zouyuxin In example 5, the z-scores are from a quantitative trait. Examples 1-4 are from a binary case/control trait.

@stephens999 I computed the LD matrix from imputed genotypes with plink 1.9 - rounding error is definitely a possibility, also hard calls instead of dosages (which the GWAS used) might contribute additional error.

stephens999 commented 3 years ago

to return to the original issue: the answer is that the cs is being pruned due to being low purity. You can avoid the purity filter by something like: susie_get_cs(fit,as.matrix(temp$ld),min_abs_corr = 0.0)

Removing the purity filter in this way will tend to also return some "irrelevant" CSs that contain very large numbers of SNPs, but you can make a decision about those -- they are fairly obvious. (Looking at the estimated prior variance using susie_get_prior_variance(fit) may also be helpful -- small values here may indicate irrelevant CSs).

My guess is that this problem here may be partly due to a poor LD matrix -- the SNPs in the CS are probably in higher LD than the matrix indicates, and so if we had access to the correct matrix maybe the filter would not remove this CS.

I'm leaving this issue open as it may be worth us thinking about whether the default filter needs modifying when using an out-of-sample LD matrix.... certainly we might want to document this issue as something for users to be aware of.