stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR
Other
184 stars 47 forks source link

Variant with largest PIP is not included in credible sets #189

Closed VitorAguiar closed 1 year ago

VitorAguiar commented 1 year ago

Hello,

is it expected that the SNP with the largest PIP is not included in the credible sets?

For the plots below, I've used a 500kb region from a GWAS with N ~ 11K (4K patients, 7K controls) and LD matrix from 1000 Genomes Project (N = 503). I asked for a coverage of 90%. The CS in green has P = 93.0%, and the CS in blue has P = 95.3%. The SNP on the top has PIP = 0.66.

a b
susie zscores
pcarbo commented 1 year ago

@VitorAguiar It is possible one or more CSs are being removed due to low purity. Try setting min_abs_corr in susie or susie_rss to a smaller number (e.g., 0.1) and see if that fixes the issue.

VitorAguiar commented 1 year ago

Hi @pcarbo, setting min_abs_corr = 0.1 does not fix it. However I found something interesting.

For the plot above, I used L = 3. If I simply use L = 5 (and back to the default value for min_abs_corr), that SNP on the top now has PIP=0.949 and it's the single variant in a new CS3 (blue).

Therefore, with L = 5 I have 3 credible sets, whereas with L = 3 I have 2 credible sets. I don't understand how that can happen, I mean, using L = 3 should also reveal 3 credible sets, right?

susie_region9

pcarbo commented 1 year ago

I agree that should not happen. Can you try with L = 3 and refine = TRUE?

VitorAguiar commented 1 year ago

@pcarbo, using susie_rss(z, ld, n, coverage = 0.9, r_tol = 1e-05, L = 3, refine = TRUE) does not fix it, I still get 2 CSs and the top variant is not captured by any CS.

stephens999 commented 1 year ago

what is the ELBO value for the different runs?

(In rare cases it is possible that increasing L can lead to finding a better solution...)

Matthew

On Wed, May 24, 2023 at 1:03 PM Vitor @.***> wrote:

@pcarbo https://github.com/pcarbo, using susie_rss(z, ld, n, coverage = 0.9, r_tol = 1e-05, L = 3, refine = TRUE) does not fix it, I still get 2 CSs and the top variant is not captured by any CS.

— Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/189#issuecomment-1561710025, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRN7PLITAKRC2YQI3J3XHZEN3ANCNFSM6AAAAAAYMPS35M . You are receiving this because you are subscribed to this thread.Message ID: @.***>

VitorAguiar commented 1 year ago

Hi Matthew,

For L = 3:

-15455.63 -15454.81 -15454.76 -15454.76 -15454.75 -15454.75

For L = 5:

-15454.91 -15453.16 -15452.94 -15452.91 -15452.9 -15452.9 -15452.9

Now I just noticed that if I use L = 5 and refine = TRUE, the algorithm converges in 2 iterations, and I'm back to 2 CSs and the top PIP variant (PIP=0.88) is not included in any CS. The ELBO in this case is:

-15451.95 -15451.95
pcarbo commented 1 year ago

Can you share a plot of z-score vs. variable? (Same as the PIP plot except that you are showing the z-scores instead of the PIPs.)

VitorAguiar commented 1 year ago

@pcarbo I suppose this is the plot you want to see:

susie_region9_L5_pip

VitorAguiar commented 1 year ago

Thank you all for being so helpful! I just want to add an additional analysis.

Instead of defining a ±250kb window around a lead SNP as I did above, I also tested a larger window corresponding to the LD block in Europeans (as it was done by Kundu et al, Nature Genetics, 2022; see "Methods: Locus definition"). Then I ran susie_rss(z, R, n, coverage = 0.9, r_tol = 1e-05, L = 5).

In the plot below, the vertical lines correspond to the small window tested in my previous posts.

susie_LDregion9_L5_pip

One CS is found outside the smaller window, and the right-most half of the smaller window in not included in the LD-based window, but if I compare the PIPs for the SNPs included in both datasets (~460 SNPs), I find this:

test_plot

So, the SNP with high PIP not included in any CS when L = 3 in my previous posts now has a low PIP of 0.01!

pcarbo commented 1 year ago

@VitorAguiar Regarding the z-scores, yes as you can see the third CS does not have very strong support (the first two CSs have SNPs which lager z-scores), which I think explains why you are usually only getting two CSs. So in the end the susie fit with two CSs is probably the right result (even if the PIP for that one SNP in the third CS is high).

VitorAguiar commented 1 year ago

@pcarbo That is interesting. So, would you say that besides PIP I should also put a threshold on the z-scores of variants in a CS to be confident of the CS?

The results of Weissbrod et al, Nature Genetics, 2020 comes to my mind. If I remember correctly, they used functional annotations to tune the prior probabilities at each SNP for Susie, and discuss that 10% of the 95% CSs that they find are not GWAS genome-wide significant, and the reasons for that could be (1) GWAS threshold is too stringent; (2) hidden effects by linkage masking might be revealed by Susie modeling; and (3) functionally-informed fine mapping can identify true weak effects.

pcarbo commented 1 year ago

So, would you say that besides PIP I should also put a threshold on the z-scores of variants in a CS to be confident of the CS?

In principle, you shouldn't have to since susie should automatically determine the number of CSs so long as estimate_prior_variance = TRUE. You could also look at the log-Bayes factor (lbf) output to assess the support for the CS.

stephens999 commented 1 year ago

going back over the thread i realize i didn't look at the initial post carefully enough. The short answer to the question "is it expected that the SNP with the largest PIP is not included in the credible sets?" is that this is not that unexpected.

eg If you have a "borderline significant" SNP that is not in LD with much else, then that SNP may have a moderately large PIP (eg 0.5) but not be included in any CS.

VitorAguiar commented 1 year ago

I see. Thank you all for this incredible interaction. I learned that I need to be careful and explore different values for L and genomic window sizes, and check the z-scores and lbf to interpret SNPs that are or are not included in a CS.

I worry though that many people are just reporting 95% CSs, and it seems that results can change a lot depending on how you run the analysis.

pcarbo commented 1 year ago

I worry though that many people are just reporting 95% CSs, and it seems that results can change a lot depending on how you run the analysis.

We share this sentiment! Good documentation can help but as you can see it is work-in-progress…