stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR
Other
172 stars 43 forks source link

GWAS statistics and LD discrepancy due to inconsistent sample sizes across SNPs #135

Open uncreativeusername opened 2 years ago

uncreativeusername commented 2 years ago

When running susie_rss, I get an error:

Error in if ((elbo[i + 1] - elbo[i]) < tol) { : missing value where TRUE/FALSE needed

I only get this error for a small proportion of regions I'm testing, and only get it if I allow fairly high max_iter (>10,000); if I run it on, say, 1000 the IBSS algorithm does not converge. Advice on how to fix this?

zouyuxin commented 2 years ago

Hi, it will be helpful if you can share code and data for a small region to reproduce the error. How did you obtain the summary statistics and LD matrix?

stephens999 commented 2 years ago

Anecdotally, large numbers of iterations can occur when there is a poor match between the input ld matrix and the true ld matrix...

rikutakei commented 2 years ago

Hi, I'm having the exact same problem with several loci I'm working with.

I am using susieR version 0.11.43 and I have been using susie_rss to finemap loci using UKBB as LD reference.

GWAS summary data is generated by meta-analysing UKBB and other EUR ancestry data, so the samples used for LD calculation overlaps but don't completely match with my summary data (perhaps this is causing the problem as pointed out by @stephens999 )

Here is a couple of example data (one relatively standard locus and one strong locus - both return the error described in this issue) and this is what I am running to get this error:

model = susie_rss(data$z, ld, L = 10, coverage = 0.99, max_iter = 1000000)

Are there any parameters I can change (e.g. tol) to get susie to work with these loci?

Thanks for your help!

uncreativeusername commented 2 years ago

Thanks to the previous poster for providing an example! I was seeking permission to share data but this is much easier.

I am also using UKBB as the LD reference. My data are European, but not from UKBB, so perhaps the LD reference is sometimes a relatively poor match. That said, there aren't glaring cases in my examples where LD between two variants is high but zscores are quite different. How robust is the method to slight mismatches? Can you recommend a filter or test for determining if the reference is "good enough" for SuSiE to run? We're very excited about the method but need to understand its limits.

stephens999 commented 2 years ago

thanks for the bug reports - we will look at them.

One common issue to check for is "allele flips" between the z scores and the LD reference. (ie the z scores are oriented to the wrong allele compared with ld). You can check for this using the methods described in this vignette: https://stephenslab.github.io/susieR/articles/susierss_diagnostic.html The estimated value of s may also help indicate how well the reference and study match...

Also be sure that the ld matrix is R not R^2. Matthew

stephens999 commented 2 years ago

so I looked at locus33 there are some very clear discrepancies between LD and z scores and this is likely the root cause of the problems (although I don't know yet what causes the elbo computation to glitch).

eg look at this:

> locus33$data[1414,]
     CHR       BP       SNP        z
1414   2 27750545 rs2950835 16.45588
> locus33$data[1415,]
     CHR       BP       SNP        z
1415   2 27750546 rs2911711 10.09009
> locus33$ld[1414,1415]
V1415 
    1 
> locus33$ld[1414,1408]
   V1408 
0.999887 
> locus33$data[1408,]
     CHR       BP       SNP     z
1408   2 27748904 rs1313566 22.04

These 3 SNPs are in very high LD but have very different z scores. I'm not sure what the cause is, but it will make the fine mapping results very unreliable. Is it possible that the meta-analysis results uses different samples at different SNPs? It is crucial to use the same samples at every SNP when doing this kind of thing.... (eg by imputing all studies in the meta-analysis to the same SNPs)

I found these examples by using

temp = kriging_rss(locus33$data[,4],as.matrix(locus33$ld))
plot(temp$conditional_dist$cond_z)
which.max(abs(temp$conditional_dist$cond_z))

Here cond_z is a measure of how badly the observed z fits the expectation (sorry it is not very well documented yet.) Big values are suspicious.

@uncreativeusername if you can't share your data maybe this use of kriging_rss can help you check if your z scores match the LD?

zouyuxin commented 2 years ago

The similar problem happened to locus 26. There could be allele flips problem (the z scores are oriented to the wrong allele compared with ld).

>  dat$ld[3061,3061]
0.999966
>  dat$data[c(3061, 3064),]
     CHR        BP       SNP         z
3061   1 212215104 rs3010024 -1.568627
3064   1 212215819 rs3010023  2.063636

The 2 SNPs are positively correlated, but the z scores are in opposite sign.

For the elbo computation, I'm trying to reproduce the error.

uncreativeusername commented 2 years ago

Thank you for pointing me to the estimate_s_rss function. It looks like our issues are arising in cases where the s-value is elevated. Do you have a suggested threshold for which the LD reference is "too poor" of a fit for SuSiE to run reliably?

zouyuxin commented 2 years ago

@uncreativeusername @rikutakei I updated the function to give a clearer error message. We suggest using kriging_rss function to find if the z scores match the LD (see usage in https://stephenslab.github.io/susieR/articles/susierss_diagnostic.html). It's very likely to have an ''allele flip'' problem without noticing (the reference and alternative alleles are mismatched between z scores and the reference panel). The result could improve once you identify and fix the mismatch between z scores and LD matrix.

stephens999 commented 2 years ago

sorry i don't think we do have a recommended threshold. It would be nice to have one, but I'm not sure how to set it.

On Fri, Jul 30, 2021 at 1:34 PM uncreativeusername @.***> wrote:

Thank you for pointing me to the estimate_s_rss function. It looks like our issues are arising in cases where the s-value is elevated. Do you have a suggested threshold for which the LD reference is "too poor" of a fit for SuSiE to run reliably?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/135#issuecomment-890078804, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRP4YHHQLOSC57ZQCKLT2LWCNANCNFSM5BAQDZHA .

rikutakei commented 2 years ago

Thank you @stephens999 @zouyuxin , I will try and fix the allele flips and also check the s-value for all the loci I'm working on.

Is it possible that the meta-analysis results uses different samples at different SNPs? It is crucial to use the same samples at every SNP when doing this kind of thing.... (eg by imputing all studies in the meta-analysis to the same SNPs)

Yes, different SNPs will have slightly different studies/samples included when meta-analysed together. All of the variants I am using have passed certain level of sample size/number of studies from the meta-analysis and I think the summary stats are of good quality, although some SNPs may be present in more studies and therefore have larger N. How much of a difference in sample size between SNPs is problematic?

I have tried using kriging_rss function before to diagnose the problem, but it gave me (and it still gives me) an error when plotting the cond_z plot - I'm not sure if you got the same error and that's why you have plotted it separately

> kriging_rss(locus26$data$z, locus26$ld)
$plot
Error: Aesthetics must be either length 1 or the same as the data (4720): x and y

Is this related to having flipped alleles too, or is it a different bug? (or maybe I just need to update susieR...)

zouyuxin commented 2 years ago

different SNPs will have slightly different studies/samples included when meta-analysed together.

One crucial model assumption is the same samples are used to compute z scores for each SNP. The correlations between z scores in the model depend on this assumption. This assumption is also required in other Bayesian fine-mapping methods. Basically, if totally different samples are used to compute z scores for two SNPs, the correlation between the z scores will be 0, which violates the model for summary statistics. So it's important to use genotype imputation to have the same samples for each SNP.

Is this related to having flipped alleles too, or is it a different bug? (or maybe I just need to update susieR...)

Please update susieR.

stephens999 commented 2 years ago

I don't know why the plot is giving you an error but the default plot produced is different than the one i plotted of cond_z. That's why i used different code above.

rikutakei commented 2 years ago

Basically, if totally different samples are used to compute z scores for two SNPs, the correlation between the z scores will be 0, which violates the model for summary statistics.

@zouyuxin I think I understand what you're saying. Yes, using z scores of two SNPs from two totally different studies/samples would be problematic, and I assume you're describing the most extreme scenario.

However, what I was trying to describe earlier was where all the SNPs have data from at least X studies/samples, but some SNPs have X + Y studies/samples. In this case, the correlation won't be 0, but it won't be perfect either, right?

So how much overlap of studies/samples is good enough? Or should I be discarding the extra Y studies/samples for fine-mapping purposes?

So it's important to use genotype imputation to have the same samples for each SNP.

I am using imputed data for the meta-analysis so majority of SNPs are present in most (if not all) of the studies. Imputation isn't perfect and there will be differences in which SNPs get imputed depending on the reference panel, imputation software, input data, etc. and hence some SNPs have different sample size in my data.

zouyuxin commented 2 years ago

However, what I was trying to describe earlier was where all the SNPs have data from at least X studies/samples, but some SNPs have X + Y studies/samples. In this case, the correlation won't be 0, but it won't be perfect either, right?

Right. The correlations need to account for those different samples.

So how much overlap of studies/samples is good enough? Or should I be discarding the extra Y studies/samples for fine-mapping purposes?

Sorry, we don't have practical suggestions for different samples used. I think it's better to discard extra Y studies/samples. See Section 5 in 'Bayesian large-scale multiple regression with summary statistics from genome-wide association studies', if you want to know more details about the impact of using different samples.