LD matrix from PLINK 1.9 can be inconsistent with z-scores when --keep-allele-order is not specified

pietznerm commented 2 years ago

Dear SuSiE team,

I am a huge fan of your method and currently working on a pipeline to use SuSiE, susie_rss specifially, to finemap GWAS regions from UKBB data on binary traits. Although this seems to give reasonable results for most regions, some regions fail with "The estimated prior variance is unreasonably large". This seems to be specific to regions with rare variants as the lead signal in the locus. Have you made any similar experience in the past and how would your recommend to treat those regions? I have tried multiple parameter settings, none of which seem to solve the problem.

A brief word on the stats: I have used REGENIE to run GWAS for binary traits in UKBB and use a random 30k panel of white unrelated individuals to generate the LD matrix from.

Thanks, Maik

stephens999 commented 2 years ago

as you may know, this error occurs due to mismatch between z scores and LD panel. In your case there are (at least) 2 different possible reasons for mismatch:

i) although your 30k panel should be a "good" match generally, maybe sometimes not good enough for rare variants? (how rare are these variants?) This could be fixed by using a bigger panel -ideally exactly the same individuals that were used for the GWAS.

ii) (maybe) you are using logistic regression for binary traits to compute the z scores. Logistic regression may introduce non-linearities that could cause mismatch between the z scores and panel (this is speculation -- we don't know that this happens, but it seems possible). Possibly using linear regression instead of logistic regression (even though this is binary trait) could help here. Again this is more speculative.

My suggestion would be to try a bigger panel first...

pietznerm commented 2 years ago

Thank you, for your quick reply! I will use a larger reference panel and let you know whether that solved the problem.

Best, Maik

Holen Sie sich Outlook für iOShttps://aka.ms/o0ukef

Von: stephens999 @.> Gesendet: Tuesday, November 23, 2021 1:25:31 AM An: stephenslab/susieR @.> Cc: Maik Pietzner @.>; Author @.> Betreff: Re: [stephenslab/susieR] susie_rss and estimated prior variance is unreasonably large with matching LD panel (Issue #148)

as you may know, this error occurs due to mismatch between z scores and LD panel. In your case there are (at least) 2 different possible reasons for mismatch:

i) although your 30k panel should be a "good" match generally, maybe sometimes not good enough for rare variants? (how rare are these variants?) This could be fixed by using a bigger panel -ideally exactly the same individuals that were used for the GWAS.

ii) (maybe) you are using logistic regression for binary traits to compute the z scores. Logistic regression may introduce non-linearities that could cause mismatch between the z scores and panel (this is speculation -- we don't know that this happens, but it seems possible). Possibly using linear regression instead of logistic regression (even though this is binary trait) could help here. Again this is more speculative.

My suggestion would be to try a bigger panel first...

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/stephenslab/susieR/issues/148#issuecomment-976035966, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ARL2B3KR322EUKHCBGK5AYDUNLNPXANCNFSM5IRKZYNQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

pietznerm commented 2 years ago

Hi again, and sorry for the very late reply. Choosing a larger reference panel solved some of the problems and some regions might simply be not 'heritable' enough. However, I observed that sometimes, the SNP with the largest PIP is not part of a credible set and I don't quite understand, why this is. Does this mean, that this SNP carries all the unexplained variance?

I should add, that I have used LDstore2 to generate the LD matrix using UK Biobank BGEN files (GWAS is within UKB as well). The matrix is reported to not be symmetric and positive semidefinite. However, computing a correlation matrix with ~350k observations and ~5k SNPs seems to be a bit of too much for R...

Thanks, Maik

stephens999 commented 2 years ago

the PIP is a function of both the strength of the association and LD with other SNPs. If a SNP has a modest association and is not in LD with other SNPs it may have a moderate PIP but not be in any CS. On the other hand, if SNP has strong association but is in LD with many other SNPs it may have small PIP but be included in a CS along with the other SNPs.

pietznerm commented 2 years ago

Thanks for the quick answer, which sounds perfectly reasonable. However, in my example (happy to share stats and LD matrix), the SNP(s) selected in the second credible set are not associated with the trait, while the one with high PIP (~0.5) is not. I will try to tweak the LD matrix that goes in, and see whether this improves anything.

Thanks, Maik

RL-m commented 2 years ago

Dear SuSiE developers, I met a similar situation as mentioned here.

1) In my case, we used summary data and LD reference panel from the exact same individuals, but we still get the error message-- "The estimated prior variance is unreasonably large". In our other trials( using in-sample dosage LD), we often got warning messages about IBSS algorithm not converging. According to your replies, it seems that these two situations are both because there is inconsistency between summary statistics and LD panel. Would you please suggest some solutions for this problem? If my understanding is incorrect, please remind me. We called susie_rss function as: fitted_rss <- susie_rss(z_scores, R, L = 10, max_iter=1000).

2) Because the error message show that something was wrong with prior variance, I used the argument "estimate_prior_variance=FALSE" in susie_rss, and I got no warnings and errors. I'm not sure if I could solve the problem in this way and if I could trust the results under this command.

Thank you! Ruilei

gaow commented 2 years ago

However, in my example (happy to share stats and LD matrix), the SNP(s) selected in the second credible set are not associated with the trait

@pietznerm please feel free to email that example region to us (my address is wang.gao [at] columbia [dot] edu).

gaow commented 2 years ago

@RL-m with in sample LD the calculation is (almost) exact and we have never seen issues in that situation before. That said, to verify, do you think you can put together the data to run susie() instead with genotype matrix and phenotype vector (or residual phenotype after regressing out covariates, if necessary) and see what happens?

stephens999 commented 2 years ago

yes, @RL-m to confirm what @gaow said, If in-sample LD matrix is used (with no missing genotype data) and everything is being computed correctly it should not happen. So it suggests there is a problem with the pipeline being used to produce the LD matrix and summary data.
What software are you using to compute LD matrix and z scores? I would double-check that pipeline.

Guan-Wang-123 commented 2 years ago

Hello all,

I've also encountered unreasonably large estimated prior variance issue using the summary statistic data.

Briefly, I used imputed vcf.gz to compute the in-sample LD matrix using PLINK v1.07 (--r --matrix functions) and to generate beta and se values for calculating z_scores following SNPTEST2.

After setting "estimate_prior_variance=FALSE" in susuie_rss(), I received warnings that "IBSS did not converge in 100 iterations" and then run the diagnostic plots to check consistency between summary statistics and LD matrix as suggested. The s estimate is ~0.59. I noted 25 variants had log LR > 2 and abs(z) > 2, and then checked the reference alleles of these 25 variants; no allele mis-match found between the SNPTEST results and the PLINK .bim file. 10 of the 25 variants are in the same gene encompassing the most significant GWAS signal from this study.

To ensure no allele mismatch, I also updated the reference allele (using --update-ref-allele in PLINK 1.9) when generating the LD matrix. I rerun susie diagnostic plot, 26 variants were marked in red this time (25 are the same as above). Results from susuie_rss() are identical.

This dataset tested above is a ~1Mb region containing 2590 SNPs and 175 samples (admixed population).

Could you spot anything unusual from the above that may explain the prior variance issue? Thanks.

Guan

RL-m commented 2 years ago

@gaow Thanks for your suggestion. I used susie() with adjusted genotype and phenotype and most of the results converged.

1). I double checked this issue using the same data on both individual level (run susie()) and on summary level with in-sample LD (run susie_rss()) . All susie() results were converged, while error occurred when running susie_rss() with summary statistics, showing that "The estimated prior variance is unreasonably large". In this test, I used plink1.9 --r square spaces command to generate LD matrix. The following function was used to generate z scores: gwas <- function (X,y) { betaList <- c(); seList <- c(); for (i in 1:ncol(X)) { regression <- lm(y ~ X[,i]); beta <- regression$coefficients[2]; se <- summary(regression)$coefficient[2,2]; betaList <- append(betaList,beta); seList <- append(seList,se) } return(data.frame(beta=betaList,se=seList)) }

2). @stephens999 Thanks for your reply. In former tests, I used fastGWA to generate summary data from UKB phenotype and genotype. Plink 1.9 was used to compute a 10K in-sample LD reference.

gaow commented 2 years ago

--r square

correlation should not be squared ... that's where the problem is.

RL-m commented 2 years ago

--r square

correlation should not be squared ... that's where the problem is.

It's not because of this problem. In plink 1.9, this --r square spaces command does compute correlation rather than squared correlation and "square" only makes the output a matrix shape. "--r2" calculates squared correlation, which I didn't use.

gaow commented 2 years ago

@RL-m you are right! And your code (in your email earlier today) looks fine to me.

Could you email me the entire script you use to reproduce the issue? I could indeed try to write the code to load PLINK data to R etc but that will take some time. It's the best if you can send me enough information so I can reproduce your problem in a minute. That will greatly help (and motivate) me to try fix it for you.

RL-m commented 2 years ago

Hi all,

I solved the inconsistency error between summary statistics and in-sample LD in my data. My problem is due to the process of computing LD matrix. Plink will automatically flip the alleles to make minor allele always be the effect allele, which will generate different effect alleles between summary statistics and LD matrix (thus inconsistent beta and LD).

Simply adding --keep-allele-order in plink commands will solve the problem. Another solution is to keep the computed LD and flip the beta sign when effect allele frequency is larger than 0.5.

gaow commented 2 years ago

@RL-m this is great / important to realize! PLINK 2 which is part of our (at least my) standard data processing pipeline should not have that problem ... I'm changing the title of the ticket to document what you found and close the ticket.

stephens999 commented 1 year ago

this is very helpful. We should maybe add some comment to a susie vignette that this option should be considered if LD is computed using plink?

On Wed, Jun 22, 2022 at 9:34 PM RL-m @.***> wrote:

Hi all, I solved the inconsistency error between summary statistics and in-sample LD. My problem is due to the process of computing LD matrix. Plink will automatically flip the alleles to make minor allele always be the effect allele, which will generate different effect alleles between summary statistics and LD matrix (thus inconsistent beta and LD). Simply adding --keep-allele-order in plink commands will solve the problem. Another solution is to keep the computed LD and flip the beta sign when effect allele frequency is larger than 0.5.

— Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/148#issuecomment-1163856642, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRLRZWCK6HYTW22L7KLVQPEK5ANCNFSM5IRKZYNQ . You are receiving this because you were mentioned.Message ID: @.***>

pcarbo commented 1 year ago

Thanks @RL-m. Can you tell us which version of plink you are using? (My understanding is that this is different in PLINK 2.0.)

RL-m commented 1 year ago

Thanks @RL-m. Can you tell us which version of plink you are using? (My understanding is that this is different in PLINK 2.0.)

I used PLINK 1.9 --r square spaces to generate LD matrix. To my understanding, PLINK 2.0 can only generate LD r2 between assigned SNP pair. If there is a better way of making LD matrix, please kindly tell me. Thank you!

pcarbo commented 1 year ago

PLINK is fine, but one has to be a little careful. We have also used LDStore, or, when the genotype data matrix is not too large, we load the genotypes into R and use cor.

pcarbo commented 1 year ago

I've updated the vignette with this info. Thanks!

stephenslab / susieR

LD matrix from PLINK 1.9 can be inconsistent with z-scores when --keep-allele-order is not specified #148