stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR

Susie_rss determining number of credible sets for region with high LD and large number of expected credible sets #151

Open carolynmcgrail opened 2 years ago

carolynmcgrail commented 2 years ago

Hello,

I have been using susieR to fine-map summary statistics in the MHC region using imputed genotypes and a matching LD matrix. This region has many variants with particularly large effects/z-scores, and when running the LD diagnostic, the plot looks relatively linear (image below), but the s fit is between 0.1 and 0.2. Because the z-scores can be upward of 40, could this cause the observed vs. expected z-scores to have a larger variance (e.g., some of the discrepancies are from an even larger expected z-score)? What would be considered a good fit for the LD matrix, given that it is from the same cohort as the summary statistics? The LD matrix was made with PLINK, and I have checked for allele flips, which do not appear to be a problem. I am not sure why the LD would not be a better fit given it is from the same cohort, and I think this may be causing issues with the credible sets.
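
For reference, this is roughly how I am running the diagnostic (a minimal sketch; sumstats$zscore and R are my z-scores and LD matrix):

library(susieR)
s <- estimate_s_rss(sumstats$zscore, R)          # s near 0 would indicate a good LD fit
cond_dist <- kriging_rss(sumstats$zscore, R, s)  # observed vs. expected z-scores
cond_dist$plot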

While running susie_rss, the standard L=10 produces many credible sets with only one variant, but those with multiple variants exhibit the same or nearly identical z-scores and have a high mean.abs.corr (either 1.0 or > 0.95). The z-scores do appear unique within each credible set, so I wonder whether, due to the imperfect LD matrix, different credible sets may contain variants from the same signal. I have also tried L = 15, 20, 25; these produce similar credible sets, with the total number of credible sets equal to L and still a high absolute correlation within each credible set. I have noticed that as I increase L, there are more credible sets with log10BF = Inf, whereas with L=10, only the first credible set, with the highest z-score of 40, has log10BF = Inf. As I increase L, I have also increased max_iter, and these runs have all converged, but at a higher number of iterations, in the thousands.

What parameters would help determine an ideal number of credible sets for regions like the MHC, with large effects and high LD, where a large number of credible sets is expected? I have been trying to increase L until the absolute correlation/purity approaches 0.5, but is there a better way to determine the ideal number to ensure the credible sets capture all the unique signals in the region? Thank you for your help!

Thank you, Carolyn

[Figure: susieR expected vs. observed z-scores diagnostic plot]

pcarbo commented 2 years ago

@carolynmcgrail It is possible there are some numerical issues here due to (1) large effects, and (2) complex LD. We don't have much experience fine-mapping the MHC with susie but in principle it should be possible. Could you please share some snippets of code so we have a better idea of how you are calling susie_rss?

carolynmcgrail commented 2 years ago

@pcarbo for susie_rss, I am calling:

susie_rss(sumstats$zscore, R, L = 35, max_iter = 50000, r_tol = 1e-04, min_abs_corr = 0.01, verbose = TRUE, check_prior = FALSE)

The PLINK LD matrix seems to have some small negative eigenvalues, but they pass this r_tol threshold. Also, with the updated susieR, I am receiving the error "The estimated prior variance is unreasonably large", so I added check_prior = FALSE for it to run. Is this also due to the large z-scores?
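
For instance, this is how I checked the eigenvalues (a quick sketch; R is the LD matrix in my R session):

evals <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
min(evals)            # small negative values mean R is not quite positive semi-definite
sum(evals < -1e-04)   # none fall below the r_tol = 1e-04 threshold used above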

Overall, there are credible sets that pick up well-known variants in the region, but I am curious whether there is a way to optimize the LD matrix and the L parameter to ensure coverage and accuracy of the credible sets in the region. I have set L to 35 in this example, as that is the number of stepwise conditional signals I have seen in this dataset. Thank you!

stephens999 commented 2 years ago

@carolynmcgrail when you say:

What would be considered a good fit for the LD matrix considering it is from the same cohort as the summary statistics.

what do you mean by the "same cohort"? Do you mean the exact same samples (i.e., what we call an "in-sample" LD matrix)? If this is the in-sample matrix then I think there is a problem somewhere in computing either the matrix or the z-scores --- from experience, the in-sample LD matrix should provide a much better fit to the z-scores, with s approximately 0.

The error you are getting, "The estimated prior variance is unreasonably large", does not have to do with large z-scores; rather, it indicates a mismatch between the LD matrix and the summary data. I would not trust the results at all...

Can you get the in-sample LD matrix or run susie on the raw data rather than using summary data?

carolynmcgrail commented 2 years ago

@stephens999 I have computed the LD matrix in two ways, both in-sample, using the exact same samples as the summary statistics (which were run with EPACTS). The first way is with PLINK bfiles, calling the same allele order as the summary statistics:

plink --bfile data --r --a2-allele data.bim 6 2 --matrix --out data_ld
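
I then load the resulting square matrix into R along these lines (a sketch; the file name is assumed from the --out prefix above):

R <- as.matrix(read.table("data_ld.ld"))  # square LD (correlation) matrix written by PLINK
dimnames(R) <- NULL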

I tried another way, described by aob93 in issue https://github.com/stephenslab/susieR/issues/105, using the plink .raw file; this looked very similar in the s diagnostic fit (s ~ 0.2) and in the correlation values:

plink <- read.table(file = "data.plink.raw", header = TRUE)
N <- dim(plink)[1]
X <- as.matrix(plink[,-c(1:6)])             # drop the 6 leading non-genotype columns
X <- scale(X, center = TRUE, scale = TRUE)  # center and standardize each SNP
ld <- t(X) %*% X / N                        # sample correlation matrix (up to a factor of (N-1)/N)

Would there be an issue with the summary-data z-scores? The beta effects appear to match the literature for known risk variants. The observed vs. expected z-scores tend to diverge more towards the extreme ends when running the LD diagnostic. Thank you.

gaow commented 2 years ago

@carolynmcgrail doing what you described with PLINK is a good move, but it is still not 100% safe against issues such as having an A/T or C/G allele AND a strand flip, in which case it is hard to determine whether the order of the genotype data actually matches the summary stats. However, if you use in-sample LD it should be very safe.

Since you seem to have the actual genotype and phenotype data, is it possible for you to try the susie() function instead, with inputs X and y (after regressing out covariates, if any)?

Another diagnostic for your result: with check_prior = FALSE you get results, so run susieR::get_cs_correlation() to check the correlation between your CSs (I assume many of your CSs have one variant anyway). If you see, for example, that the correlation between two of your CSs is near -1, but the z-scores for the variants in those CSs are very similar to each other, you will know there is a mismatch due to an allele flip, one of the most common mismatches we see when using external reference LD. I'm curious why things still do not work when you use in-sample LD, but this check might give us some ideas.
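
Something along these lines (a sketch; fit would be your susie_rss result and R your LD matrix):

cs_corr <- get_cs_correlation(fit, Xcorr = R)
cs_corr  # entries near -1 between CSs with very similar z-scores point to allele flips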

stephens999 commented 2 years ago

I also don't know what the problem is. I agree with @gaow that running susie with the original genotypes and phenotypes is safest. But I am curious to know what the problem is...

For your summary data, did you correct for any covariates/relatedness, and/or use a logistic regression model? If so, I think it would be helpful to try computing the summary data without correcting for covariates, and using a linear regression model (even if your phenotype is binary), just to check whether you still have these problems.
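
For example, the linear-model summary data could be computed directly in R with susieR's univariate_regression (a sketch; X and y are the genotype matrix and 0/1 phenotype, with no covariates regressed out):

ss <- univariate_regression(X, y)
z  <- ss$betahat / ss$sebetahat  # simple linear-regression z-scores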

carolynmcgrail commented 2 years ago

Thank you for your feedback. I ran a logistic regression with sex and the first 4 PCs as population covariates. I will try to run the susie function with the genotype data, without these covariates first. I assume the matrix X is the same as how I generated it above from the raw plink file, but from the documentation I was confused about what the y parameter (the observed responses) would be for a binary phenotype. Is it just 1 or 0 for disease or control for each sample? Thank you.

stephens999 commented 2 years ago

Yes, that is correct: y would be 1 or 0 for case/control.

pcarbo commented 2 years ago

One idea: for troubleshooting, you could try loading part of the genotype data into R (a subset of samples or SNPs, or both), using compute_suff_stat to compute the LD matrix, and comparing it to the LD matrix computed using PLINK.
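
Something like this (a sketch; R_plink would be the LD matrix you computed with PLINK, and X, y the objects from your earlier code):

ss <- compute_suff_stat(X, y)
R_susie <- cov2cor(ss$XtX)
max(abs(R_susie - R_plink))  # large entries here would point to a mismatch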

carolynmcgrail commented 2 years ago

Thank you all for your input. I reran everything with susie() using the genotype and phenotype matrices as inputs, both with and without a covariate matrix. It ran much faster (a few hours vs. days), but I am still not sure how to determine at which value of L the results would be considered a good fit for the data. The model converges at L=10, but when setting L to 50+, it produces 37 credible sets (PIP plot below). Of the 37 credible sets, the first two have LNBF = Inf, but these are the most significant and have z-scores > 60. Otherwise, the LNBFs range from 4 to 206, and while 17/37 credible sets contain only one variant, more now contain quite a few variants. The mean.abs.corr is still typically either 1.0 or 0.99, but in the last 5 CSs it ranges from 0.75 to 0.95.

[Figure: susie PIP plot with 37 credible sets]

I also followed @pcarbo's suggestion, running compute_suff_stat and constructing the LD matrix with:

suff_stat_LD_standard <- compute_suff_stat(X, y)
R <- with(suff_stat_LD_standard, cov2cor(XtX))
s_suff_LD <- estimate_s_rss(sumstats$zscore, R)
plot_suff_LD <- kriging_rss(sumstats$zscore, R, s_suff_LD)

The s statistic s_suff_LD for this fit was similar to that of the PLINK matrix: 0.23 (s_suff_LD) vs. 0.21 (s_plink). The plots also looked nearly identical, so I think the LD matrix made with PLINK is very close to the LD matrix made within susieR. (Below is the s_suff_LD plot.)

[Figure: s_suff_LD diagnostic plot from kriging_rss]

Some of the variants and credible sets obtained from the susie regression overlap with those from the summary statistics, but I would be curious to hear your thoughts on why they differ. Thank you again for your help!

gaow commented 2 years ago

@carolynmcgrail thank you, this is very helpful! I wonder if you could post the exact code you used to generate the results in the first figure?

carolynmcgrail commented 2 years ago

Of course,

From the plink .raw file, I generated the scaled and centered X genotype matrix as above, along with the phenotype matrix:

plink <- read.table(file = "data.plink.raw", header = TRUE)
X <- as.matrix(plink[,-c(1:6)])             # genotype columns
X <- scale(X, center = TRUE, scale = TRUE)
y <- as.matrix(plink[, c(6:6)])             # column 6 of the .raw file is the phenotype

I then regressed out the covariate matrix Z (sex and the first 4 PCs) and used out$X and out$y as inputs for the regression with 99% CSs (very similar to the 95% CS coverage I ran prior to this):

library(Matrix)  # needed for forceSymmetric()

remove.covariate.effects <- function (X, Z, y) {
  # Include the intercept term in Z if it is not already there.
  if (any(Z[,1] != 1)) Z = cbind(1, Z)
  A   <- forceSymmetric(crossprod(Z))
  SZy <- as.vector(solve(A, c(y %*% Z)))
  SZX <- as.matrix(solve(A, t(Z) %*% X))
  y <- y - c(Z %*% SZy)
  X <- X - Z %*% SZX
  return(list(X = X, y = y, SZy = SZy, SZX = SZX))
}

out = remove.covariate.effects(X, Z, y[,1])
res_37_99percent = susie(out$X, out$y, L = 37, coverage = 0.99)
susie_plot(res_37_99percent, "PIP")

stephens999 commented 2 years ago

Hi Carolyn, this is very interesting. Regarding:

I am still not sure how to determine at which value of L=credible sets the results would be considered a good fit for the data.

In principle SuSiE does that for you, as long as you set L "big enough". In your case you saw that L=10 was not big enough, so you correctly increased L, and got a result with 37 credible sets. Given that you used L=50 and got fewer than 50 sets, this suggests L=50 was "big enough".
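
In other words, you can treat L as an upper bound and check the number of CSs actually returned (a sketch, reusing out$X and out$y from your code above):

fit <- susie(out$X, out$y, L = 50)
length(fit$sets$cs)  # 37 < 50 here, so increasing L further should not add CSs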

However, your results are quite unusual: I have never seen a fine-mapping result with so many CSs, so we should be a little cautious about the results. I think it would be prudent to run some simulations matched to your (unusual) data and run susie on those simulations to check that the results make sense. We have done a lot of simulations, but not with 37 effects in a region (and also not many with a binary phenotype, which is what you have here).
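
As a rough illustration of what I have in mind (a sketch only; X is your genotype matrix, and the number of effects and effect sizes here are arbitrary):

set.seed(1)
idx   <- sample(ncol(X), 37)                     # 37 "causal" variants
b     <- rnorm(37, sd = 0.5)                     # arbitrary effect sizes
y_sim <- drop(X[, idx] %*% b) + rnorm(nrow(X))   # simulated quantitative trait
fit_sim <- susie(X, y_sim, L = 50)
length(fit_sim$sets$cs)                          # how many CSs does susie recover?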

We also have binary versions of susie in development that we could try out here, to see whether that makes a difference. If you are able to share your genotypes in this region with us (no need for phenotypes), we could help with the simulations. If that interests you, please follow up with me by email.

gaow commented 2 years ago

@carolynmcgrail sorry I dropped this yesterday -- I meant to follow up along similar lines, as we may be able to try something on our end as well -- please include me in the email if you send one out :)