stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR
Other
169 stars 42 forks source link

fine-mapping with BOLT-LMM GWAS summary statistics #199

Closed xliaoyi closed 9 months ago

xliaoyi commented 9 months ago

Dear SuSiE Team,

Thanks for developing such a fantastic package and for the rigorous statistical work that underpins it.

I have a question about using SuSiE for fine-mapping with GWAS summary statistics generated by BOLT-LMM. When I run diagnostics using a matched LD matrix, I notice that the observed z-scores are not perfectly correlated with the expected z-scores. I suspect this discrepancy may be due to the covariates I included in the BOLT-LMM analysis. In light of this, can the fine-mapped credible sets still be considered reliable?

Here is the diagnosis plot: kriging_rss_plot

Here is my code:

library(susieR)
library(tidyverse)

set.seed(1)

ld_matrix <- read.csv("SUSIE_OUTPUT/susiex_both/test_ld_matrix.ld", sep = '\t', header = F)
ld_matrix <- as.matrix(ld_matrix)

# get phenotype data
pheno <- read.csv("GWAS_INPUT_DATA/hip_select_pheno_gwas_cm_230713.txt", sep = " ")

# get filtered gwas sumstats
sumstats <- read.csv("CODE/SUSIE_CODE/filtered_gwas_sumstats.txt", sep = "\t")
z_scores <- sumstats$BETA / sumstats$SE
susie_plot(z_scores, y = "z")

res <- kriging_rss(z_scores,ld_matrix)
plot(res$plot)

lambda = estimate_s_rss(z_scores, ld_matrix, n=30324)

fitted_rss1 <- susie_rss(bhat = sumstats$BETA, shat = sumstats$SE, n = 30324, R = ld_matrix, var_y = var(pheno$pelvic_inlet_width), L = 10,
                         estimate_residual_variance = TRUE)

summary(fitted_rss1)$cs
susie_plot(fitted_rss1, y="PIP")

Your tutorial suggests that using individual-level data yields better results than summary statistics. I am interested in exploring this approach but am unclear about the structure of the genotype matrix as described in your tutorial. Is it a matrix with patient IDs as row names, SNPs as column names, and filled with values 0, 1, or 2 to indicate the genotype? If so, how should I handle missing values in this matrix? Additionally, should I first residualize the phenotype y using the same covariates that were included in the BOLT-LMM analysis, given that the lead SNPs are derived from BOLT-LMM results?

I look forward to your guidance on these matters.

Best regards, Liaoyi

pcarbo commented 9 months ago

I suspect this discrepancy may be due to the covariates I included in the BOLT-LMM analysis.

We haven't systematically studied the use of LMM summary statistics in susie, but it might help to instead use the LD matrix that "corrects" (removes the linear effect of) the covariates.

xliaoyi commented 9 months ago

I suspect this discrepancy may be due to the covariates I included in the BOLT-LMM analysis.

We haven't systematically studied the use of LMM summary statistics in susie, but it might help to instead use the LD matrix that "corrects" (removes the linear effect of) the covariates.

Thanks for your prompt reply! May I ask is there any software can generate such LD matrix which has covariates corrected? The LD matrix I am using was generated by plink.

stephens999 commented 9 months ago

Is it a matrix with patient IDs as row names, SNPs as column names, and filled with values 0, 1, or 2 to indicate the genotype?

Yes

If so, how should I handle missing values in this matrix?

Impute missing genotypes using impute, beagle etc

Additionally, should I first residualize the phenotype y using the same covariates that were included in the BOLT-LMM analysis, given that the lead SNPs are derived from BOLT-LMM results?

Probably a good idea yes