stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR
Other
182 stars 47 forks source link

Also encountered "The estimated prior variance is unreasonably large" using ieugwasr ld_matrix_local function to calculate the ld matrix #223

Open laleoarrow opened 7 months ago

laleoarrow commented 7 months ago

Also encountered "The estimated prior variance is unreasonably large" using ieugwasr ld_matrix_local function to calculate the ld matrix. As I have checked the original code of _ld_matrixlocal function in ieugwasr package in GitHub, when calculating the ld matrix _ld_matrixlocal does have "--keep-allele-order" argument in the plink command. But such error still exists. Please advice.

image
laleoarrow commented 7 months ago

FYI my session info: `> sessionInfo() R version 4.3.3 (2024-02-29) Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Sonoma 14.4.1

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] LDlinkR_1.3.0 plinkbinr_0.0.0.9000 ieugwasr_0.2.2-9000 TwoSampleMR_0.5.10 coloc_5.2.3
[6] vroom_1.6.5 data.table_1.15.4 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[11] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[16] ggplot2_3.5.0 tidyverse_2.0.0

loaded via a namespace (and not attached): [1] gtable_0.3.4 htmlwidgets_1.6.4 ggrepel_0.9.5 lattice_0.22-5 tzdb_0.4.0
[6] vctrs_0.6.5 tools_4.3.3 generics_0.1.3 parallel_4.3.3 curl_5.2.1
[11] fansi_1.0.6 pkgconfig_2.0.3 Matrix_1.6-5 ggnewscale_0.4.10 RcppParallel_5.1.7 [16] uuid_1.2-0 lifecycle_1.0.4 compiler_4.3.3 munsell_0.5.0 htmltools_0.5.7
[21] pillar_1.9.0 crayon_1.5.2 viridis_0.6.5 tidyselect_1.2.0 digest_0.6.34
[26] stringi_1.8.3 fastmap_1.1.1 grid_4.3.3 colorspace_2.1-0 cli_3.6.2
[31] magrittr_2.0.3 patchwork_1.2.0 Rfast_2.1.0 utf8_1.2.4 withr_3.0.0
[36] scales_1.3.0 rsthemes_0.4.0 RcppZiggurat_0.1.6 bit64_4.0.5 timechange_0.3.0
[41] httr_1.4.7 matrixStats_1.2.0 bit_4.0.5 gridExtra_2.3 hms_1.1.3
[46] plink2R_1.1 irlba_2.3.5.1 viridisLite_0.4.2 geni.plots_0.0.9 rlang_1.1.3
[51] ggiraph_0.8.9 Rcpp_1.0.12 mixsqp_0.3-54 glue_1.7.0 susieR_0.12.40
[56] rstudioapi_0.15.0 reshape_0.8.9 jsonlite_1.8.8 R6_2.5.1 plyr_1.8.9
[61] systemfonts_1.0.6 `

pcarbo commented 7 months ago

@leoarrow1 If you haven't already done so, I would recommend checking for inconsistencies between the LD matrix and the matrix z-scores following this vignette.

laleoarrow commented 7 months ago

@leoarrow1 If you haven't already done so, I would recommend checking for inconsistencies between the LD matrix and the matrix z-scores following this vignette.

Thanks for your prompt response. Yes I have check this vignette as well as the former similar post on this issue. The solution was to add the "Adding the flag --keep-allele-order" in plink. But I have checked the ieugwasr code, it does have this flag in it. So this problem might be cause due to something else? Could you perhaps spot anything unusual in my code? dataset is a standard format that is suitable for coloc analysis. (i.e., Only missing the LD matrix for color-susie)

Coloc_to_SuSiE <- function( # my self-made function to change coloc data into coloc_susie data
    dataset, 
    # ld_method = "plink", 
    bfile = "/Users/xxx/project/ref/1kg.v3/EUR"
    ) {
dataset$LD <- ieugwasr::ld_matrix_local(
                           dataset$snp,
                           bfile = bfile,
                           plink_bin = plinkbinr::get_plink_exe(),
                           with_alleles = FALSE
                           )

ld_rows <- rownames(dataset[["LD"]])
snps <- dataset[["snp"]]
index <- match(ld_rows, snps)

for (i in names(dataset)) {
    if (i %in% c("LD", "type", "s", "N")) next
    if (length(dataset[[i]]) > ld_rows_number & length(dataset[[i]]) > 1) {
      dataset[[i]] <- dataset[[i]][index]
      if (length(dataset[[i]]) == ld_rows_number) {leo_message("91", paste0("The SNP in element <", i, "> is now matched with the LD matrix"))}
    }
dataset$N <- as.integer(dataset$N)[1]
  }
  return(dataset)
}

# the error occurs here
dat = Coloc_to_SuSiE(dataset)
S3 = runsusie(dat) # here.

Any thought is highly appreciated.

pcarbo commented 7 months ago

@leoarrow1 To better pinpoint the cause of the error, I would try to run susie_rss directly instead of through coloc.

laleoarrow commented 7 months ago

@leoarrow1 To better pinpoint the cause of the error, I would try to run susie_rss directly instead of through coloc.

Thanks! I will try so as soon as possible and let you know the result.

laleoarrow commented 7 months ago

Hi! I have tried running susie_rss directly which did yielded a result/ But the it did not converge as well, and this outcome is evidently incorrect (there is obviously too many cs?).

image

Upon reviewing the vignette, I identified inconsistencies in my susie_rss input data but, regrettably, I cannot locate their source.

image

I have checked the order of z and ld (rowname of ld) provided for susie_rss is in the same order though. Here is the code I used for susie_rss. Can you spot anything incorrectly? Any help is appreciated!

variants = dataset$snp 
# > head(variants)
# [1] "rs112438356" "rs112327275" "rs9270475"   "rs77122291"  "rs9270487"   "rs9270493" 
bfile = "/Users/leoarrow/project/ref/1kg.v3/EUR" # downloaded from https://mrcieu.github.io/ieugwasr/articles/local_ld.html
plink_bin = plinkbinr::get_plink_exe()

dataset$LD  <- ieugwasr::ld_matrix_local(
                           dataset$snp,
                           bfile = bfile,
                           plink_bin = plinkbinr::get_plink_exe(),
                           with_alleles = FALSE
                           )

ld_rows <- rownames(dataset[["LD"]]); ld_rows_number <- length(ld_rows)
snps <- dataset[["snp"]]; input_snps_number <- length(snps)
index <- match(ld_rows, snps)

susie_fit = susie_rss(dataset$z, dataset$LD, n=n)
susie_plot(susie_fit, y='PIP')
# lambda = estimate_s_rss(dataset$z, dataset$LD, n=n)
# lambda
condz = kriging_rss(dataset$z, dataset$LD, n=n)
condz$plot
laleoarrow commented 7 months ago

FYI, the source code of ieugwasr::ld_matrix_local is as below for ld matrix calculation. It use plink for calculation but it does have the --keep-allele-order tag in it: (Ref:https://github.com/MRCIEU/ieugwasr/blob/master/R/ld_matrix.R)

ld_matrix_local <- function(variants, bfile, plink_bin, with_alleles=TRUE) {
    # Make textfile
    shell <- ifelse(Sys.info()['sysname'] == "Windows", "cmd", "sh")
    fn <- tempfile()
    write.table(data.frame(variants), file=fn, row.names=FALSE, col.names=FALSE, quote=FALSE)

    fun1 <- paste0(
        shQuote(plink_bin, type=shell),
        " --bfile ", shQuote(bfile, type=shell),
        " --extract ", shQuote(fn, type=shell), 
        " --make-just-bim ", 
        " --keep-allele-order ",
        " --out ", shQuote(fn, type=shell)
    )
    system(fun1)

    bim <- read.table(paste0(fn, ".bim"), stringsAsFactors=FALSE)

    fun2 <- paste0(
        shQuote(plink_bin, type=shell),
        " --bfile ", shQuote(bfile, type=shell),
        " --extract ", shQuote(fn, type=shell), 
        " --r square ", 
        " --keep-allele-order ",
        " --out ", shQuote(fn, type=shell)
    )
    system(fun2)
    res <- read.table(paste0(fn, ".ld"), header=FALSE) %>% as.matrix
    if(with_alleles)
    {
        rownames(res) <- colnames(res) <- paste(bim$V2, bim$V5, bim$V6, sep="_")
    } else {
        rownames(res) <- colnames(res) <- bim$V2
    }
    return(res)
}

It looks all right to me. Or is there any other option to do LD matrix calculation? PS. I'm using gwas summary data and I do not have the genotype data. Also coloc recommend to use 1000 or above snps in the coloc loci, so LD matrix can only be calculated using local method?

pcarbo commented 7 months ago

@leoarrow1 I'm not familiar with the ieugwasr R package (I have not used it before), but if you compute the LD matrix from the same genotype data used in computing the association statistics (e.g., z-scores), there should be no inconsistencies in your kriging_rss plot. On the other hand, if the LD matrix is computed using different genotype data, then it is quite possible to obtain large inconsistencies. Removing the largest inconsistencies may involve removing some SNPs, but we have not (yet) developed a systematic procedure for removing large inconsistencies. Hope this helps (a bit).

laleoarrow commented 7 months ago

Is there a recommended method to compute LD matrix in ref panel for summary statistics? Thanks

--------------原始邮件-------------- 发件人:"Peter Carbonetto @.>; 发送时间:2024年4月16日(星期二) 凌晨3:46 收件人:"stephenslab/susieR" @.>; 抄送:"leoarrow1 @.>;"Mention @.>; 主题:Re: [stephenslab/susieR] Also encountered "The estimated prior variance is unreasonably large" using ieugwasr ld_matrix_local function to calculate the ld matrix (Issue #223)

@leoarrow1 I'm not familiar with the ieugwasr R package (I have not used it before), but if you compute the LD matrix from the same genotype data used in computing the association statistics (e.g., z-scores), there should be no inconsistencies in your kriging_rss plot. On the other hand, if the LD matrix is computed using different genotype data, then it is quite possible to obtain large inconsistencies. Removing the largest inconsistencies may involve removing some SNPs, but we have not (yet) developed a systematic procedure for removing large inconsistencies. Hope this helps (a bit).

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

pcarbo commented 7 months ago

@leoarrow1 It would depend whether you have access to the same data that was used to generate the association statistics. If the answer is no, then we do not have precise recommendations except that you should work with genotype data that most closely resembles the data used to generate the association statistics. Perhaps @gaow can provide more detailed recommendations; I believe he is working on bioinformatics pipelines to help with this.

laleoarrow commented 7 months ago

Thanks for reply we do not have Original genome type individual data, but I guess we can use the reference individual ref panel from 1 kg, which is frequently used as reference data in post gwas analysis?  Message ID: @.***>

akhilpampana commented 7 months ago

Hello, I am also facing a similar issue with ld matrix inconsistency. I have used PLINK to generate LD matrices and kept --keep-allele-order. I am using the meta effect of gwas analysis and using 1000g as a reference for LD. My correlation plot is off. Can you help me out with this issue? Here is the plot for the reference. I have used a 1000g complete dataset as the gwas results are from multi-ancestry datasets. image

I have also noticed a thing like variant which is shown to have significant tophit for the loci is not available in the credible set. Is it the opposite that we should have that top variant in the credible set?

Regards Akhil

stephens999 commented 7 months ago

To debug your pipeline i would suggest hand checking some example snp pairs. Eg find some snp pairs that have high ld in your computer matrix but very different z scores. Such pairs are obviously problematic. Maybe finding them will help you identify your problem.

Sorry, but i hope you understand we don't have time to help debug all user pipelines.

Matthew

On Fri, Apr 19, 2024, 12:59 AkhilP @.***> wrote:

Hello, I am also facing a similar issue with ld matrix inconsistency. I have used PLINK to generate LD matrices and kept --keep-allele-order. I am using the meta effect of gwas analysis and using 1000g as a reference for LD. My correlation plot is off. Can you help me out with this issue? Here is the plot for the reference. I have used a 1000g complete dataset as the gwas results are from multi-ancestry datasets. image.png (view on web) https://github.com/stephenslab/susieR/assets/19439352/261d6b70-ef29-4e76-9bbc-af1123afade1

I have also noticed a thing like variant which is shown to have significant tophit for the loci is not available in the credible set. Is it the opposite that we should have that top variant in the credible set?

Regards Akhil

— Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/223#issuecomment-2067049020, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRLAFQN7QVY3MJIMKZTY6FLRXAVCNFSM6AAAAABGBPRATGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGA2DSMBSGA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

akhilpampana commented 7 months ago

Thank you so much for the response. I understand its very difficult to focus on all pipelines given the tool usage and appreciate your kind response.

I just want to know if the significant variant in the loci shouldl be present in the credible set or not.

Regards Akhil

laleoarrow commented 7 months ago

Hi, quick update for all. I have solved my issue (partially at least). I used ieugwasr package to calculate my ld matrix for a GWAS summary data using 1kg individual data as a reference. ieugwasr incorporated the plink 1.9 when calculating ld matrix and it does have the -keep-allele-order flag in it, however, the allele order would still possible be inconsistent with allele order in your own summary data. So the solution is to flip the allele order when needed. This should at least get you to converge in 100 iteration and a better z-score plot.

Nonetheless, I identified 9 cs in a loci with only 200+ snp. Is this possible? Here is the cs plot and z-score plot. Can anyone spot anything unusual in it? Thanks! 69181713446954_ pic 69111713446843_ pic

laleoarrow commented 7 months ago

Thank you so much for the response. I understand its very difficult to focus on all pipelines given the tool usage and appreciate your kind response.

I just want to know if the significant variant in the loci shouldl be present in the credible set or not.

Regards Akhil

The top snp should be in the strongest cs from my perspective.

gaow commented 7 months ago

@leoarrow1 @akhilpampana these problems typically has to do with problematic bioinformatics steps to unify the alleles between summary statistics and LD. The discrepancy between z-scores and LD reference is also a secondary concern.

We do have some pipelines that are still work in progress, to try cope with these issues. It contains two parts:

  1. we unify or harmonize the alleles between data-sets to the best of our knowledge.
  2. we apply some approaches to detect and remove variants where LD and z-scores have large discrepancies.

the 2nd item is still work in progress not ready for public use, although we have a separate branch of susieR to implement the prototype. The entire procedure is outlined here althought at this point it might be the most helpful to compare notes of your procedure with our basic QC for summary statistics.

akhilpampana commented 7 months ago

I think I have figured out my issue regarding the discrepancy of the plot but I will checkthrough the tutorial mentioned above just to be sure and test the analysis in the updated version of the ld matrix.

Thank you so much @gaow for the pipeline and @leoarrow1 for the suggestion

Regards Akhil

pcarbo commented 7 months ago

Thank you for sharing @akhilpampana. If you could share in more detail what was the issue and how you solved it, this might be helpful for us (and others) to know.

pcarbo commented 7 months ago

The top snp should be in the strongest cs from my perspective.

This will often be true, but sometimes not. See Fig. 1 of the susie paper for an illustration of this.