Open laleoarrow opened 7 months ago
FYI my session info: `> sessionInfo() R version 4.3.3 (2024-02-29) Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Sonoma 14.4.1
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Asia/Shanghai tzcode source: internal
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] LDlinkR_1.3.0 plinkbinr_0.0.0.9000 ieugwasr_0.2.2-9000 TwoSampleMR_0.5.10 coloc_5.2.3
[6] vroom_1.6.5 data.table_1.15.4 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[11] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[16] ggplot2_3.5.0 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.4 htmlwidgets_1.6.4 ggrepel_0.9.5 lattice_0.22-5 tzdb_0.4.0
[6] vctrs_0.6.5 tools_4.3.3 generics_0.1.3 parallel_4.3.3 curl_5.2.1
[11] fansi_1.0.6 pkgconfig_2.0.3 Matrix_1.6-5 ggnewscale_0.4.10 RcppParallel_5.1.7
[16] uuid_1.2-0 lifecycle_1.0.4 compiler_4.3.3 munsell_0.5.0 htmltools_0.5.7
[21] pillar_1.9.0 crayon_1.5.2 viridis_0.6.5 tidyselect_1.2.0 digest_0.6.34
[26] stringi_1.8.3 fastmap_1.1.1 grid_4.3.3 colorspace_2.1-0 cli_3.6.2
[31] magrittr_2.0.3 patchwork_1.2.0 Rfast_2.1.0 utf8_1.2.4 withr_3.0.0
[36] scales_1.3.0 rsthemes_0.4.0 RcppZiggurat_0.1.6 bit64_4.0.5 timechange_0.3.0
[41] httr_1.4.7 matrixStats_1.2.0 bit_4.0.5 gridExtra_2.3 hms_1.1.3
[46] plink2R_1.1 irlba_2.3.5.1 viridisLite_0.4.2 geni.plots_0.0.9 rlang_1.1.3
[51] ggiraph_0.8.9 Rcpp_1.0.12 mixsqp_0.3-54 glue_1.7.0 susieR_0.12.40
[56] rstudioapi_0.15.0 reshape_0.8.9 jsonlite_1.8.8 R6_2.5.1 plyr_1.8.9
[61] systemfonts_1.0.6 `
@leoarrow1 If you haven't already done so, I would recommend checking for inconsistencies between the LD matrix and the matrix z-scores following this vignette.
@leoarrow1 If you haven't already done so, I would recommend checking for inconsistencies between the LD matrix and the matrix z-scores following this vignette.
Thanks for your prompt response. Yes I have check this vignette as well as the former similar post on this issue. The solution was to add the "Adding the flag --keep-allele-order" in plink. But I have checked the ieugwasr code, it does have this flag in it. So this problem might be cause due to something else? Could you perhaps spot anything unusual in my code? dataset
is a standard format that is suitable for coloc analysis. (i.e., Only missing the LD matrix for color-susie)
Coloc_to_SuSiE <- function( # my self-made function to change coloc data into coloc_susie data
dataset,
# ld_method = "plink",
bfile = "/Users/xxx/project/ref/1kg.v3/EUR"
) {
dataset$LD <- ieugwasr::ld_matrix_local(
dataset$snp,
bfile = bfile,
plink_bin = plinkbinr::get_plink_exe(),
with_alleles = FALSE
)
ld_rows <- rownames(dataset[["LD"]])
snps <- dataset[["snp"]]
index <- match(ld_rows, snps)
for (i in names(dataset)) {
if (i %in% c("LD", "type", "s", "N")) next
if (length(dataset[[i]]) > ld_rows_number & length(dataset[[i]]) > 1) {
dataset[[i]] <- dataset[[i]][index]
if (length(dataset[[i]]) == ld_rows_number) {leo_message("91", paste0("The SNP in element <", i, "> is now matched with the LD matrix"))}
}
dataset$N <- as.integer(dataset$N)[1]
}
return(dataset)
}
# the error occurs here
dat = Coloc_to_SuSiE(dataset)
S3 = runsusie(dat) # here.
Any thought is highly appreciated.
@leoarrow1 To better pinpoint the cause of the error, I would try to run susie_rss
directly instead of through coloc
.
@leoarrow1 To better pinpoint the cause of the error, I would try to run
susie_rss
directly instead of throughcoloc
.
Thanks! I will try so as soon as possible and let you know the result.
Hi! I have tried running susie_rss directly which did yielded a result/ But the it did not converge as well, and this outcome is evidently incorrect (there is obviously too many cs?).
Upon reviewing the vignette, I identified inconsistencies in my susie_rss input data but, regrettably, I cannot locate their source.
I have checked the order of z and ld (rowname of ld) provided for susie_rss is in the same order though. Here is the code I used for susie_rss
. Can you spot anything incorrectly? Any help is appreciated!
variants = dataset$snp
# > head(variants)
# [1] "rs112438356" "rs112327275" "rs9270475" "rs77122291" "rs9270487" "rs9270493"
bfile = "/Users/leoarrow/project/ref/1kg.v3/EUR" # downloaded from https://mrcieu.github.io/ieugwasr/articles/local_ld.html
plink_bin = plinkbinr::get_plink_exe()
dataset$LD <- ieugwasr::ld_matrix_local(
dataset$snp,
bfile = bfile,
plink_bin = plinkbinr::get_plink_exe(),
with_alleles = FALSE
)
ld_rows <- rownames(dataset[["LD"]]); ld_rows_number <- length(ld_rows)
snps <- dataset[["snp"]]; input_snps_number <- length(snps)
index <- match(ld_rows, snps)
susie_fit = susie_rss(dataset$z, dataset$LD, n=n)
susie_plot(susie_fit, y='PIP')
# lambda = estimate_s_rss(dataset$z, dataset$LD, n=n)
# lambda
condz = kriging_rss(dataset$z, dataset$LD, n=n)
condz$plot
FYI, the source code of ieugwasr::ld_matrix_local is as below for ld matrix calculation. It use plink for calculation but it does have the --keep-allele-order
tag in it: (Ref:https://github.com/MRCIEU/ieugwasr/blob/master/R/ld_matrix.R)
ld_matrix_local <- function(variants, bfile, plink_bin, with_alleles=TRUE) {
# Make textfile
shell <- ifelse(Sys.info()['sysname'] == "Windows", "cmd", "sh")
fn <- tempfile()
write.table(data.frame(variants), file=fn, row.names=FALSE, col.names=FALSE, quote=FALSE)
fun1 <- paste0(
shQuote(plink_bin, type=shell),
" --bfile ", shQuote(bfile, type=shell),
" --extract ", shQuote(fn, type=shell),
" --make-just-bim ",
" --keep-allele-order ",
" --out ", shQuote(fn, type=shell)
)
system(fun1)
bim <- read.table(paste0(fn, ".bim"), stringsAsFactors=FALSE)
fun2 <- paste0(
shQuote(plink_bin, type=shell),
" --bfile ", shQuote(bfile, type=shell),
" --extract ", shQuote(fn, type=shell),
" --r square ",
" --keep-allele-order ",
" --out ", shQuote(fn, type=shell)
)
system(fun2)
res <- read.table(paste0(fn, ".ld"), header=FALSE) %>% as.matrix
if(with_alleles)
{
rownames(res) <- colnames(res) <- paste(bim$V2, bim$V5, bim$V6, sep="_")
} else {
rownames(res) <- colnames(res) <- bim$V2
}
return(res)
}
It looks all right to me. Or is there any other option to do LD matrix calculation? PS. I'm using gwas summary data and I do not have the genotype data. Also coloc recommend to use 1000 or above snps in the coloc loci, so LD matrix can only be calculated using local method?
@leoarrow1 I'm not familiar with the ieugwasr
R package (I have not used it before), but if you compute the LD matrix from the same genotype data used in computing the association statistics (e.g., z-scores), there should be no inconsistencies in your kriging_rss
plot. On the other hand, if the LD matrix is computed using different genotype data, then it is quite possible to obtain large inconsistencies. Removing the largest inconsistencies may involve removing some SNPs, but we have not (yet) developed a systematic procedure for removing large inconsistencies. Hope this helps (a bit).
@leoarrow1 I'm not familiar with the ieugwasr R package (I have not used it before), but if you compute the LD matrix from the same genotype data used in computing the association statistics (e.g., z-scores), there should be no inconsistencies in your kriging_rss plot. On the other hand, if the LD matrix is computed using different genotype data, then it is quite possible to obtain large inconsistencies. Removing the largest inconsistencies may involve removing some SNPs, but we have not (yet) developed a systematic procedure for removing large inconsistencies. Hope this helps (a bit).
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>
@leoarrow1 It would depend whether you have access to the same data that was used to generate the association statistics. If the answer is no, then we do not have precise recommendations except that you should work with genotype data that most closely resembles the data used to generate the association statistics. Perhaps @gaow can provide more detailed recommendations; I believe he is working on bioinformatics pipelines to help with this.
Thanks for reply we do not have Original genome type individual data, but I guess we can use the reference individual ref panel from 1 kg, which is frequently used as reference data in post gwas analysis? Message ID: @.***>
Hello, I am also facing a similar issue with ld matrix inconsistency. I have used PLINK to generate LD matrices and kept --keep-allele-order. I am using the meta effect of gwas analysis and using 1000g as a reference for LD. My correlation plot is off. Can you help me out with this issue? Here is the plot for the reference. I have used a 1000g complete dataset as the gwas results are from multi-ancestry datasets.
I have also noticed a thing like variant which is shown to have significant tophit for the loci is not available in the credible set. Is it the opposite that we should have that top variant in the credible set?
Regards Akhil
To debug your pipeline i would suggest hand checking some example snp pairs. Eg find some snp pairs that have high ld in your computer matrix but very different z scores. Such pairs are obviously problematic. Maybe finding them will help you identify your problem.
Sorry, but i hope you understand we don't have time to help debug all user pipelines.
Matthew
On Fri, Apr 19, 2024, 12:59 AkhilP @.***> wrote:
Hello, I am also facing a similar issue with ld matrix inconsistency. I have used PLINK to generate LD matrices and kept --keep-allele-order. I am using the meta effect of gwas analysis and using 1000g as a reference for LD. My correlation plot is off. Can you help me out with this issue? Here is the plot for the reference. I have used a 1000g complete dataset as the gwas results are from multi-ancestry datasets. image.png (view on web) https://github.com/stephenslab/susieR/assets/19439352/261d6b70-ef29-4e76-9bbc-af1123afade1
I have also noticed a thing like variant which is shown to have significant tophit for the loci is not available in the credible set. Is it the opposite that we should have that top variant in the credible set?
Regards Akhil
— Reply to this email directly, view it on GitHub https://github.com/stephenslab/susieR/issues/223#issuecomment-2067049020, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANXRRLAFQN7QVY3MJIMKZTY6FLRXAVCNFSM6AAAAABGBPRATGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGA2DSMBSGA . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Thank you so much for the response. I understand its very difficult to focus on all pipelines given the tool usage and appreciate your kind response.
I just want to know if the significant variant in the loci shouldl be present in the credible set or not.
Regards Akhil
Hi, quick update for all. I have solved my issue (partially at least). I used ieugwasr package to calculate my ld matrix for a GWAS summary data using 1kg individual data as a reference. ieugwasr incorporated the plink 1.9 when calculating ld matrix and it does have the -keep-allele-order
flag in it, however, the allele order would still possible be inconsistent with allele order in your own summary data. So the solution is to flip the allele order when needed. This should at least get you to converge in 100 iteration and a better z-score plot.
Nonetheless, I identified 9 cs in a loci with only 200+ snp. Is this possible? Here is the cs plot and z-score plot. Can anyone spot anything unusual in it? Thanks!
Thank you so much for the response. I understand its very difficult to focus on all pipelines given the tool usage and appreciate your kind response.
I just want to know if the significant variant in the loci shouldl be present in the credible set or not.
Regards Akhil
The top snp should be in the strongest cs from my perspective.
@leoarrow1 @akhilpampana these problems typically has to do with problematic bioinformatics steps to unify the alleles between summary statistics and LD. The discrepancy between z-scores and LD reference is also a secondary concern.
We do have some pipelines that are still work in progress, to try cope with these issues. It contains two parts:
the 2nd item is still work in progress not ready for public use, although we have a separate branch of susieR to implement the prototype. The entire procedure is outlined here althought at this point it might be the most helpful to compare notes of your procedure with our basic QC for summary statistics.
I think I have figured out my issue regarding the discrepancy of the plot but I will checkthrough the tutorial mentioned above just to be sure and test the analysis in the updated version of the ld matrix.
Thank you so much @gaow for the pipeline and @leoarrow1 for the suggestion
Regards Akhil
Thank you for sharing @akhilpampana. If you could share in more detail what was the issue and how you solved it, this might be helpful for us (and others) to know.
The top snp should be in the strongest cs from my perspective.
This will often be true, but sometimes not. See Fig. 1 of the susie paper for an illustration of this.
Also encountered "The estimated prior variance is unreasonably large" using ieugwasr ld_matrix_local function to calculate the ld matrix. As I have checked the original code of _ld_matrixlocal function in ieugwasr package in GitHub, when calculating the ld matrix _ld_matrixlocal does have "--keep-allele-order" argument in the plink command. But such error still exists. Please advice.