Closed JiayiHelenZhou closed 1 year ago
I think you might have messed up the indices of columns used to get the final pred
. You should be using df_beta.bmi[["_NUM_ID_"]]
instead of numid[["GID"]]
.
A few minor remarks:
sum(is.na(G[1:950, 1:671770]))
as you're accessing the entire genotype matrix in memory; instead do something like sum(big_counts(G)[4, ])
snp_match()
) and missing values (if not in beta/beta_se/n_eff)vec_p_init = 0.03
means you're running only one chain, which is not recommended (and ncores = NCORES
is then useless)ind.test = 1:nrow(G2)
, it's the default in most functionsHi Dr. Privé, Thank you for your detailed suggestions. I run LDpred2 on cloud for extra memory while I have encountered some difficulty with my cloud. I have been working on solving it. In the meantime, could you please leave this ticket open? I will try to get back to you as soon as possible. Thank you so much for your patience!
Any update on this?
Hi Dr. Privé,
Thank you for your patience. I have tried your suggestions. In my previous code, I used numid[["GID"]]
instead of df_beta.bmi[["_NUM_ID_"]]
since my target data has much fewer variants comparing to both the LDred (recommened by you: https://figshare.com/articles/dataset/European_LD_reference_with_blocks_/19213299) and the summary statistics that I am using. As the result, when using the _NUM_ID
I was getting an "out of bound error". Thus, I wrote this part in my code to match the variants in my target data's SNPs back to the df_beta.bmi
:
df_beta <- snp_match(bmi.sumstats, map)
(df_beta<- tidyr::drop_na(tibble::as_tibble(df_beta)))
df_beta.bmi <- snp_match(bmi.sumstats, map_ldref)
(df_beta.bmi<- tidyr::drop_na(tibble::as_tibble(df_beta.bmi)))
df_beta.id = df_beta %>%
select("rsid","_NUM_ID_") %>%
rename(GID = "_NUM_ID_")
df_beta.bmi.id = df_beta.bmi %>%
select("rsid", "_NUM_ID_")
numid = inner_join(df_beta.id, df_beta.bmi.id, by = "rsid")
df_beta.bmi <- df_beta.bmi[df_beta.bmi$rsid %in% numid$rsid,]
## save the beta rds file for later use:
## saveRDS(df_beta, file = "bmi_data/df_beta.rds")
Knowing that is may cause problem after consulting you, I have decided to us a imputed target dataset which contain much more variants. Using the imputed data solved the "out of bound" issue when running big_prodVec(G, beta_auto.noscaling, ind.row = ind.test, ind.col = df_beta.bmi[["_NUM_ID_"]])
. However, now I my final pred_auto
result contain NAs. Only 254 out of 950 individuals have scores produced. And the 254 scores did not correlated with PRSice2 produced polygenic scores:
Could you please provide some insight on how to solve this problem? Thank you so much!
Have you looked at this example?
Hi Dr. Privé,
Thank you for your patience. I was just able to get back working on this. I have looked at this example before, while I encountered a problem in the variants filtering step starting from line 27:
sd_ldref <- with(df_beta.bmi, sqrt(2 * af_UKBB * (1 - af_UKBB)))
sd_ss <- with(df_beta.bmi, 2 / sqrt(n_eff * beta_se^2))
is_bad <-
sd_ss < (0.5 * sd_ldref) | sd_ss > (sd_ldref + 0.1) | sd_ss < 0.1 | sd_ldref < 0.05
library(ggplot2)
qplot(sd_ldref, sd_ss, color = is_bad, alpha = I(0.5)) +
theme_bigstatsr() +
coord_equal() +
scale_color_viridis_d(direction = -1) +
geom_abline(linetype = 2, color = "red") +
labs(x = "Standard deviations derived from allele frequencies of the LD reference",
y = "Standard deviations derived from the summary statistics",
color = "Removed?")
My plot looks like this:
I was confused, and in my previous analysis I did not add in this step since it will remove all variants in the summary statistics file.
May I ask if I can still use the LD reference with my summary statistics file of interest? My target data have only 950 individuals, which is why I initially decided to use the provided LD reference instead of generating my own.
Thank you!
Are you using a continuous outcome?
Try with with(df_beta.bmi, 1 / sqrt(n_eff * beta_se^2))
instead of with(df_beta.bmi, 2 / sqrt(n_eff * beta_se^2))
.
Did it work?
Hello Dr. Privé,
I am trying to generate polygenic risk scores with LDpred2's automatic model and have encountered some problems. I have got LDpred2 running without error and produced a set of results for 950 individuals. However, these LDpred2 produced scores correlate with neither my phenotype nor scores produced with PRSice2. I followed your tutorial: https://privefl.github.io/bigsnpr/articles/LDpred2.html.
Below are several things I have done in my analysis that are different from the tutorial:
I am wondering if you could provide me with some insights on how to implement LDpred2 and tackle this problem. Thank you so much!