privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Seeking help for LDpred2 Implementation: resulting scores do not correlate with phenotype #363

Closed: JiayiHelenZhou closed this issue 1 year ago

JiayiHelenZhou commented 2 years ago

Hello Dr. Privé,

I am trying to generate polygenic risk scores with LDpred2's automatic model and have encountered some problems. I got LDpred2 running without error and produced a set of results for 950 individuals. However, these LDpred2-produced scores correlate with neither my phenotype nor the scores produced with PRSice2. I followed your tutorial: https://privefl.github.io/bigsnpr/articles/LDpred2.html.

Below are several things I have done in my analysis that are different from the tutorial:

  1. I imputed my genotype data using snp_fastImputeSimple() from bigsnpr to deal with missing data in my genotype matrix.
  2. I used a publicly available LD reference recommended in your tutorial instead of generating my own, since my target data has only 950 individuals. The LD reference I used: https://figshare.com/articles/dataset/European_LD_reference_with_blocks_/19213299
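As a toy illustration of step 1, here is a base-R analogue of simple mean imputation (made-up data, not the package internals; I assume something like the "mean2" method, which as I understand it replaces each missing call with the variant's mean dosage rounded to two decimals):

```r
# Toy analogue of snp_fastImputeSimple(G, method = "mean2"):
# replace each missing genotype with its variant's mean dosage,
# rounded to 2 decimals. The real function operates on the
# file-backed genotype matrix; this is plain base R.
G <- matrix(c(0, 1, 2, NA,
              2, NA, 1, 1), nrow = 4)   # 4 people x 2 variants
impute_mean2 <- function(g) {
  g[is.na(g)] <- round(mean(g, na.rm = TRUE), 2)
  g
}
G_imp <- apply(G, 2, impute_mean2)
anyNA(G_imp)  # FALSE: no missing values remain
```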

I am wondering if you could provide me with some insights on how to implement LDpred2 and tackle this problem. Thank you so much!

privefl commented 2 years ago

I think you might have messed up the indices of columns used to get the final pred. You should be using df_beta.bmi[["_NUM_ID_"]] instead of numid[["GID"]].
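For illustration, a toy base-R analogue of this indexing (made-up G and betas; big_prodVec() performs the same cross-product on the file-backed matrix):

```r
# After snp_match(), "_NUM_ID_" holds the column indices of the matched
# variants in the genotype matrix, so effects and columns line up.
set.seed(42)
G <- matrix(rbinom(50 * 8, 2, 0.3), nrow = 50)        # 50 people x 8 variants
df_beta <- data.frame(beta = rnorm(5),
                      "_NUM_ID_" = c(2, 3, 5, 7, 8),  # indices into G's columns
                      check.names = FALSE)
# Equivalent in spirit to
# big_prodVec(G, df_beta$beta, ind.col = df_beta[["_NUM_ID_"]]):
pred <- drop(G[, df_beta[["_NUM_ID_"]], drop = FALSE] %*% df_beta$beta)
length(pred)  # one score per individual
```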

A few minor remarks:

JiayiHelenZhou commented 2 years ago

Hi Dr. Privé, thank you for your detailed suggestions. I run LDpred2 in the cloud for the extra memory, but I have encountered some difficulty with my cloud setup and have been working on solving it. In the meantime, could you please leave this ticket open? I will get back to you as soon as possible. Thank you so much for your patience!

privefl commented 2 years ago

Any update on this?

JiayiHelenZhou commented 2 years ago

Hi Dr. Privé,

Thank you for your patience. I have tried your suggestions. In my previous code, I used numid[["GID"]] instead of df_beta.bmi[["_NUM_ID_"]] because my target data has far fewer variants compared to both the LD reference you recommended (https://figshare.com/articles/dataset/European_LD_reference_with_blocks_/19213299) and the summary statistics I am using. As a result, when using _NUM_ID_ I was getting an "out of bound" error. So I wrote this part of my code to match the variants in my target data back to df_beta.bmi:

library(dplyr)

## Match the summary statistics to the target map and to the LD-reference map
df_beta <- snp_match(bmi.sumstats, map)
df_beta <- tidyr::drop_na(tibble::as_tibble(df_beta))
df_beta.bmi <- snp_match(bmi.sumstats, map_ldref)
df_beta.bmi <- tidyr::drop_na(tibble::as_tibble(df_beta.bmi))

## Indices of the matched variants in each map, keyed by rsID
df_beta.id <- df_beta %>%
  select("rsid", "_NUM_ID_") %>%
  rename(GID = "_NUM_ID_")

df_beta.bmi.id <- df_beta.bmi %>%
  select("rsid", "_NUM_ID_")

## Keep only the variants present in both matched sets
numid <- inner_join(df_beta.id, df_beta.bmi.id, by = "rsid")

df_beta.bmi <- df_beta.bmi[df_beta.bmi$rsid %in% numid$rsid, ]

## Save the matched betas for later use:
## saveRDS(df_beta, file = "bmi_data/df_beta.rds")

Knowing that this may cause problems, after consulting with you I decided to use an imputed target dataset, which contains many more variants. Using the imputed data solved the "out of bound" issue when running big_prodVec(G, beta_auto.noscaling, ind.row = ind.test, ind.col = df_beta.bmi[["_NUM_ID_"]]). However, my final pred_auto now contains NAs: only 254 out of 950 individuals have scores, and those 254 scores do not correlate with the PRSice2-produced polygenic scores: [attached image]
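One possible cause of such NAs (an assumption worth checking): big_prodVec(), like %*% in base R, propagates missing genotypes, so any individual with even one remaining NA among the selected columns gets an NA score. A toy illustration:

```r
# A single remaining NA genotype makes that person's score NA.
G <- matrix(c(0, 1,
              NA, 2,
              1, 0), nrow = 2)   # 2 people x 3 variants (person 1 has an NA)
beta <- c(0.1, -0.2, 0.3)
pred <- drop(G %*% beta)
pred  # NA for person 1, -0.3 for person 2
```

If this is the cause here, re-running the imputation on the exact genotype matrix used for scoring should remove the NAs.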

Could you please provide some insight on how to solve this problem? Thank you so much!

privefl commented 2 years ago

Have you looked at this example?

JiayiHelenZhou commented 2 years ago

Hi Dr. Privé,

Thank you for your patience. I was just able to get back to working on this. I had looked at this example before, but I encountered a problem in the variant-filtering step starting from line 27:

sd_ldref <- with(df_beta.bmi, sqrt(2 * af_UKBB * (1 - af_UKBB)))
sd_ss <- with(df_beta.bmi, 2 / sqrt(n_eff * beta_se^2))

is_bad <-
  sd_ss < (0.5 * sd_ldref) | sd_ss > (sd_ldref + 0.1) | sd_ss < 0.1 | sd_ldref < 0.05

library(ggplot2)
qplot(sd_ldref, sd_ss, color = is_bad, alpha = I(0.5)) +
  theme_bigstatsr() +
  coord_equal() +
  scale_color_viridis_d(direction = -1) +
  geom_abline(linetype = 2, color = "red") +
  labs(x = "Standard deviations derived from allele frequencies of the LD reference",
       y = "Standard deviations derived from the summary statistics",
       color = "Removed?")

My plot looks like this: [attached image: AF_QC_plot]

I was confused by this, and in my previous analysis I did not include this step, since it would remove all variants in the summary statistics file.

May I ask whether I can still use this LD reference with my summary statistics file of interest? My target data has only 950 individuals, which is why I initially decided to use the provided LD reference instead of generating my own.

Thank you!

privefl commented 2 years ago

Are you using a continuous outcome?

Try with(df_beta.bmi, 1 / sqrt(n_eff * beta_se^2)) instead of with(df_beta.bmi, 2 / sqrt(n_eff * beta_se^2)).
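The rationale, as I understand it (sketched here with simulated, made-up data): for a continuous, standardized phenotype, se(beta_j) ≈ sd(y) / (sqrt(n) * sd(G_j)), so sd(G_j) ≈ 1 / sqrt(n * se^2); the factor 2 comes from the logistic-regression case for binary traits. A quick base-R check of the continuous-trait formula:

```r
# Simulated GWAS for one variant with a continuous, standardized phenotype:
# 1 / sqrt(n * se^2) should recover sd(genotype), while the binary-trait
# formula 2 / sqrt(n * se^2) would be off by a factor of 2.
set.seed(1)
n <- 10000
g <- rbinom(n, 2, 0.3)                   # genotype dosages, sd ~ sqrt(2*0.3*0.7)
y <- drop(scale(0.02 * g + rnorm(n)))    # standardized continuous phenotype
se <- summary(lm(y ~ g))$coefficients["g", "Std. Error"]
c(sd_true = sd(g), sd_from_ss = 1 / sqrt(n * se^2))
```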

privefl commented 1 year ago

Did it work?