LDpred2: running auto model gives NAs for all beta_est

SaraBecelaere commented 2 months ago

Dear Florian,

I’m using LDpred2 to calculate PGS. I’ve been able to use the infinitesimal and grid model successfully. However, when I’m trying to run the auto grid and auto model, I’m running in some issues. I’ve done the following:

coef_shrink <- 0.95
multi_auto <- snp_ldpred2_auto(corr, info_snp, h2_init = ldsc_h2_est, vec_p_init = seq_log(1e-4, 0.2, length.out = 30), ncores = 2, #use_MLE = FALSE,  allow_jump_sign = FALSE, shrink_corr = coef_shrink)

When I then check multi_auto[[1]], I see that there are NA’s everywhere (all beta_est, postp_est, corr_est are NA). When I change coef_shrink to 0.8 or 0.5, I still get the same result. I have also tried to set use_MLE = FALSE, and vary the coef_shrink parameter, but see no change. Any suggestions about what might be going wrong?

Below you can also see how I created corr, info_snp and ldsc_h2_est:

obj.bigSNP <- snp_attach("data.rds"))
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection
map_test <- setNames(obj.bigSNP$map[-3], c("chr", "rsid", "pos", "a1", "a0"))

map_ldref < -readRDS(runonce::download_file("https://figshare.com/ndownloader/files/37802721", dir = "tmp-data"), fname = "map_hm3_plus.rds"))
map_ldref <- as.data.frame(map_ldref)
in_test <- vctrs::vec_in(map_ldref[, c("rsid")], map_test[, "rsid"])
map_ldref2 <- map_ldref[in_test, ]
info_snp <- snp_match(sumstats, map_ldref2, join_by_pos = FALSE)

tmp <- tempfile(tmpdir = "tmp-data"))
for (chr in 1:22) {
cat(chr, ".. ", sep = "")
ind.chr <- which(info_snp$chr == chr)
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
ind.chr3 <- match(ind.chr2, which(map_ldref2$chr == chr))
corr_chr <- readRDS(paste0("/ldref_hm3_plus/LD_with_blocks_chr", chr, ".rds"))[ind.chr3, ind.chr3]
  if (chr == 1) {
corr <- as_SFBM(corr_chr, tmp, compact = TRUE)
} else {
corr$add_columns(corr_chr, nrow(corr))
}}

ldsc <- with(info_snp, snp_ldsc(ld, nrow(map_ldref2), chi2 = (beta / beta_se)^2,
                                sample_size = n_eff, blocks = NULL))
ldsc_h2_est <- ldsc[["h2"]]

Thanks in advance for the help.

Best, Sara

privefl commented 2 months ago

Can you show me the QC plot comparing standard deviations?

SaraBecelaere commented 2 months ago

Hi Florian,

Below the QC plot:

QC_plot_sd_comparison

privefl commented 2 months ago

Are you removing everything? Or skipping this QC step?
Is this a GWAS for a binary or continuous phenotype?

SaraBecelaere commented 2 months ago

I skipped this QC step
This is a GWAS for height, so continuous

privefl commented 2 months ago

I guess you used 2 for the numerator, where you probably have used 1 instead (sd(y), probably was standardized).
But the plot is not horrible, so it won't probably make a large difference to do the QC or not.
If you point me to the sumstats you've used, I can probably try in the next few days.

SaraBecelaere commented 2 months ago

The summary statistics I used are from Yengo et al. 2022, the European summary stats. I used rsID’s instead of chr:pos since the samples I want to calculate the scores for are in build 38.
However, I doubt whether the GWAS is the issue here, since at some point I was able to run the auto model on a subset of the data (that had less SNPs). However, when using the full dataset I got errors, which seemed weird to me. That is why I changed the code (above) to taking an overlap of the SNPs between the HapMap dataset, GWAS and plink data before running the correlation matrix. Below the code I used on the subset, which didn’t work on the full dataset:

map_ldref <- readRDS(runonce::download_file("https://figshare.com/ndownloader/files/37802721", dir = "tmp-data", fname = "map_hm3_plus.rds"))
info_snp <- snp_match(sumstats, map_ldref, join_by_pos = FALSE)
obj.bigSNP <- snp_attach(“data.rds"))
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection

map_test <- setNames(obj.bigSNP$map[-3], c("chr", "rsid", "pos", "a1", "a0"))
in_test <- vctrs::vec_in(info_snp[, c("rsid")], map_test[, "rsid"])
df_beta <- info_snp[in_test, ]

Creating correlation matrix:

tmp <- tempfile(tmpdir = "tmp-data")
for (chr in 1:22) {
cat(chr, ".. ", sep = "")
ind.chr <- which(info_snp$chr == chr)
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
ind.chr3 <- match(ind.chr2, which(map_ldref2$chr == chr))
corr_chr <- readRDS(paste0("/ldref_hm3_plus/LD_with_blocks_chr", chr, ".rds"))[ind.chr3, ind.chr3]
  if (chr == 1) {
corr <- as_SFBM(corr_chr, tmp, compact = TRUE)
} else {
corr$add_columns(corr_chr, nrow(corr))
}}
df_beta2 <- snp_match(df_beta, map_test, join_by_pos = FALSE)

Getting h2:

ldsc <- with(df_beta, snp_ldsc(ld, nrow(map_ldref), chi2 = (beta / beta_se)^2, sample_size = n_eff, blocks = NULL))
ldsc_h2_est <- ldsc[["h2"]]

Running auto model:

coef_shrink <- 0.95  
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = ldsc_h2_est, vec_p_init = seq_log(1e-4, 0.2, length.out = 30), ncores = NCORES, # use_MLE = FALSE, allow_jump_sign = FALSE, shrink_corr = coef_shrink)

privefl commented 2 months ago

I have used these GWAS summary statistics in a previous paper, and it worked just fine. Cf. this code, where you can skip the ancestry part at the beginning, and the annotation part at the end.

SaraBecelaere commented 2 months ago

Hi Florian,

With your code I now got beta values! Thank you!

However, I’m now struggling to get predictions for my data, since the amount of beta’s and the amount of SNPs in my data are different. I tried the following:

obj.bigSNP <- snp_attach("data.rds")
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection
map_test <- setNames(obj.bigSNP$map[-3], c("chr", "rsid", "pos", "a1", "a0"))

df_beta2 <- data.frame(df_beta) #df_beta here is the same as in the code you referred me to
in_test_2 <- vctrs::vec_in(df_beta2[, c("rsid")], map_test[, "rsid"])
df_beta3 <- df_beta2[in_test_2, ]
info_snp3 <- snp_match(sumstats, df_beta3, join_by_pos = FALSE)
pred_auto_grid <- big_prodMat(G, beta_auto_grid, ind.col = info_snp3[["_NUM_ID_"]])

Then I get the following error:

Error: Incompatibility between dimensions. 'ind.col' and 'rows_along(A.col)' should have the same length.

Do you have any idea how to solve this?

privefl commented 2 months ago

Have a look at this example code.

SaraBecelaere commented 2 months ago

I've tried the example code, but then I get the error below when trying the following code

pred_auto <- big_prodVec(G, beta_auto_grid * map_pgs2$beta,
                         ind.col = map_pgs2[["_NUM_ID_"]],
                         ncores = NCORES)

Error: Incompatibility between dimensions. 'y.col' and 'ind.col' should have the same length.

FYI, beta_auto_grid contains 1,008,330 lines and 50 columns; map_pgs2 contains 73,197 lines (I'm using a subset of my data to test the code, hence the low number of SNPs that match).

privefl commented 2 months ago

beta_auto_grid seems to be a matrix, so you should use big_prodMat instead.

SaraBecelaere commented 2 months ago

Ah yes, I overlooked that!

However, I still get this error:

Error: Incompatibility between dimensions. 'ind.col' and 'rows_along(A.col)' should have the same length.

Here's the code I used:

pred_auto <- big_prodMat(G, beta_auto_grid * map_pgs2$beta, ind.col = map_pgs2[["_NUM_ID_"]], ncores = 1)

privefl commented 2 months ago

I guess you didn't filter on the variants in your test set beforehand (cf. this code).

Then you might need to use something like beta_auto_grid[map_pgs2[["_NUM_ID_.ss"]], ] instead.

SaraBecelaere commented 2 months ago

I see that when using beta_auto_grid[map_pgs2[["_NUM_ID_.ss"]], ] the number of lines is the same as in map_pgs2 now. However, I get this error now:

Error in big_prodMat(G, beta_auto_grid[map_pgs2[["_NUM_ID_.ss"]], ], ind.col = map_pgs2[["_NUM_ID_.ss"]],  :
  all(ind.col <= ncol(X)) is not TRUE

Below the code used:

pred_auto <- big_prodMat(G, beta_auto_grid[map_pgs2[["_NUM_ID_.ss"]], ], ind.col = map_pgs2[["_NUM_ID_.ss"]], ncores = 1)

privefl commented 2 months ago

Do not modify ind.col.

SaraBecelaere commented 2 months ago

I wasn't sure whether you meant don't modify ind.col from previous code or don't add in the function. So I tried both and both give an error

pred_auto <- big_prodMat(G, beta_auto_grid[map_pgs2[["_NUM_ID_.ss"]], ], ncores = 1)

Error: Incompatibility between dimensions.
'ind.col' and 'rows_along(A.col)' should have the same length.

pred_auto <- big_prodMat(G, beta_auto_grid[map_pgs2[["_NUM_ID_.ss"]], ], ind.col = map_pgs2[["_NUM_ID_"]], ncores = 1)
```
Error in { :
task 1 failed - "Not compatible with requested type: [type=list; target=double]."
```

privefl commented 2 months ago

I meant the second one; leave it as before. I thought beta_auto_grid was a matrix; is it a data frame or a list?

SaraBecelaere commented 2 months ago

Yes, beta_auto_grid is a data frame, because I saved it from a previous run.

I transformed it into a matrix and now it seems to work! Thank you so much for your help with this!

privefl / bigsnpr

LDpred2: running auto model gives NAs for all beta_est #495