privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
183 stars 43 forks source link

Inference with LDpred2-auto #451

Closed beradan closed 9 months ago

beradan commented 9 months ago

Hi Florian,

I would like to incorporate the new inference functionality in my calculations. I could not but notice that you used slightly different parameters in your example than the ones you recommended in the description of LDpred2-auto.

  1. Are there any major advantages of using an initial p vector of 50 elements or using a higher number of iterations?

  2. Can I use the default parameter of 200 iterations and change the number of rows returned by the tail function from 500 to 200 to retrieve the estimated heritability, alpha, and polygenicity?

  3. Can I use report_step = 20 for the sampling betas to infer predictive performance without necessarily changing the other parameters?

Thank you for your answers.

Sandor

privefl commented 9 months ago
  1. The more you use, the safer it is. And you need enough samples to get 95% CIs.

  2. Yes, you can. But 200 iterations might not be enough as burn-in.

  3. Not sure I get this one.

beradan commented 9 months ago

I am trying to write a script using LDpred2-auto that calculates polygenic risk scores and infers heritability and predictive performance automatically. Previously, I have been using this code:

multi_auto <- snp_ldpred2_auto(
    corr, df_beta, h2_init = ldsc_h2_est,
    vec_p_init = seq_log(1e-4, 0.2, length.out = 30), ncores = NCORES,
    burn_in = 500, num_iter = 200,
    use_MLE = TRUE, allow_jump_sign = FALSE, shrink_corr = 0.95
)

To infer heritability, you recommended the following solution:

all_h2 <- sapply(multi_auto[keep], function(auto) tail(auto$path_h2_est, 500))
quantile(all_h2, c(0.5, 0.025, 0.975))

If I understand correctly, you use the tail function as a filter to infer heritability from iterations after burn-in only. So I have to change the "size of the tail" according to the size of num_iter, right?

The inference of predictive performance is a bit more tricky. I understand that this code requires sampling betas:

bsamp <- lapply(multi_auto[keep], function(auto) auto$sample_beta)
all_r2 <- do.call("cbind", lapply(seq_along(bsamp), function(ic) {
    b1 <- bsamp[[ic]]
    Rb1 <- apply(b1, 2, function(x)
        coef_shrink * bigsparser::sp_prodVec(corr, x) + (1 – 0.95) * x)
    b2 <- do.call("cbind", bsamp[-ic])
    b2Rb1 <- as.matrix(Matrix::crossprod(b2, Rb1))
}))
quantile(all_r2, c(0.5, 0.025, 0.975))

By default, snp_ldpred2_auto function does not report anything. I found it in the manual that using num_iter = 200 and report_step = 20 will report 10 vectors of sampling betas. Thus, I concluded that adding report_step = 20 as a new argument to my original code might work.

My last concern is that I do not know the correct way to go about it. Would you suggest that I use my original code at the beginning of this post, or would you say that it never hurts to use a larger number of iterations, and that it would be more appropriate to calculate polygenic risk scores with the code you proposed in the "Inference with LDpred2-auto" section of the tutorial?

privefl commented 9 months ago

If I understand correctly, you use the tail function as a filter to infer heritability from iterations after burn-in only. So I have to change the "size of the tail" according to the size of num_iter, right?

Exactly

By default, snp_ldpred2_auto function does not report anything. I found it in the manual that using num_iter = 200 and report_step = 20 will report 10 vectors of sampling betas. Thus, I concluded that adding report_step = 20 as a new argument to my original code might work.

Yes that works, but maybe use 500 iterations instead, and you'll get 25 samples per chain.

My last concern is that I do not know the correct way to go about it. Would you suggest that I use my original code at the beginning of this post, or would you say that it never hurts to use a larger number of iterations, and that it would be more appropriate to calculate polygenic risk scores with the code you proposed in the "Inference with LDpred2-auto" section of the tutorial?

You should get similar results for polygenic scores using either 200 or 500 iterations.

beradan commented 9 months ago

Thank you for your feedback.