privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

LDPred2 with custom reference panel #406

Closed sourena-s closed 1 year ago

sourena-s commented 1 year ago

Hi, I am planning to use LDpred2 to build PGS in UK Biobank subjects. After variant QC, I have ~9 million variants. I wonder if I will get any accuracy boost if I build the LD reference from two thousand individuals of the same cohort (instead of filtering to the HapMap3+ panel, which leaves ~1.3M variants after matching with UKB).

So far, I have managed to create LD matrices following the example code for the smaller chromosomes (chr 15-20). However, for chromosomes with > 300k variants, I get a memory-related error when the line below tries to generate the SFBM matrix:

corr <- as_SFBM(corr0, tmp)

*** caught segfault ***
address 0x151d19847000, cause 'memory not mapped'

Traceback:
 1: col_count_sym(spmat@p, spmat@i)
 2: .Object$initialize(...)
 3: initialize(value, ...)
 4: initialize(value, ...)
 5: methods::new("SFBM", spmat = spmat, backingfile = backingfile)
 6: as_SFBM(corr0, tmp)

Any advice is greatly appreciated. Thanks, Sourena

privefl commented 1 year ago

The current version of LDpred2 is not suited to using 9M variants for now, sorry. Moreover, this does not always lead to better predictive power; I have discussed this in multiple issues here.
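
For anyone hitting the same segfault on very large chromosomes, one common mitigation (a sketch, assuming a bigsparser version where as_SFBM() accepts compact = TRUE, as used in the current LDpred2 tutorial) is to store the SFBM in compact form, which can substantially reduce the size of the backing file:

    # same conversion step as above, but with a compact on-disk representation
    corr <- as_SFBM(corr0, tmp, compact = TRUE)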

sourena-s commented 1 year ago

Thanks for the clarification. A follow-up question: while using the grid model, out of 50 chains, only 18 converge and the rest produce NaNs. The h2_est of the trait is 0.028, and I use the parameters below in the grid search:

h2_seq <- round(h2_est * seq_log(0.2, 5, 5), 4)
p_seq <- signif(seq_log(1e-5, 0.9, length.out = 5), 2)
params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE))

beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = 32, burn_in = 1000, num_iter = 10000)

Am I using the correct settings?

Thanks, Sourena

privefl commented 1 year ago

Which packageVersion("bigsnpr") do you have?

There is no need to run a huge number of iterations as you did.
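
For comparison, a grid closer to the LDpred2 tutorial, leaving burn_in and num_iter at their defaults, would look roughly like this (values recalled from the tutorial, so treat this as a sketch):

    h2_seq <- round(h2_est * c(0.3, 0.7, 1, 1.4), 4)
    p_seq  <- signif(seq_log(1e-5, 1, length.out = 21), 2)
    params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE))
    # default burn_in / num_iter instead of 1000 / 10000
    beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = 32)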

sourena-s commented 1 year ago

> packageVersion("bigsnpr")
[1] ‘1.12.0’

privefl commented 1 year ago

Try to update the package, and reduce the number of iterations.

The more iterations you do, the more chance it has of diverging.
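
(For reference, updating to the development version can be done with something like the line below, assuming the remotes package is installed; devtools::install_github() works as well.)

    remotes::install_github("privefl/bigsnpr")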

sourena-s commented 1 year ago

Thanks, I updated to 1.12.1 from GitHub. This time I used the following settings:

set.seed(1)
n_iter <- 1000
n_burn <- 250

multi_auto_no_alpha <- snp_ldpred2_auto(
  corr, df_beta, h2_init = h2_est,
  vec_p_init = seq_log(0.01, 1, length.out = 25),
  burn_in = n_burn, num_iter = n_iter,
  sparse = TRUE, verbose = TRUE,
  use_MLE = FALSE, allow_jump_sign = FALSE, shrink_corr = 0.95,
  ind.cor = info_snp$`_NUM_ID_`, ncores = 64)

The sampling paths look as attached. Although there was no divergence this time (i.e. no NaNs), all 25 chains show the same oscillating h2 path.

Should I accept these results or continue troubleshooting?

[attached image: Gibbs sampling paths]

Sourena
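
A minimal sketch of how such per-chain paths can be inspected (assuming the list elements path_h2_est, path_p_est, h2_est and p_est returned by snp_ldpred2_auto in recent bigsnpr versions):

    auto <- multi_auto_no_alpha[[1]]  # inspect one chain
    plot(auto$path_h2_est, type = "l", xlab = "iteration", ylab = "h2")
    abline(h = auto$h2_est, col = "blue")  # estimate reported for this chain
    plot(auto$path_p_est, type = "l", log = "y", xlab = "iteration", ylab = "p")
    abline(h = auto$p_est, col = "blue")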

privefl commented 1 year ago

No, this path is odd. There is something wrong here. What LD are you using? Is it from the same population as the summary statistics in df_beta? Did you check hist(with(df_beta, pchisq((beta / beta_se)^2, df = 1, lower.tail = FALSE)))? And how many variants do you have?

sourena-s commented 1 year ago

That's what I thought. As an additional note, all chains with use_MLE=TRUE had diverged before. I calculated LD from 2000 random subjects of the target population (white British ancestry from UKB). The GWAS is of a closely related ancestry.

I used the below template code for creating LD matrices:

for (chr in 1:22) {

    # indices of the variants on this chromosome
    ind.chr <- which(as.integer(genotype$map$chromosome) == chr)
    print(paste("Calculating LD for", length(ind.chr), "variants in chromosome", chr))

    # windowed sparse correlations; with POS2 in cM, size = 3/1000 corresponds to a 3 cM window
    corr0 <- snp_cor(genotype$genotype, ind.col = ind.chr, ncores = 1,
                     infos.pos = POS2[ind.chr], size = 3 / 1000)

    if (chr == 1) {
        # initialize the LD scores and the on-disk SFBM with the first chromosome
        ld <- Matrix::colSums(corr0^2)
        corr <- as_SFBM(corr0, tmp)
    } else {
        # append the following chromosomes
        ld <- c(ld, Matrix::colSums(corr0^2))
        corr$add_columns(corr0, nrow(corr))
    }
    print(paste("Chromosome", chr, "processed."))
}
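
For completeness, the h2_est used in the models above is typically obtained from this ld vector via LD score regression (a sketch following the LDpred2 tutorial; df_beta is assumed to contain beta, beta_se and n_eff for the same variants, in the same order as ld):

    # LD score regression on the matched summary statistics
    ldsc <- with(df_beta,
                 snp_ldsc(ld, length(ld), chi2 = (beta / beta_se)^2,
                          sample_size = n_eff, ncores = 1))
    h2_est <- ldsc[["h2"]]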

And the chi-squared histogram looks as below:

[attached image: histogram of the recomputed p-values]

privefl commented 1 year ago

The distribution of p-values looks fine.

What are dim(corr) and file.size(corr$backingfile)?

sourena-s commented 1 year ago

> dim(corr)
[1] 1303558 1303558
> file.size(corr$backingfile)
[1] 52258540224

privefl commented 1 year ago

Seems fine as well.

Did you try with another set of GWAS summary statistics? Are the ones you're using publicly available? So that I could maybe try them.

BTW, why did you choose not to use the LD provided?

sourena-s commented 1 year ago

I am using two GWASs at this moment, this one is publicly available: https://www.ebi.ac.uk/gwas/studies/GCST90027158 I only keep those variants with the maximum sample size.

Well, in the beginning I thought I could make the polygenic model work with 8M variants (which is how this thread started), so I calculated the LD locally. But now the problem is with convergence.

Thanks for your advice

privefl commented 1 year ago

I see, so you're restricting to 1.3M variants with max N. These may be highly correlated and might not cover the genome well.

I think you would benefit from restricting to HM3 or HM3+ variants, and keeping all of those with N > 0.7*max(N).
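
A sketch of that filtering (assuming map_hm3_plus is the HM3+ variant map distributed with the LDpred2 tutorial, and that the summary statistics sumstats have the columns expected by snp_match plus n_eff):

    # keep variants with a large-enough effective sample size, then match to HM3+
    sumstats <- subset(sumstats, n_eff > 0.7 * max(n_eff))
    df_beta  <- snp_match(sumstats, map_hm3_plus)  # needs chr, pos, a0, a1, beta in sumstats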

sourena-s commented 1 year ago

I will run that this week and share the results. While running lassosum2 on the same data, the model optimizes with a very large delta value of 100, lambda of ~0.0008, and sparsity ~0.54, resulting in max(r2) ~ 0.003.

As far as I understand from the paper, delta corresponds to the L2 regularization and shrinks the off-diagonal elements of the LD matrix. Given that the model optimizes at a very large delta, does that mean the calculated LD may be wrong?

Please see attached the lassosum2 results. Colors code delta in the range 0.01 to 10000 (sorry for the lack of labels; our server doesn't forward X11 text properly and I need to rerun everything on my laptop).

[attached image: lassosum2 grid results]

privefl commented 1 year ago

A value of 100 basically means you're using the identity matrix as LD. So, yes, there must be something wrong. You shouldn't need more than 1 (or maybe 3 in extreme cases).
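
For reference, a sketch of restricting the delta grid accordingly (argument names as in snp_lassosum2; the default grid may already be in this range):

    # keep delta at or below ~1, as suggested above
    beta_lassosum2 <- snp_lassosum2(corr, df_beta,
                                    delta = c(0.001, 0.01, 0.1, 1), ncores = 32)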

privefl commented 1 year ago

Any update on this?

sourena-s commented 1 year ago

I made some observations; some context: the GWASs from which I am developing PRS are case/control psychiatric and neurological disorder GWASs. However, I am optimizing them in my target cohort not with respect to the same (binary) phenotype, but rather with respect to continuous phenotypes derived from brain MRI. These are volumetric measures of the brain (regional volumes) and the 'integrity' of white-matter tracts, both of which are well-behaved and approximately Gaussian.

In some cases, for example when I use a GWAS of the same MRI phenotypes to develop PRS using lassosum2, the L1/L2 penalties and optimization paths are as expected (i.e., very low L2). For example, this is a PRS generated from cortical surface area GWAS and then optimized in the target population using the same class of structural brain MRI phenotypes:

[attached image: lh_surface_area_lassosum_PC198]

However, when I try to generate PRS using e.g. schizophrenia GWAS, I get different optimization paths for structural MRI (regional volume phenotype) versus diffusion MRI (white-matter integrity phenotype):

[attached image: SCZ_TBM (structural MRI)]

[attached image: SCZ_DMRI (diffusion MRI)]

In the above plots, it is apparent that a strong L2 penalty increases the accuracy of the PRS in structural but not diffusion MRI.

As far as I understand, such a strong L2 would 'destroy' the LD contribution that suppresses correlated variants in the PRS model, effectively making the model assume that all variants are independent. So I checked the Pearson correlation of each PRS generated in the grid search against an orthogonal method (SBayesR, dashed line) using another neurological GWAS (the same one as in my previous comment), and it turns out that the correlation is very high anyway for many of the grid points (r > 0.8-0.9).

[attached image: lassosum2_hapmap_corr_sbayesR]

I don't know what the exact reason is, but it seems that increasing the L2 in some cases gives an extra boost of accuracy when a PRS derived from a neuropsychiatric case/control GWAS is used to explain phenotypes derived from quantitative brain MRI.

Extra notes: calculating LD from 2000 subjects of the target population versus using the off-the-shelf reference panels doesn't affect the accuracy at all, nor does using the top 3 million significant variants from the GWAS instead of the HapMap3+ variants (I find the latter very interesting).

privefl commented 1 year ago

As far as I understand, using a very strong L2 regularization corresponds to using the identity matrix as the LD reference, which corresponds to just using (some scaled version of) the GWAS effect sizes directly. This is clumping and thresholding without the clumping or the thresholding, i.e. simply using the GWAS effect sizes as predictive effects without any filtering. That should not be the optimal solution.

But since you're trying to predict something else, I don't know.

privefl commented 1 year ago

Are your questions answered, or do you still have a few more?

sourena-s commented 1 year ago

Following this conversation, I think I better understand how the software works, but I need to explore this particular use case and make sense of the results in the coming weeks. Thanks for following up.

privefl commented 1 year ago

Any update on this?