The current version of LDpred is not suitable to use with 9M variants for now, sorry. Moreover, this does not always lead to better predictive power. I have discussed this in multiple issues here.
Thanks for the clarification. A follow-up question: While using the grid model, out of 50 chains, only 18 of them converge and the rest generate nans. The h2_est of the trait is 0.028, and I use the below parameters in the grid search:
h2_seq <- round(h2_est * seq_log(0.2, 5, length.out = 5), 4)
p_seq <- signif(seq_log(1e-5, 0.9, length.out = 5), 2)
params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE))
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = 32, burn_in = 1000, num_iter = 10000)
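For completeness, this is how I then score the grid, simply dropping the all-NA chains (here G and y stand for my validation genotype matrix and phenotype, following the tutorial's pattern):

```r
# Compute one PGS per grid point; diverged chains give all-NA columns
pred_grid <- big_prodMat(G, beta_grid, ind.col = df_beta[["_NUM_ID_"]])

# Score each grid point by the t-statistic of its PGS in a linear model,
# returning NA for the diverged (all-NA) chains
params$score <- apply(pred_grid, 2, function(x) {
  if (all(is.na(x))) return(NA)
  summary(lm(y ~ x))$coef["x", 3]
})
best_params <- params[which.max(params$score), ]
```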
Am I using the correct settings?
Thanks, Sourena
Which packageVersion("bigsnpr") do you have?
There is no need to run a huge number of iterations as you did.
> packageVersion("bigsnpr")
[1] ‘1.12.0’
Try to update the package, and reduce the number of iterations.
The more iterations you do, the more chance it has of diverging.
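For reference, with a recent version something like the following is usually enough (a sketch; the post-hoc chain filtering follows the pattern from the LDpred2 tutorial, and NCORES is a placeholder):

```r
multi_auto <- snp_ldpred2_auto(
  corr, df_beta, h2_init = h2_est,
  vec_p_init = seq_log(1e-4, 0.2, length.out = 30),
  allow_jump_sign = FALSE, shrink_corr = 0.95, ncores = NCORES)

# Post-hoc filtering: keep chains whose scale of per-variant effects is
# consistent with the bulk of the chains (a proxy for convergence)
range_chains <- sapply(multi_auto, function(auto) diff(range(auto$corr_est)))
keep <- which(range_chains > (0.95 * quantile(range_chains, 0.95, na.rm = TRUE)))

# Average effect sizes over the retained chains only
beta_auto <- rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est))
```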
Thanks, I updated to 1.12.1 from GitHub. This time I used the following settings:
set.seed(1)
n_iter <- 1000
n_burn <- 250
multi_auto_no_alpha <- snp_ldpred2_auto(
  corr, df_beta,
  burn_in = n_burn, num_iter = n_iter,
  verbose = TRUE, sparse = TRUE, h2_init = h2_est,
  use_MLE = FALSE, allow_jump_sign = FALSE,
  shrink_corr = 0.95,
  vec_p_init = seq_log(0.01, 1, length.out = 25),
  ncores = 64, ind.corr = info_snp$`_NUM_ID_`
)
The sampling paths look as attached. Although there was no divergence this time (i.e. no NaNs), all 25 chains show the same oscillating h2 paths.
Should I accept these results or continue troubleshooting?
Sourena
No, this path is odd. There is something wrong here.
What LD are you using? Is it from the same population as the summary statistics in df_beta?
Did you check hist(with(df_beta, pchisq((beta / beta_se)^2, df = 1, lower.tail = FALSE)))?
And how many variants do you have?
That's what I thought. As an additional note, all chains with use_MLE=TRUE had diverged before. I calculated LD from 2000 random subjects of the target population (white British ancestry from UKB). The GWAS is of a closely related ancestry.
I used the below template code for creating LD matrices:
for (chr in 1:22) {
  ind.chr <- which(as.integer(genotype$map$chromosome) == chr)
  print(paste("Calculating LD for", length(ind.chr), "variants in chromosome", chr))
  corr0 <- snp_cor(genotype$genotypes, ind.col = ind.chr, ncores = 1,
                   infos.pos = POS2[ind.chr], size = 3 / 1000)
  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
  print(paste("Chromosome", chr, "processed."))
}
And the Chisq histogram looks as below:
The distribution of p-values looks fine.
What are dim(corr) and file.size(corr$backingfile)?
> dim(corr)
[1] 1303558 1303558
> file.size(corr$backingfile)
[1] 52258540224
Seems fine as well.
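One more check that often catches mismatches between the LD reference and the summary statistics is comparing the per-variant SDs implied by both. A sketch for a quantitative trait (here af and n_eff are assumed to be the reference allele frequencies and per-variant effective sample sizes; the binary-trait version differs, see the LDpred2 paper):

```r
library(ggplot2)
library(bigstatsr)  # for theme_bigstatsr()

# SD of each variant implied by the LD reference allele frequencies
sd_ldref <- sqrt(2 * af * (1 - af))
# SD of each variant implied by the GWAS summary statistics
sd_ss <- with(df_beta, 1 / sqrt(n_eff * beta_se^2 + beta^2))

# Variants far from the diagonal are suspect and are usually removed
is_bad <- sd_ss < (0.5 * sd_ldref) | sd_ss > (sd_ldref + 0.1) |
  sd_ss < 0.1 | sd_ldref < 0.05

qplot(sd_ldref, sd_ss, color = is_bad, alpha = I(0.5)) +
  theme_bigstatsr() + geom_abline(linetype = 2, color = "red")
```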
Did you try with another set of GWAS summary statistics? Are the ones you're using publicly available? So that I could maybe try them.
BTW, why did you choose not to use the LD provided?
I am using two GWASs at this moment, this one is publicly available: https://www.ebi.ac.uk/gwas/studies/GCST90027158 I only keep those variants with the maximum sample size.
Well in the beginning I thought I could make the polygenic model work with 8M variants (how this thread started), so I calculated the LD locally. But now the problem is with convergence.
Thanks for your advice
I see, so you're restricting to 1.3M variants with max N. These may be highly correlated and might not cover well the genome.
I think you would benefit from restricting to HM3 or HM3+ variants, and keeping all of those with N > 0.7*max(N).
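Concretely, that restriction could look like the following (a sketch; map_hm3_plus.rds is the HM3+ variant file distributed with the LDpred2 tutorial, and sumstats is your raw summary-statistics data frame):

```r
# HM3+ variant map, downloaded beforehand as in the LDpred2 tutorial
info <- readRDS("map_hm3_plus.rds")

# Match alleles and positions of the GWAS sumstats to the HM3+ variants
df_beta <- snp_match(sumstats, info)

# Keep only variants with effective sample size above 70% of the maximum
df_beta <- subset(df_beta, n_eff > 0.7 * max(n_eff))
```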
I will run that this week and share the results. While running lassosum2 on the same data, the model optimizes with a very large delta value of 100, lambda of ~0.0008, and sparsity ~0.54, resulting in max(r2) ~ 0.003.
As far as I understand after reading the paper, delta controls the L2 regularization and shrinks the off-diagonal elements of the LD matrix. Given that the model optimizes at a very large delta value, does that mean that the calculated LD may be wrong?
Please see attached the lassosum2 results. Colors code delta in the range of 0.01 to 10000 (sorry for the lack of labels, our server doesn't X11 forward text properly and I need to rerun everything on my laptop).
A value of 100 basically means you're using the identity matrix as the LD matrix. So, yes, there must be something wrong. You shouldn't have to use more than 1 (or maybe 3 in extreme cases).
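You can see the identity-matrix effect on a toy example: ignoring the L1 part, lassosum2 roughly solves (R + delta * I) beta = r, so for large delta the joint effects collapse to a rescaled copy of the marginal GWAS effects (toy numbers below, not real data):

```r
R <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)  # toy LD matrix for 2 correlated variants
r <- c(0.5, 0.4)                          # toy marginal (GWAS) effects

# Joint effects as a function of the L2 parameter delta
joint <- function(delta) solve(R + delta * diag(2), r)

joint(0.01)  # LD-aware: the correlated variants share the signal
joint(100)   # approximately r / 101: marginal effects, merely rescaled
```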
Any update on this?
I made some observations; some context: the GWASs from which I am developing PRS are case/control psychiatric and neurological disorder GWASs. But I am optimizing them in my target cohort, not with regard to the same (binary) phenotype, but rather with respect to continuous phenotypes derived from brain MRI. These are volumetric measures of the brain (regional volumes) and the 'integrity' of white-matter tracts, both of which are well-behaved and quite Gaussian distributed.
In some cases, for example when I use a GWAS of the same MRI phenotypes to develop PRS using lassosum2, the L1/L2 penalties and optimization paths are as expected (i.e., very low L2). For example, this is a PRS generated from cortical surface area GWAS and then optimized in the target population using the same class of structural brain MRI phenotypes:
However, when I try to generate PRS using e.g. schizophrenia GWAS, I get different optimization paths for structural MRI (regional volume phenotype) versus diffusion MRI (white-matter integrity phenotype):
In the above plots, it is apparent that a strong L2 penalty increases the accuracy of the PRS in structural but not diffusion MRI.
As far as I understand, such a strong L2 would 'destroy' the LD contribution that suppresses correlated variants in the PRS model, making the model assume that all variants are independent. So I checked the Pearson correlation of each PRS generated in the grid search against an orthogonal method (SBayesR, dashed line) using another neurological GWAS (the same as in my comment last month), and it turns out that the correlation is very high anyway for many of the grid points (r > 0.8-0.9).
I don't know what the exact reason is, but it seems that increasing L2 in some cases gives an extra boost of accuracy in generating a PRS from a neuropsychiatric case/control GWAS in explaining phenotypes derived from quantitative brain MRI.
Extra notes: calculating LD from 2000 subjects of the target population versus using the off-the-shelf reference panels doesn't affect the accuracy at all, nor does using the top 3 million significant variants from the GWAS instead of the HapMap3+ variants (I find the latter very interesting).
As far as I understand, using a very strong L2 regularization corresponds to using the identity matrix as the LD reference, which corresponds to using (some scaled version of) the GWAS effect sizes directly. That is clumping and thresholding without the clumping or the thresholding, i.e. simply using the GWAS effect sizes as predictive effects without any filtering. That should not be the optimal solution.
But because you're trying to predict something else, I don't know..
Are your questions answered, or do you still have a few more?
Following this conversation, I think I better understand how the software works, but I need to explore this particular use case and make sense of the results in the coming weeks. Thanks for following up.
Any update on this?
Hi, I am planning to use LDpred2 to build PGS in UK Biobank subjects. After variant QC, I have ~9 million variants. I wonder whether I would get any accuracy boost from building the LD reference from two thousand individuals of the same cohort, instead of filtering to the HapMap3+ panel (which leaves ~1.3M variants after matching with UKB).
So far, I have managed to create LD matrices following the example code for the smaller chromosomes (chr 15-20). However, for chromosomes with > 300k variants, I receive a memory-related error when the line below tries to generate the SFBM matrix:
corr <- as_SFBM(corr0, tmp)
*** caught segfault ***
address 0x151d19847000, cause 'memory not mapped'

Traceback:
 1: col_count_sym(spmat@p, spmat@i)
 2: .Object$initialize(...)
 3: initialize(value, ...)
 4: initialize(value, ...)
 5: methods::new("SFBM", spmat = spmat, backingfile = backingfile)
 6: as_SFBM(corr0, tmp)
Any advice is greatly appreciated. Thanks, Sourena
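One workaround that may help with the memory pressure here: recent bigsnpr versions offer a compact SFBM representation, which stores only the contiguous non-zero block of each column and is considerably smaller for the banded matrices produced by snp_cor (a sketch, assuming a version of as_SFBM with the compact argument):

```r
# Compact storage keeps only the non-zero band of each column, reducing
# the memory and disk footprint of the windowed correlation matrix
corr <- as_SFBM(corr0, tmp, compact = TRUE)
```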