privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Issues with LDpred2 crashing when using a large number of SNPs #324

Closed lindseysinclair closed 2 years ago

lindseysinclair commented 2 years ago

Dear Florian

I'm really sorry to have to contact you, but I have been having all sorts of problems getting LDpred2 to work on my dataset. I am using R 4.1.2 on our high-performance computing cluster, on a node with 380 GB of RAM. I have based my code on your tutorial.

I have a base dataset which is the Wray et al 2018 depression GWAS and an imputed target dataset from a population study with about 1000 people. Once I can get it working for this I intend to do the same thing for my other datasets from slightly different studies. Basically I want to be able to calculate a depression PRS for each individual in my target dataset.

I have QC'd both the base and target data using stringent thresholds, and after matching with the snp_match command I end up with just over 4.2 million matched SNPs. After a bit of further QC I have about 4 million SNPs left.

It then all goes wrong at the corr0 <- line.

Basically I keep getting the Cholmod error that the problem is too large.

A research software engineer from my university kindly looked into this for me, and his conclusions are pasted below. Basically it looks to me like LDpred2 has a ceiling on the number of SNPs that can be put into it. Is this right? If not, I would be so grateful if you could please let me know what I can do to get this working!

Thanks ever so much

Lindsey

Summary

Using the data provided ..., the script BDR_Rscript_latest.R fails with an error in Matrix::colSums() when it is called with the square of the correlation matrix calculated by snp_cor() as its argument. This occurs for some chromosomes (including chromosome 1). The error produced (with traceback) is:

Error in asMethod(object) :
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 89
Calls: ... -> .local -> as_gCsimpl -> as -> asMethod
6: asMethod(object)
5: as(from, paste0(.M.kind(from), "gCMatrix"))
4: as_gCsimpl(x)
3: .local(x, na.rm, dims, ...)
2: Matrix::colSums(corr0^2)
1: Matrix::colSums(corr0^2)

Since we know chromosome 1 causes this error, I restricted my testing to chromosome 1.

The region of R code in which the error occurs for chromosome 1 is

chr <- 1
ind.chr <- which(df_beta$chr == chr)
ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
                 infos.pos = POS2[ind.chr2], size = 3 / 1000)
ld <- Matrix::colSums(corr0^2)
corr <- as_SFBM(corr0, tmp, compact = TRUE)

I was able to save the result of snp_cor() (corr0) as an RDS file prior to the error in Matrix::colSums(), which allowed me to interrogate the object.
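(For illustration, a minimal sketch of how the saved object can be inspected in a separate R session; the file name is a placeholder.)

saveRDS(corr0, "corr0_chr1.rds")   # save before the failing colSums() call

corr0 <- readRDS("corr0_chr1.rds")
class(corr0)       # "dsCMatrix": symmetric sparse matrix from the Matrix package
dim(corr0)         # 357451 x 357451
length(corr0@x)    # number of stored non-zero values (one triangle only)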

I found that for chromosome 1, corr0 is an object of class dsCMatrix, i.e. an in-memory symmetric sparse matrix from the Matrix package. It has dimensions 357451 by 357451 and stores 1744936754 non-zero elements, which is half of the total number of non-zero elements (only one triangle needs to be stored because the matrix is symmetric).

1744936754 8-byte floating point numbers occupy just over 13 GiB. To store the entire dense matrix (357451^2 elements) as 8-byte floating point numbers would require nearly 952 GiB of memory, so the sparse representation saves a lot of space!
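(The arithmetic behind those figures, as a quick sanity check; not part of the original report.)

nnz_stored <- 1744936754
n <- 357451
nnz_stored * 8 / 2^30   # ~13 GiB for the stored non-zero values alone
n^2 * 8 / 2^30          # ~952 GiB if the whole matrix were stored dense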

The sparse matrix (which occupies 19.5 GiB of memory) is successfully created by snp_cor(), provided enough RAM is requested in the job submission script, but the subsequent Matrix::colSums() operation fails with the error above.

Looking at the location in the CHOLMOD source code where the error occurs, I can see that the error is triggered by a check for integer overflow in the dimensions or number of non-zero elements of a sparse matrix. CHOLMOD is used by R's Matrix package behind the scenes.

To test the limits of Matrix::colSums(), I created a script that produces various dsCMatrix sparse matrices of increasing size and increasing number of non-zero elements. This testing indicated that the problem with the corr0 matrix for chromosome 1 is that its number of non-zero elements is too large for CHOLMOD to handle.
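(A minimal sketch of that kind of stress test, for illustration only; it assumes rsparsematrix() with symmetric = TRUE gives a dsCMatrix, and actually pushing the non-zero count above INT_MAX needs tens of GiB of RAM, so the sizes below are deliberately small.)

library(Matrix)

# Build a symmetric sparse matrix of a given size/density and try the same
# operation that fails in the script: colSums() of its element-wise square.
test_colsums <- function(n, density) {
  m <- rsparsematrix(n, n, density = density, symmetric = TRUE)
  cat("stored non-zeros:", length(m@x), "\n")
  tryCatch({
    colSums(m^2)
    cat("colSums() OK\n")
  }, error = function(e) cat("colSums() failed:", conditionMessage(e), "\n"))
}

test_colsums(1e4, 0.01)   # small case: works
# increase n and/or density towards ~2^31 stored non-zeros to reproduce the error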

It appears that the maximum number of non-zero elements is determined by the maximum value of a signed integer in CHOLMOD, INT_MAX (which itself depends on how the software was built). Signed integers are often 4 bytes, giving an INT_MAX of 2147483647.

In testing, symmetric sparse matrices with fewer non-zero elements than the 4-byte INT_MAX did not cause Matrix::colSums() to fail, while matrices with more non-zero elements than INT_MAX caused Matrix::colSums() to produce the "problem too large" error.

corr0 for chromosome 1 has 1744936754 * 2 = 3489873508 non-zero elements in total, which is larger than INT_MAX. I think this is why Matrix::colSums() produces the error message in this case.
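(In R terms, again just as a check:)

nnz_total <- 2 * 1744936754       # 3489873508 non-zero elements in total
.Machine$integer.max              # 2147483647, i.e. INT_MAX for a 4-byte signed int
nnz_total > .Machine$integer.max  # TRUE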

Conclusion

The current code will only work where the number of non-zero elements in the sparse matrix passed to Matrix::colSums() is smaller than INT_MAX (2147483647). For a symmetric sparse matrix, the maximum number of stored non-zero values is therefore half of INT_MAX.

privefl commented 2 years ago

Yes, this problem has been reported and discussed before. There are many solutions to it, but the only one that is currently recommended is to restrict to HapMap3 variants. There is no evidence that using more variants will lead to improved predictive performance.
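(For anyone reading along, a hedged sketch of what restricting to HapMap3 variants can look like; map_hm3 here stands for a data frame of HapMap3 variants such as the one distributed with the LDpred2 tutorial, and filtering by rsid is just one way to do it.)

library(bigsnpr)

# sumstats: GWAS summary statistics formatted for snp_match() (chr, pos, a0, a1, beta, ...)
# map     : variant information from the target data, as in the tutorial
# map_hm3 : data frame of HapMap3 variants (assumed already loaded)

sumstats_hm3 <- sumstats[sumstats$rsid %in% map_hm3$rsid, ]  # assumes an 'rsid' column in both
df_beta <- snp_match(sumstats_hm3, map)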

lindseysinclair commented 2 years ago

Thanks Florian. That worked a treat. Do you happen to know the maximum number of SNPs that can reasonably be used?

privefl commented 2 years ago

I guess the limit is the one you've described, which seems to be reached at around 400K variants per chromosome.

But this depends on many things, e.g. the window size (if you use e.g. 2 cM instead of 3 cM) or whether you define additional blocks (to reduce to e.g. 60-70% of the non-zero values), etc.
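(For illustration, using a 2 cM window with the objects from the script above just means changing the size argument; sketch only.)

corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
                 infos.pos = POS2[ind.chr2], size = 2 / 1000)  # 2 cM instead of 3 cM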

But the main point is that using more variants has not been validated and can actually reduce predictive power when M/N (the number of variants relative to the GWAS sample size) is too large.