privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Cholmod error 'problem too large' #89

Closed · Minta821 closed this 3 years ago

Minta821 commented 4 years ago

Hi, while I was running the code (ldsc <- snp_ldsc2(corr0, df_beta)), it threw the following error: Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92

It ran successfully with the example data sets. Do you have any solution for this problem? Please suggest. Regards, Minta

privefl commented 4 years ago

Thanks for reporting.

What is the size of corr0?

Minta821 commented 4 years ago

Hi, the size of corr0 is 178774 x 178774, and that is just for chr 22; the size will increase as we move to bigger chromosomes. Regards, Minta

privefl commented 4 years ago

Using the data from the vignette, I can run this without error:

ind.chr2 <- rep(info_snp$`_NUM_ID_`[ind.chr], each = 4)
corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES, 
                 infos.pos = POS2[ind.chr2], size = 1 / 1000)
(ldsc <- snp_ldsc2(corr0, df_beta[rep(rows_along(df_beta), each = 4), ]))

Can you try to run this as well? But maybe it has nothing to do with the size, but rather with the number of non-zero elements. Could you also report length(corr0@i) / 1024**3?
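For reference, here is one quick way to gather these diagnostics (just a sketch, assuming corr0 is the sparse matrix returned by snp_cor()):

dim(corr0)                               # matrix dimensions
length(corr0@i)                          # number of stored non-zero entries
length(corr0@i) / 1024**3                # scaled so that the 2^31 - 1 integer limit shows up as ~2
print(object.size(corr0), units = "Gb")  # memory used by the in-memory sparse matrix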

Minta821 commented 4 years ago

It ended up with another error:

Error in validityMethod(as(object, superClass)) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522

For the initial correlation matrix of size 178774, length(corr0@i) / 1024**3 = 1.217868.

privefl commented 4 years ago

What version of packages {bigsnpr} and {Matrix} do you have?

Minta821 commented 4 years ago

packageVersion('bigsnpr')  # ‘1.4.2’
packageVersion('Matrix')   # ‘1.2.18’

privefl commented 4 years ago

Do you have a weird architecture (e.g. 32 bits)? I am not able to reproduce this.

privefl commented 4 years ago

Have you googled similar issues, such as https://community.rstudio.com/t/error-in-r-long-vectors-not-supported-yet-svm/50801/2?

Minta821 commented 4 years ago

It's 64-bit, not 32.

privefl commented 4 years ago

Have you looked at the similar issues I mentioned before? I don't think it has anything to do with this package.

Minta821 commented 4 years ago

Hi, thank you for the follow-up. Yes, it's a problem similar to the ones you mentioned above, but I am not sure how to resolve it.

Instead, I tried other solutions. If I use only HapMap3-based SNPs, every step works well. Our summary statistics are based on HRC SNPs, so we cannot include all of those SNPs, but a QC process would help reduce the number, and eventually it should work.

char4816 commented 3 years ago

I am also running into this issue as I attempt to run on larger chromosomes. I am using imputed data with general PRS QC filtering, resulting in about 4.8M variants that match my summary statistics across all 22 chromosomes. I apply your suggested QC on these SNPs and remove an additional 10,652 SNPs. After removing these SNPs, I am looking at the following counts of matched SNPs for each chromosome: [per-chromosome table of matched SNP counts attached as an image]

Everything runs well for the smaller chromosomes such as chr21 and chr22 (around 60K matched SNPs each), but it begins to fail for the larger chromosomes with more than ~180K SNPs (such as chr1 through chr13), with an error similar to the one described in this thread.

Subsetting to a smaller number of individuals doesn't resolve the error; it seems to happen when the number of SNPs on a given chromosome becomes too large.

I tried giving the job plenty of RAM (>500GB), and I can tell by monitoring the RAM usage that the job isn't running out of memory. For chromosome 13, for example, where I produced this error on 187,266 SNPs, the job used a maximum of about 137GB of RAM.

Here is the specific line with the error:

> corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
Calls: <Anonymous> -> assert_class -> as -> asMethod
Execution halted

Do you have a dataset with enough SNPs to reproduce this error?

Thanks, and let me know if I am missing something or if there is anything else I can do to help troubleshoot from my end. Chris

privefl commented 3 years ago

No, you're not missing anything. I think there are just too many non-zero values in corr0.

What is dim(corr0) and length(corr0@x)? Is it as(corr0, "dgCMatrix") that fails?
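As a back-of-the-envelope check (just a sketch, assuming corr0 is the symmetric dsCMatrix returned by snp_cor()): converting to "dgCMatrix" stores both triangles, so the number of stored entries roughly doubles, and exceeding the 2^31 - 1 limit on stored entries is likely what triggers this Cholmod error.

nnz_sym <- length(corr0@x)      # non-zero entries stored (upper triangle only)
nnz_gen <- 2 * nnz_sym          # rough count after conversion to general ("dgCMatrix") storage
nnz_gen > .Machine$integer.max  # TRUE would suggest the conversion itself is what blows up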

char4816 commented 3 years ago

Thanks for getting back to me. When I am running the larger chromosomes, this is where the error occurs:

>     corr0 <- snp_cor(GG, ind.col = ind.chr_my_cur_chrom, ncores = NCORES,
+                infos.pos = POS2[ind.chr_my_cur_chrom], size = 3 / 1000)
>     print(dim(corr0))
[1] 187266 187266
>     print(length(corr0@x))
[1] 1179750324
>     as(corr0, "dgCMatrix")
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
>     corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92

privefl commented 3 years ago

I'm relying on the {Matrix} package to do that. I'll see if I can reproduce the issue and find another way to do this.

char4816 commented 3 years ago

After some poking around on the internet, it seems like this really is a memory issue in the conversion of corr0 to the sparse matrix in as(corr0, "dgCMatrix"). But that would mean this conversion has unreasonably large memory requirements, when a couple hundred thousand SNPs per chromosome isn't uncommon... I'm sure you have a much better understanding of this than I do, but maybe further partitioning the chromosomes could get around this? However, splitting the chromosomes at places other than the centromere would complicate the 3cM window and the LD computations.

Maybe, for people with denser data like mine, there could be a solution that partitions the chromosome into n parts with overlapping 3cM segments (and then accounts for those overlaps downstream), or something like that.

Let me know if there is anything I can do to help. Chris

privefl commented 3 years ago

I'll try to reproduce and find an alternative way to do this.

In the meantime, I strongly recommend using HapMap3 SNPs only, as recommended in both the paper and the tutorial.
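A minimal sketch of how one might restrict to HapMap3 variants before matching (here info_hm3 is assumed to be a table of HapMap3 variants with chr and pos columns, e.g. the one provided with the LDpred2 tutorial, and sumstats and map are the summary statistics and genotype map with chr/pos/a0/a1 columns, as in the vignette):

library(dplyr)
sumstats_hm3 <- semi_join(sumstats, info_hm3, by = c("chr", "pos"))  # keep only HapMap3 positions
df_beta <- snp_match(sumstats_hm3, map)                              # then match alleles as usual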

privefl commented 3 years ago

I've tried to reproduce this in two datasets. In the second one, I used 200K variants on chr1 in the UK Biobank. The (upper) correlation matrix has 535M non-zero elements, which is less than half of what you have. Are you using a 3cM window?

I'll try again with 300K or 400K variants I guess.

privefl commented 3 years ago

I was able to reproduce with 400K variants, giving 2140M non-zero elements.

char4816 commented 3 years ago

Thanks for keeping us updated on your progress. I understand your point about the HapMap3 SNPs and how that would make the SNP count more manageable. I am using a 3cM window.

The question of whether more SNPs improve a PRS likely depends on the disease of interest and the underlying architecture. In general, I would expect more SNPs to be better, but at a certain point adding more SNPs probably doesn't improve anything, and that would vary a bit by dataset and disease. I have a hunch that for complex and multifactorial diseases (like asthma), the more SNPs included in the model, the better the PRS will be (to an extent). This is just a hunch, but it is based on the improvement in AUC I previously observed when I ran LDpred1 before and after imputation: the imputed dataset produced much more accurate PRSs.

I also think the question of SNP count relative to the genetic architecture is interesting in light of how asthma showed the least improvement between LDpred1 and LDpred2 in your paper. I wonder whether including more than 1.1M SNPs would yield a better AUC for phenotypes like asthma. On our end, we have gathered even more densely imputed data and would really like to be able to maximize the overlap between our array data and the summary statistics (even into the tens of millions of SNPs).

Please let me know if you have any thoughts on this, Chris

privefl commented 3 years ago

Yes, you may be right.

After 2 days of work, I think I managed to do this without having to convert to a temporary dgCMatrix.

Please try remotes::install_github("privefl/bigsnpr"). You should now be able to do as_SFBM(corr0) directly, instead of bigsparser::as_SFBM(as(corr0, "dgCMatrix")). The conversion takes 23 min for the matrix with 2140M elements; it should hopefully take 10-15 min for you.
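In other words, something along these lines should now work (a sketch mirroring the snp_cor() call from earlier in this thread; GG, ind.chr_my_cur_chrom, POS2 and NCORES are assumed to be defined as there):

# remotes::install_github("privefl/bigsnpr")   # development version with the new conversion
library(bigsnpr)
corr0 <- snp_cor(GG, ind.col = ind.chr_my_cur_chrom, ncores = NCORES,
                 infos.pos = POS2[ind.chr_my_cur_chrom], size = 3 / 1000)
corr  <- as_SFBM(corr0)   # direct conversion, no intermediate dgCMatrix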

char4816 commented 3 years ago

Sounds great! I am giving it a try right now. Thank you very much for your hard work on this.

Minta821 commented 3 years ago

Thanks Florian, I will also try it.

gaguerra commented 3 years ago

I will test it out as well. Thank you Florian!

char4816 commented 3 years ago

I am trying this on a smaller chromosome (chr22), which previously didn't produce this error, and on a larger chromosome (chr13), which previously did. I updated the bigsnpr and bigsparser packages after your latest commits on 8/11.

For both chr22 and chr13, we are now getting past corr <- as_SFBM(corr0) in the vignette, which is good.

For chr22: I am having success through the infinitesimal model and the grid of models. However, for LDpred2-auto, I am now encountering the following error at snp_ldpred2_auto:

> multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
+                                vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
+                                ncores = NCORES)
Error in getXPtrSFBM(path = .self$backingfile, n = .self$nrow, m = .self$ncol,  : 
  Error when mapping file:
  No such file or directory.

For chr13: As I begin the infinitesimal model and compute h2, I run into the original error again at this step:

> (ldsc <- snp_ldsc2(corr0, df_beta))
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
Calls: snp_ldsc2 ... <Anonymous> -> .local -> as_gCsimpl -> as -> asMethod

privefl commented 3 years ago

1/ By default, you're using a temporary file for corr$sbk, and this can get deleted for any reason. I'm not sure how that could happen within the same session, though. You can always specify a backingfile in as_SFBM() and delete it later, as I do in the paper: https://github.com/privefl/paper-ldpred2/blob/master/code/run-ldpred2.R#L66-L68.

2/ Ok, I guess I should find an alternative for snp_ldsc2() as well.
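Regarding 1/, a minimal sketch following the pattern of the linked run-ldpred2.R code (the "tmp-data" directory is just a placeholder for an existing location you control):

tmp  <- tempfile(tmpdir = "tmp-data")      # backing file in a directory you control
corr <- as_SFBM(corr0, backingfile = tmp)
# ... run snp_ldsc2(), snp_ldpred2_auto(), etc. ...
file.remove(paste0(tmp, ".sbk"))           # clean up the backing file when done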

privefl commented 3 years ago

For 2/, this should now work with the latest version.

char4816 commented 3 years ago

Thanks Florian. My apologies for bringing up error 1/; I should have been able to figure that out on my own. Thanks for pointing out that code; everything regarding that is now working smoothly.

The (hopefully) last error that is now popping up is at

> multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
+                                vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
+                                ncores = NCORES)
Error in { : 
  task 1 failed - "could not find function "ldpred2_gibbs_auto""

Both chr13 and chr22 run well until this point, which means I think you have solved all of the other issues regarding 'problem too large'.

gaguerra commented 3 years ago

I ran into the same issue with "could not find function "ldpred2_gibbs_auto"" and was only able to fix it by setting ncores = 1.

privefl commented 3 years ago

I'm very confused about how this could have happened, as I have basically not modified the LDpred2 functions. I'll look into it.

privefl commented 3 years ago

This was some weird error about running package code without the package being installed (this error still really confuses me; cf. https://stackoverflow.com/a/59292935/6103040).

Anyway, I've just pushed a workaround; this should work now.

char4816 commented 3 years ago

Awesome work and thank you again for your support. Everything is running smoothly for me so far today!

privefl commented 3 years ago

Thanks for letting me know about these issues, and for your patience.

I think I can close this now.

privefl commented 3 years ago

Thank you for using LDpred2. Please note that we now recommend running LDpred2 genome-wide instead of per chromosome. The paper (preprint) and tutorial have been updated.
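For anyone landing here later, the genome-wide pattern from the updated tutorial looks roughly like this (just a sketch; G, POS2, NCORES, df_beta, and the backing-file path tmp are assumed to be defined as in the tutorial, and a recent {bigsparser} with the SFBM add_columns() method is assumed):

for (chr in 1:22) {
  ind.chr <- which(df_beta$chr == chr)                 # matched variants on this chromosome
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]              # their column indices in G
  corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
                   infos.pos = POS2[ind.chr2], size = 3 / 1000)
  if (chr == 1) {
    corr <- as_SFBM(corr0, tmp)                        # start the genome-wide SFBM on disk
  } else {
    corr$add_columns(corr0, nrow(corr))                # append this chromosome's block
  }
}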