Hi, While I was running the code (ldsc <- snp_ldsc2(corr0, df_beta)), it threw an error as follows: Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
It ran successfully with the example data sets. Do you have any solution for this problem? Please advise. Regards, Minta
Thanks for reporting.
What is the size of corr0?
Hi, the size of corr0 is 178,774 × 178,774, and that is just for chr22; the size will increase as we move to bigger chromosomes. Regards, Minta
Using the data from the vignette, I can run this without error (replicating each variant 4 times to make a larger test case):
ind.chr2 <- rep(info_snp$`_NUM_ID_`[ind.chr], each = 4)
corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
                 infos.pos = POS2[ind.chr2], size = 1 / 1000)
(ldsc <- snp_ldsc2(corr0, df_beta[rep(rows_along(df_beta), each = 4), ]))
Can you try to run this as well?
But maybe it does not have to do with the size, but rather with the number of non-zero elements. Could you also report length(corr0@i) / 1024**3?
It ended up with another error: Error in validityMethod(as(object, superClass)) : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522. For the initial correlation matrix of size 178,774, length(corr0@i) / 1024**3 = 1.217868.
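For scale, a rough sketch of what that count implies (my own note, assuming corr0 is a symmetric "dsCMatrix" from snp_cor(): each non-zero costs one 4-byte integer in @i plus one 8-byte double in @x):
nnz <- 1.217868 * 1024^3   # the length(corr0@i) reported above, ~1.31e9
nnz * (4 + 8) / 1024^3     # ~14.6 GiB just to store one triangle of corr0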
What version of packages {bigsnpr} and {Matrix} do you have?
packageVersion('bigsnpr')  # ‘1.4.2’
packageVersion('Matrix')   # ‘1.2.18’
Do you have a weird architecture (e.g., 32-bit)? I am not able to reproduce this.
Have you googled similar issues, such as https://community.rstudio.com/t/error-in-r-long-vectors-not-supported-yet-svm/50801/2?
It's 64-bit, not 32.
Have you looked at the similar issues I mentioned before? I don't think it has to do with this package.
Hi, thank you for the follow-up. Yes, it's similar to the problems you mentioned above, but I am not sure how to resolve it.
Instead, I tried other solutions. If I use only HapMap3-based SNPs, everything works well at every step. Our summary statistics are based on HRC SNPs, so we cannot include all of those SNPs, but a set of QC steps helps reduce the number, and eventually it works.
I am also running into this issue as I attempt to run on larger chromosomes. I am using imputed data with general PRS QC filtering, resulting in about 4.8M variants that match my summary statistics across all 22 chromosomes. I applied your suggested QC on these SNPs and removed an additional 10,652 SNPs.
Everything runs well for the smaller chromosomes such as chr21 and chr22 (around 60K matched SNPs each), but starts to fail, with a similar error to the one described in this thread, for chromosomes with more than ~180K matched SNPs (such as chr1-chr13).
Subsetting to a smaller number of individuals doesn't resolve the error; it seems to happen when the number of SNPs on a given chromosome becomes too large.
I tried giving the job plenty of RAM (>500 GB), and I can tell by monitoring the RAM usage that the job isn't running out of memory. At chromosome 13, for example, where I reproduced this error with 187,266 SNPs, the job used a max of about 137 GB of RAM.
Here is the specific line with the error:
> corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
Calls: <Anonymous> -> assert_class -> as -> asMethod
Execution halted
Do you have a dataset with enough SNPs to reproduce this error?
Thanks, and let me know if I am missing something or if there is anything else I can do to help troubleshoot from my end. Chris
No, you're not missing anything. I think there are just too many non-zero values in corr0.
What are dim(corr0) and length(corr0@x)?
Is it as(corr0, "dgCMatrix") that fails?
Thanks for getting back to me. When I am running the larger chromosomes, this is where the error occurs:
> corr0 <- snp_cor(GG, ind.col = ind.chr_my_cur_chrom, ncores = NCORES,
+                  infos.pos = POS2[ind.chr_my_cur_chrom], size = 3 / 1000)
> print(dim(corr0))
[1] 187266 187266
> print(length(corr0@x))
[1] 1179750324
> as(corr0, "dgCMatrix")
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
> corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
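A back-of-the-envelope reading of those numbers (a sketch, not a verified diagnosis): corr0 stores only one triangle of the symmetric matrix, as(corr0, "dgCMatrix") materializes both triangles, and CHOLMOD indexes non-zeros with 32-bit signed integers.
nnz_upper <- 1179750324                # length(corr0@x) reported above
2 * nnz_upper                          # ~2.36e9 non-zeros as a "dgCMatrix"
2 * nnz_upper > .Machine$integer.max   # TRUE: exceeds 2^31 - 1, hence 'problem too large'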
I'm relying on the {Matrix} package to do that. I'll see if I can reproduce the issue and find another way to do this.
After some poking around on the internet, it seems like this really is a memory issue in the conversion of corr0 to a general sparse matrix in as(corr0, "dgCMatrix"). But that would mean this conversion has unreasonably large memory requirements, when a couple hundred thousand SNPs per chromosome isn't uncommon... I'm sure you have a much better understanding of this than I do, but maybe further partitioning the chromosomes could get around it? However, splitting the chromosomes at more places than the centromere would complicate the 3cM window and LD computations.
Maybe when people have denser data like mine, there could be a solution around partitioning the chromosome into n parts with overlapping 3cM segments (and then accounting for those overlaps downstream), something like the toy sketch below.
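Purely illustrative (not a {bigsnpr} feature; block_cm and window_cm are made-up parameters, and the overlaps would still need to be handled downstream):
# Split SNPs into blocks by genetic position, padding each block by the
# LD window so that no pair of SNPs within `window_cm` of each other is
# ever split across two blocks.
split_overlapping <- function(pos_cm, block_cm = 50, window_cm = 3) {
  starts <- seq(min(pos_cm), max(pos_cm), by = block_cm)
  lapply(starts, function(s)
    which(pos_cm >= s - window_cm & pos_cm < s + block_cm + window_cm))
}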
Let me know if there is anything I can do to help. Chris
I'll try to reproduce and find an alternative way to do this.
In the meantime, I strongly recommend using HapMap3 SNPs only, as recommended in both the paper and the tutorial.
I've tried to reproduce this in two datasets. In the second one, I used 200K variants on chr1 in the UK Biobank. The (upper) correlation matrix has 535M non-zero elements, which is less than half of what you have. Are you using a 3cM window?
I'll try again with 300K or 400K variants I guess.
I was able to reproduce with 400K variants, giving 2140M non-zero elements.
Thanks for keeping us updated on your progress. I understand your point about the HapMap3 SNPs and how that would make the SNP count more manageable. I am using a 3cM window.
The question of whether more SNPs improves PRS likely depends on the disease of interest and the underlying architecture. In general, I would expect that more SNPs is better, but probably at a certain point adding more SNPs doesn't improve anything, and that would vary a bit by dataset and disease.
I have a hunch that for complex and multifactorial diseases (like asthma), the more SNPs included in the model, the better the PRS will be (to an extent). This is just a hunch, but it is based on the improvement in AUC I previously observed when running LDpred1 before and after imputation -- the imputed dataset produced much more accurate PRSs.
I also think that the question of SNP count relative to the genetic architecture is interesting in light of how asthma showed the least improvement between LDpred1 and LDpred2 in your paper. I wonder if the inclusion of more than 1.1M SNPs would yield a better AUC for phenotypes like asthma. From our end, we have gathered even more densely imputed data and would really like to be able to maximize the overlap between our array data and the summary statistics (even in the tens of millions of SNPs).
Please let me know if you have any thoughts on this, Chris
Yes, you may be right.
After 2 days of work, I think I managed to do this without having to convert to a temporary dgCMatrix.
Please try remotes::install_github("privefl/bigsnpr"). You should now be able to do as_SFBM(corr0) directly, instead of bigsparser::as_SFBM(as(corr0, "dgCMatrix")).
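A minimal usage sketch, reusing the variable names from the vignette:
corr0 <- snp_cor(G, ind.col = ind.chr, ncores = NCORES,
                 infos.pos = POS2[ind.chr], size = 3 / 1000)
corr  <- as_SFBM(corr0)  # directly, no intermediate "dgCMatrix" anymore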
The conversion takes 23 min for the matrix with 2140M elements; it should hopefully take 10-15 min for you.
Sounds great! I am giving it a try right now. Thank you very much for your hard work on this.
Thanks Florian, I will also try it.
I will test it out as well. Thank you Florian!
I am trying this on a smaller chromosome (chr22), which previously didn't produce this error, and on a larger chromosome (chr13), which previously did. I updated the bigsnpr and bigsparser packages after your latest commits on 8/11.
For both chr22 and chr13, we now get past corr <- as_SFBM(corr0) in the vignette, which is good.
For chr22:
I am having success through the infinitesimal model and the grid of models. However, for LDpred2-auto, I am now encountering the following error at snp_ldpred2_auto():
> multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
+ vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
+ ncores = NCORES)
Error in getXPtrSFBM(path = .self$backingfile, n = .self$nrow, m = .self$ncol, :
Error when mapping file:
No such file or directory.
For chr13: as I begin the infinitesimal model and compute h2, I run into the original error again at this step:
> (ldsc <- snp_ldsc2(corr0, df_beta))
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
Calls: snp_ldsc2 ... <Anonymous> -> .local -> as_gCsimpl -> as -> asMethod
1/ You're using a temporary file for corr$sbk by default; this can get deleted for any reason (I'm not sure how this could happen within the same session, though). You can always specify a backingfile in as_SFBM() and delete it later, as I do in the paper: https://github.com/privefl/paper-ldpred2/blob/master/code/run-ldpred2.R#L66-L68.
2/ OK, I guess I should find an alternative for snp_ldsc2() as well.
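For 1/, a minimal sketch of that pattern (the tmp-data directory is just an example and must exist):
tmp  <- tempfile(tmpdir = "tmp-data")     # explicit backing file, won't vanish
corr <- as_SFBM(corr0, backingfile = tmp)
# ... run snp_ldpred2_inf() / snp_ldpred2_grid() / snp_ldpred2_auto() ...
file.remove(paste0(tmp, ".sbk"))          # clean up once you're done with corr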
For 2/, this should now work with the latest version.
Thanks Florian, and my apologies for bringing up error 1/; I should have been able to figure that one out on my own. Thanks for pointing out that code; everything is working smoothly now on that front.
The (hopefully) last error that is now popping up is at:
> multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
+ vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
+ ncores = NCORES)
Error in { :
task 1 failed - "could not find function "ldpred2_gibbs_auto""
Both chr13 and chr22 run well until this point (which means I think you have solved all of the other 'problem too large' issues).
I ran into the same issue with "could not find function "ldpred2_gibbs_auto"" and was only able to fix it by setting ncores = 1.
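That is, the same call as above, just run serially (a sketch of the workaround, not an official fix):
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
                               vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
                               ncores = 1)  # same chains as before, on one core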
I'm very confused about how this could have happened, as I basically haven't modified the LDpred2 functions. I'll look into it.
This was some weird error about running package code without the package being installed (this error still really confuses me, cf. https://stackoverflow.com/a/59292935/6103040).
Anyway, I've just pushed a workaround; this should work now.
Awesome work and thank you again for your support. Everything is running smoothly for me so far today!
Thanks for letting me know about these issues, and for your patience.
I think I can close this now.
Thank you for using LDpred2. Please note that we now recommend running LDpred2 genome-wide instead of per chromosome. The paper (preprint) and tutorial have been updated.
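The genome-wide pattern looks roughly like this (a simplified sketch of the updated tutorial, assuming df_beta has a chr column; see the tutorial for the exact code):
tmp <- tempfile(tmpdir = "tmp-data")
for (chr in 1:22) {
  ind.chr <- which(df_beta$chr == chr)
  corr0 <- snp_cor(G, ind.col = ind.chr, ncores = NCORES,
                   infos.pos = POS2[ind.chr], size = 3 / 1000)
  if (chr == 1) {
    corr <- as_SFBM(corr0, tmp)          # start the genome-wide SFBM
  } else {
    corr$add_columns(corr0, nrow(corr))  # append this chromosome's block
  }
}
# then run LDpred2 (inf / grid / auto) once on the genome-wide corr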