GRealesM closed this issue 4 years ago
The 1000G data in the tutorial is just some fake data for the example; if you can use a larger reference panel (e.g. 10K individuals from the UKBB, or your own data), it would probably be better.
You don't need to use `subset()`; you can instead use the parameter `ind.row` in `snp_cor()`.
Since you have the latest versions of the packages, you can do `corr <- as_SFBM(corr0)` instead of `corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))`; that will save you a bit of time and memory.
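Putting those two points together, a minimal sketch (assuming a bigSNP object `obj.bigSNP`, row indices `ind.val`, column indices `ind.chr`, and genetic positions `POS2`, as in the tutorial — these names are placeholders, not prescribed):

```r
library(bigsnpr)

G <- obj.bigSNP$genotypes

# Restrict to a subset of individuals via ind.row,
# instead of creating a subsetted bigSNP with subset():
corr0 <- snp_cor(G, ind.row = ind.val, ind.col = ind.chr,
                 infos.pos = POS2[ind.chr], size = 3 / 1000,
                 ncores = nb_cores())

# With recent package versions, convert the sparse matrix
# directly to an on-disk SFBM:
corr <- as_SFBM(corr0)
```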
I've learned over the years that it is bad practice to write large files to /tmp; you should rather make some tmp-data directory and write there (as in https://github.com/privefl/paper-ldpred2/blob/master/code/run-ldpred2.R#L66-L68). I could probably do this in the tutorial as well. Does this solve your problem?
Thanks for your quick reply! Regarding (1), would a subset of European individuals from the 1000 Genomes Phase III reference panel (which I already have in bed format, and would be easy for me to use) be a valid panel? I don't have access to UKBB individual data at the moment. For (2), (3), and (4), I will try to do that. Hopefully this will solve my problem; I'll let you know if it does. Thanks!
Just to clarify: I'm trying to obtain PGS models (i.e. weighted effects, the output of `snp_ldpred2_auto`) using QC'ed summary statistics. I don't have an external individual-level dataset to tune the parameters, which is why I'm using LDpred2-auto.
From what I understand, in order to compute `corr` we need individual genotypes, which can come from a panel. In the tutorial you use 1000 Genomes Phase I, and you simulate a fake phenotype to test your model and choose the best one from a number of models created using different `h2_est` and `init_p` values. I understand that that's why you say the data is fake, but I only need the panel to compute the LD matrix `corr`, and I believe it would be reasonable to use 1000G Phase III to that end.
Is my rationale right?
Thank you!
Hi,
Thanks for your interest in LDpred2. I recommend that you use a larger sample size for the LD reference than the 1000G data; I believe you'd ideally want between 2,000 and 10,000 individuals. Note that you can use the test/validation data as your LD reference without a (meaningful) risk of bias. I'm not sure what we can recommend if you don't have access to such data. We might consider providing LD for a "good" subset of variants based on the UKB in the future (this would not be individual-level data), but we currently do not support that. Also, this would be an annoyingly large file to download.
Best, Bjarni
I don't remember exactly, but the 22 LD matrices based on HM3 variants are about 3 GB in total.
That's not too bad! Maybe we can make them available to users. It would also avoid issues related to poorly QC'd LD references, etc.
I don't know... with all the issues reported for some variants in the UKBB lately, I'm not even sure it would be a very good reference for now.
That sounds like a good idea, Florian. Would it be possible for you to make LD matrices based on HapMap3 variants available for users? I think that would largely solve my problem, and would also be helpful for other users who don't have a big panel at hand to compute LD matrices from. Thanks a lot for your comments and help!
I'll think about it. As I said, UKBB data is not perfect at all.
Does writing to another directory than /tmp solve your problem?
I will try that, but as Bjarni said, my 1000Genomes panel is not large enough for generating the LD matrices, so I'm not sure if it's even worth the try.
It is large enough to perform the analysis; it's just that you would probably get slightly better results with a larger reference panel.
Ok, I will try to run with it and I'll let you know.
But what are you doing with these PGS? If you test the prediction somewhere else, you could maybe use that data for computing the LD.
I'm comparing LDpred2-auto performance with our method RápidoPGS, using an evaluation method that uses summary-level data only. I will test their predictions using individual data in the future, but right now I'm interested in how the models compare at the summary-statistic level. I will keep in mind, though, that with bigger individual datasets, LDpred2-auto results will likely improve. I'll try to get the 1KG panel up and running, and I'll let you know if I manage to fix the issue. Thanks a lot to both of you for your comments and suggestions.
I'll likely make an LD reference available in the future. But, as we have some issues with the cluster at the moment, it might not be before 1-2 months.
Ok, it worked when I explicitly added a line to remove the temporary files after each iteration, thus freeing space in /tmp:
```r
temp_file <- tempfile()
[...]
unlink(paste0(temp_file, "*"))
```
Thanks!
Please note that I've just released some LD reference here: https://doi.org/10.6084/m9.figshare.13034123. There is also one example script on how to use it there. A new version of the paper should appear on bioRxiv tomorrow.
Thank you for using LDpred2. Please note that we now recommend running LDpred2 genome-wide instead of per chromosome. The paper (preprint) and tutorial have been updated.
Hi Florian,
I hope you're well. I've been trying to run LDpred2 after implementing the QC steps that you suggested, and I included multiple `init_p` values so that it fully benefits from parallel computing. Also, I used the 1000G Phase I reference panel following the tutorial, but filtered by the HapMap3 variants at the bottom, instead of thinning. I tried to compute it on our HPC both interactively and as a Slurm job. See below for my sessionInfo() and the function I'm running:
And my function:
However, I keep getting an error along the lines of the one below, which resembles the one in #96:
I investigated the issue (maybe the answer is the same as for #96), and I believe it has something to do with the fact that `bigsparser::as_SFBM` writes the matrix to disk (in order not to overload RAM, I presume). I believe the problem arises when the files, stored in /tmp (correct me if I'm wrong!), reach 16 GB, which is the limit in my case; then everything collapses.
I'd be very interested to know whether you agree with my diagnosis, whether there's an alternative way to do it (maybe a way to provide `snp_ldpred2_auto` with a matrix from RAM, rather than one written to disk? I know this might be harder to implement; any other ideas?), and whether my code for running LDpred2 is more or less correct, so I can run LDpred2-auto as accurately as possible.
Thanks again for your useful comments on our previous attempt. Cheers!