Closed szhang1112 closed 1 year ago
E.g. if running LDpred2-auto with 50 chains parallelized over 13 cores, it should take less than one hour.
I am not sure where you use the 2K individuals as LDpred2 uses only summary information (i.e. no individual-level data).
What set of variants are you using? Which function takes this long to run for you?
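For reference, a minimal sketch of such a run, using the bigsnpr API (assumptions: `corr` is the on-disk SFBM LD matrix, `df_beta` the matched summary statistics, and `h2_est` an LDSC-based heritability estimate — none of these are defined in this thread):

```r
library(bigsnpr)

# 50 chains, parallelized over 13 cores
multi_auto <- snp_ldpred2_auto(
  corr, df_beta, h2_init = h2_est,
  vec_p_init = seq_log(1e-4, 0.2, length.out = 50),  # 50 initial values of p
  ncores = 13
)
```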
I just split the 2K individuals into validation and testing sets, and used summary statistics from another study.
The code spent most of the time writing a .sbk file in the tmp-data folder; I am following your tutorial. So I assume it is computing the LD matrix? I am using the HM3 variants, and here are more details:
1,362,962 variants to be matched. 0 ambiguous SNPs have been removed. 391,659 variants have been matched; 0 were flipped and 0 were reversed.
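That log is what snp_match() prints. A sketch of the matching step, assuming `sumstats` holds the external summary statistics and `map` the variant info of the genotype data (variable names are assumed, not from this thread):

```r
library(bigsnpr)

# `sumstats` needs columns chr, pos, a0, a1, beta (and ideally beta_se, n_eff);
# `map` has chr, pos, a0, a1 for the genotyped/imputed variants.
df_beta <- snp_match(sumstats, map)
# prints the counts of ambiguous / matched / flipped / reversed variants
```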
Are you computing the LD matrix from the 2K individuals? That should not take long, especially for less than 400K variants. Are you using parallelism? Did you properly compute and use the variant positions in cM? I would recommend that you use the precomputed LD provided.
PS: The number of variants matched is a bit small; don't you have imputed data?
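A sketch of that LD computation with genetic positions in cM and parallelism, following the tutorial's pattern (assumes a bigSNP object `obj.bigSNP` is already loaded; names are from the bigsnpr API):

```r
library(bigsnpr)

G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos

# Interpolate physical positions to genetic positions (cM);
# downloads the interpolation maps into "tmp-data" on first use.
POS2 <- snp_asGeneticPos(CHR, POS, dir = "tmp-data")

# Sparse correlation matrix for chr1, with a 3 cM window
ind.chr <- which(CHR == 1)
corr0 <- snp_cor(G, ind.col = ind.chr, size = 3 / 1000,
                 infos.pos = POS2[ind.chr], ncores = 13)
```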
Yes. I will try to increase the number of cores. I just realized I used too few CPUs.
Thanks for your comment. This is the number after QC (e.g. geno and MAF cutoffs) for the 2K samples, which are imputed UKBB samples, so only a limited number of SNPs is left.
BTW, how do I use the precomputed LD?
It looks like, even after I increased the number of cores to 23, it is still computing the LD for chr1 after 2 hours. Is that normal? I allocated 20 GB of memory per core.
Have a look at the tutorial; the precomputed LD is mentioned there.
Please try computing the LD with snp_cor() on chr22 first, and report the object.size() of the resulting object and the time it took.
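A quick way to run that check (a sketch assuming `G`, `CHR`, `POS2`, and `NCORES` from the tutorial setup; none are defined in this thread):

```r
ind.chr22 <- which(CHR == 22)
time22 <- system.time(
  corr22 <- snp_cor(G, ind.col = ind.chr22, size = 3 / 1000,
                    infos.pos = POS2[ind.chr22], ncores = NCORES)
)
print(time22)                            # elapsed time of the computation
print(object.size(corr22), units = "MB") # size of the sparse matrix in memory
```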
After testing, I found that this function is the most time-consuming: as_SFBM(). I am not so familiar with R, but do you have any idea?
It's as_SFBM() that takes most of the time to run?? What do you have for packageVersion("bigsparser")? What is the size of the object you get from snp_cor()? Where are you trying to write the disk file in as_SFBM()? You should not use the temporary directory, but a directory where you can quickly write large files.
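The typical pattern is to pass an explicit backingfile on fast local storage (a sketch; the path is an assumed example, and the `compact` argument may require a recent bigsparser version):

```r
library(bigsparser)

# Write the on-disk matrix to fast local storage, not a network mount
# or the default temporary directory.
corr <- as_SFBM(corr0, backingfile = "/local_scratch/corr_chr22",
                compact = TRUE)
```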
Thanks for your advice and I figured it out!
Can you tell what was the problem? It might be useful for others here.
Exactly like you said: I was writing data to network-based storage, which is extremely slow. It worked after I changed it to write to a local disk.
Hi there, I met the same problem with as_SFBM(). Could you please provide more info about how you figured out the problem? I was submitting a job to the cluster, and the backingfile was generated on the cluster storage as well.
Hi privefl, I was also wondering if there is an alternative way to convert the dgCMatrix to an SFBM, since I could not run as_SFBM() smoothly. Can I convert it with the Matrix package? Thanks in advance!
You need to provide a backingfile to be stored on some partition with fast disk access.
Yes, I already provided one. It works for chromosomes with fewer variants, but not for chromosome 1, which has about 573,606 variants in my dataset. I received the error "address 0x7f5f61c77000, cause 'memory not mapped'". Do you have any idea about this? I have computed corr0 for each chromosome as a dgCMatrix.
Sorry for the unclear description... I am talking about the LD matrix step, computing the corr variable that will be used to calculate the PRS. My problem is that, even though I have added the backingfile to the code, it still doesn't generate the SFBM file with the as_SFBM() function. I guess it might be because I can only use one core on the server. So I was wondering: is it possible to merge the 22 corr0 objects, which are in dgCMatrix format, in another way instead of using an SFBM? My second question is: can I convert the dgCMatrix of chr22, for example, to an SFBM, and then add_columns() for the other chromosomes? Does the chromosome order in the SFBM file matter?
Thanks.
No, I didn't implement anything other than the SFBM. And I think it would be very inefficient to use an in-memory sparse matrix, due to parallelization.
But the question is really where you choose the backingfile to be. You should choose somewhere where I/O is fast; then there should be no problem, and it should be particularly fast (I think no more than 10 min for the 22 chromosomes). You can try first converting chr22 only.
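The usual per-chromosome pattern is to convert the first chromosome with as_SFBM() and append the rest with the SFBM's add_columns() method (a sketch; the per-chromosome .rds files and the backingfile path are assumed examples):

```r
library(bigsparser)

for (chr in 1:22) {
  # assumed: one saved dgCMatrix per chromosome
  corr0 <- readRDS(paste0("corr_chr", chr, ".rds"))
  if (chr == 1) {
    # create the on-disk SFBM on fast local storage
    corr <- as_SFBM(corr0, backingfile = "/local_scratch/corr_all",
                    compact = TRUE)
  } else {
    # append the next block on the diagonal
    corr$add_columns(corr0, nrow(corr))
  }
}
```

Note that the blocks must be appended in the same variant order as in df_beta, so the chromosome order does matter.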
Okay, thanks for your reply. I'll try chr22 then; that's exactly my other question.
Hi, I just wonder what the typical speed for running LDpred2 is. I am running it for 2K samples with 512 GB of memory, and it has taken over one week. Is there a way to speed up the process? Thanks!