privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Slow running #415

Closed szhang1112 closed 1 year ago

szhang1112 commented 1 year ago

Hi, I was just wondering what the typical running time is for LDpred2. I am running it on 2k samples with 512G of memory, and it has taken over a week. Is there a way to speed up the process? Thanks!

privefl commented 1 year ago

E.g. if running LDpred2-auto with 50 chains parallelized over 13 cores, it should take less than one hour.

I am not sure where you use the 2K individuals as LDpred2 uses only summary information (i.e. no individual-level data).

What set of variants are you using? Which function takes this long to run for you?
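For reference, a run along those lines would look roughly like the minimal sketch below, following the tutorial's conventions. It assumes `corr` (the on-disk SFBM LD matrix) and `df_beta` (the matched summary statistics with columns `beta`, `beta_se`, `n_eff`) already exist; `NCORES` is just a placeholder:

```r
library(bigsnpr)

NCORES <- 13  # cores to parallelize the chains over

# LDpred2-auto with 50 chains, each starting from a different initial value of p
multi_auto <- snp_ldpred2_auto(
  corr, df_beta,
  h2_init    = 0.3,                                  # e.g. an LDSC estimate of h2
  vec_p_init = seq_log(1e-4, 0.2, length.out = 50),  # 50 chains
  ncores     = NCORES
)
```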

szhang1112 commented 1 year ago

I just split 2k into validation and testing sets and used summary stats from another study.

The code spends most of its time writing a .sbk file in the tmp-data folder (I am following your tutorial), so I assume it is calculating the LD matrix? I am using the HapMap3 (HM3) variants; here are more details:

1,362,962 variants to be matched. 0 ambiguous SNPs have been removed. 391,659 variants have been matched; 0 were flipped and 0 were reversed.
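(For context, that log is printed by the matching step; a minimal sketch of it, assuming `sumstats` holds the external summary statistics with columns `chr`, `pos`, `a0`, `a1`, `beta`, `beta_se`, `n_eff`, and `map` is the variant map of the genotype data:)

```r
library(bigsnpr)

# match the summary statistics to the variants of the genotype data;
# this prints the "variants to be matched / matched / flipped / reversed" log
df_beta <- snp_match(sumstats, map)
```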

privefl commented 1 year ago

Are you computing the LD matrix from the 2K individuals? That should not take long, especially for less than 400K variants. Are you using parallelism? Did you properly compute and use the variant positions in cM? I would recommend that you use the precomputed LD provided.

PS: The number of variants matched is a bit small; don't you have imputed data?
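(If you do compute the LD yourself, a minimal sketch of that step following the tutorial's conventions; `G` is the FBM of genotypes, `CHR`/`POS` are the variant chromosomes and physical positions, and `NCORES` is a placeholder for the number of cores:)

```r
library(bigsnpr)

NCORES <- nb_cores()

# interpolate physical positions to genetic positions in cM
POS2 <- snp_asGeneticPos(CHR, POS, dir = "tmp-data", ncores = NCORES)

chr <- 22
ind.chr <- which(CHR == chr)

# sparse LD (correlation) matrix for one chromosome, 3 cM window, parallelized
corr0 <- snp_cor(G, ind.col = ind.chr, size = 3 / 1000,
                 infos.pos = POS2[ind.chr], ncores = NCORES)

print(object.size(corr0), units = "GB")
```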

szhang1112 commented 1 year ago

> Are you computing the LD matrix from the 2K individuals? That should not take long, especially for less than 400K variants. Are you using parallelism? Did you properly compute and use the variant positions in cM? I would recommend that you use the precomputed LD provided.
>
> PS: The number of variants matched is a bit small; don't you have imputed data?

Yes. I will try to increase the number of cores. I just realized I used too few CPUs.

Thanks for your comment - this is the number after QC (e.g. with geno and maf cutoffs) for the 2k samples, which are imputed UKBB samples, so only a limited number of SNPs are left.

BTW, how do I use the precomputed LD?

szhang1112 commented 1 year ago

It looks like even though I increased the number of cores to 23, it is still computing the LD for chr1 after 2 hours. Is that normal? I allocated 20G of memory for each core.

privefl commented 1 year ago

Have a look at the tutorial; the precomputed LD is mentioned there.

Please try computing the LD with snp_cor() on chr22 first, and report the object.size() of the resulting object and the time it took.
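(Roughly, using the precomputed LD boils down to reading one provided correlation matrix per chromosome instead of computing it; a minimal sketch, where "ldref/" and the file names are placeholders for wherever you downloaded and unzipped the files linked in the tutorial:)

```r
library(bigsnpr)

# variant map of the LD reference (placeholder path)
map_ldref <- readRDS("ldref/map.rds")

# match your external summary statistics to the LD reference variants
df_beta <- snp_match(sumstats, map_ldref)

for (chr in 1:22) {
  # precomputed sparse LD matrix for this chromosome (placeholder file name)
  corr0 <- readRDS(paste0("ldref/LD_chr", chr, ".rds"))
  # then restrict corr0 to the matched variants and append it to the on-disk SFBM
}
```

The loop body is only indicated here; the tutorial shows the exact subsetting and the as_SFBM()/add_columns() calls (see also the sketch further down in this thread).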

szhang1112 commented 1 year ago

> Have a look at the tutorial; the precomputed LD is mentioned there.
>
> Please try computing the LD with snp_cor() on chr22 first, and report the object.size() of the resulting object and the time it took.

After testing, I found that the most time-consuming function is as_SFBM(). I am not very familiar with R, but do you have any idea why?

privefl commented 1 year ago

It's as_SFBM() that takes most of the time to run?? What do you have for packageVersion("bigsparser")? What is the size of the object you get from snp_cor()? Where are you trying to write the disk file in as_SFBM()? You should not use the temporary directory, but a directory where you can quickly write large files.
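(As a concrete illustration of those points, a minimal sketch; `corr0` stands for the dgCMatrix returned by snp_cor(), and "/local/scratch/ldpred2" is a placeholder for a directory on a fast local disk:)

```r
library(bigsnpr)

packageVersion("bigsparser")              # check the installed version
print(object.size(corr0), units = "GB")   # size of the sparse matrix from snp_cor()

# write the on-disk SFBM (.sbk file) to a fast local disk,
# not to a slow network or temporary directory
corr <- as_SFBM(corr0, backingfile = "/local/scratch/ldpred2/corr", compact = TRUE)
```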

szhang1112 commented 1 year ago

> It's as_SFBM() that takes most of the time to run?? What do you have for packageVersion("bigsparser")? What is the size of the object you get from snp_cor()? Where are you trying to write the disk file in as_SFBM()? You should not use the temporary directory, but a directory where you can quickly write large files.

Thanks for your advice and I figured it out!

privefl commented 1 year ago

Can you tell what was the problem? It might be useful for others here.

szhang1112 commented 1 year ago

> Can you tell what was the problem? It might be useful for others here.

Exactly like you said: I was writing the data to network-based storage, which is extremely slow. It worked after I changed to writing to a local disk.

SoleilChenxu commented 1 year ago

> Can you tell what was the problem? It might be useful for others here.
>
> Exactly like you said: I was writing the data to network-based storage, which is extremely slow. It worked after I changed to writing to a local disk.

Hi there, I ran into the same problem with as_SFBM(). Could you please provide more info about how you figured out the problem? I was submitting a job to the cluster, and the backingfile was being written to the cluster storage as well.

SoleilChenxu commented 1 year ago

> It's as_SFBM() that takes most of the time to run?? What do you have for packageVersion("bigsparser")? What is the size of the object you get from snp_cor()? Where are you trying to write the disk file in as_SFBM()? You should not use the temporary directory, but a directory where you can quickly write large files.

Hi privefl, I was also wondering if there is an alternative way to convert the dgCMatrix to an SFBM, since I cannot run as_SFBM() smoothly... Can I do the conversion with the Matrix package? Thanks in advance!

privefl commented 1 year ago

You need to provide a backingfile to be stored on some partition with fast disk access.

SoleilChenxu commented 1 year ago

Yes, I already provided one. It works for chromosomes with fewer variants, but not for chromosome 1, which has about 573,606 variants in my dataset. I received the error "address 0x7f5f61c77000, cause 'memory not mapped'". Do you have any idea about this? I have computed corr0 for each chromosome as a dgCMatrix.

privefl commented 1 year ago

?

SoleilChenxu commented 1 year ago

> ?

Sorry for the unclear description... I am talking about the LD matrix step, i.e. computing the corr variable that will later be used to calculate the PRS.

My problem is that even though I have added the backingfile to the code, generating the SFBM file with the as_SFBM() function still does not work. I guess it might be because I can only use one core on the server.

So I was wondering: is it possible to merge the 22 corr0 objects, which are in dgCMatrix format, in another way instead of using an SFBM? My second question is: can I convert the dgCMatrix of chr22, for example, to an SFBM, and then add_columns() for the other chromosomes? Does the chromosome order in the SFBM file matter?

Thanks.

privefl commented 1 year ago

No, I didn't implement anything other than the SFBM. And I think it would be very inefficient to use an in-memory sparse matrix, due to the parallelization.

But the question is really where you choose the backingfile to be. You should choose somewhere where I/O is fast, and then there should be no problem and it should be particularly fast (I think no more than 10 min for the 22 chromosomes). You can try first to convert chr22 only.
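(A minimal sketch of that, assuming `corr_list` is a hypothetical list holding the 22 per-chromosome dgCMatrix objects, and the backingfile path is a placeholder pointing to fast local storage:)

```r
library(bigsnpr)

backfile <- "/local/scratch/ldpred2/corr"   # placeholder: fast local storage

# build the SFBM for the first chromosome, then append the others;
# columns must be appended in the same variant order as in df_beta
for (chr in 1:22) {
  corr0 <- corr_list[[chr]]
  if (chr == 1) {
    corr <- as_SFBM(corr0, backfile, compact = TRUE)
  } else {
    corr$add_columns(corr0, nrow(corr))
  }
}
```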

SoleilChenxu commented 1 year ago

> No, I didn't implement anything other than the SFBM. And I think it would be very inefficient to use an in-memory sparse matrix, due to the parallelization.
>
> But the question is really where you choose the backingfile to be. You should choose somewhere where I/O is fast, and then there should be no problem and it should be particularly fast (I think no more than 10 min for the 22 chromosomes). You can try first to convert chr22 only.

Okay, thanks for your reply. I'll try chr22 then; that was exactly my other question.