ghm17 / LOGODetect

LOGODetect is a powerful tool to identify small segments that harbor local genetic correlation between two traits/diseases.

LOGODetect running time much longer than expected #26

Closed: yd357 closed this issue 4 months ago

yd357 commented 8 months ago

Hi expert,

Thank you for your wonderful tools!

I am Yikai, a student from New Haven, and I have been using your tool for cross-population analysis to detect locally correlated regions. However, I've encountered some challenges during the process.

Following the provided example data and codes, I attempted to run the analysis. However, it took considerably longer than anticipated. Here are the settings I used:

#SBATCH --mem=512G
#SBATCH -p week
#SBATCH -t 48:00:00
#SBATCH -c 20

module load R/4.3.0-foss-2020b

cd example
mkdir -p ./results/LOGODetect

Rscript ./X-Wing/LOGODetect.R \
  --sumstats ./data/sumstats/BMI_EUR.txt,./data/sumstats/BMI_EAS.txt \
  --n_gwas 359983,158284 \
  --ref_dir ./data/LOGODetect_1kg_ref \
  --pop EUR,EAS \
  --block_partition ./X-Wing/block_partition.txt \
  --gc_snp ./X-Wing/1kg_hm3_snp.txt \
  --out_dir ./results/LOGODetect \
  --n_cores 25 \
  --target_pop EAS \
  --n_topregion 1000

Based on the temporary output files, the code appears to have been executing correctly until it was cut off at the 48-hour mark. The last few files produced were "sd2_block112.txt", "sd1_block112.txt", "Qmax_block112.txt", "sd1_block94.txt", "Qmax_block94.txt", and "sd2_block94.txt", which suggests the run itself was proceeding smoothly, just slowly.

Could you please provide guidance on the recommended resources and time required for this step? It would be immensely helpful in optimizing the execution of the analysis.

Furthermore, I would appreciate clarification on a conceptual matter. Suppose we have a target population (EAS) and several auxiliary populations (e.g., EUR, AFR). In this case, should the final output of locally correlated regions for the target population be the union of the regions detected for each population pair? Thanks for your help!

ghm17 commented 8 months ago

Thanks for your interest in using our tool! In your experiment, how many blocks have already produced intermediate files (e.g., Qmax_block#.txt)? This gives a rough estimate of the computation time per block. Also, can you try reducing the number of cores from 25 to 15 or 10, so that multiple cores do not compete for memory, and see if the code runs faster?
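For example, something like the following (just a sketch; it assumes the intermediate files land directly in your --out_dir, ./results/LOGODetect, so adjust the path if they are written to a temporary subdirectory) will show how many blocks have finished and when each file was produced:

ls ./results/LOGODetect/Qmax_block*.txt | wc -l      # number of blocks finished so far
ls -lt ./results/LOGODetect/Qmax_block*.txt | head   # most recently finished blocks, with timestamps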

For your question about multiple auxiliary populations: we output locally correlated regions independently for each auxiliary population, i.e., EAS-EUR and EAS-AFR correlated regions in your example. The two sets of correlated regions are then used in the following PANTHER step to produce two sets of EAS-specific SNP effects, which are linearly combined in the final LEOPARD step.

yd357 commented 8 months ago

Thanks for your suggestions and explanation!

In my previous setting (512G + 20 cores, 48 hours), 7 Qmax blocks were produced (20, 75, 94, 112, 121, 158, 176). I will try 512G + 10 cores + 48 hours to see whether the example data can finish within that time.

Hope it works!

yd357 commented 8 months ago

Greetings, expert.

We attempted a 96-hour run of LOGODetect on the provided example dataset using 512 GB of memory and 10 cores, but it still did not complete.

My team is very interested in applying LOGODetect to large real datasets to benchmark its performance, and we would like to allocate sufficient resources while keeping the runtime manageable. I noticed that your work also applied this method to real datasets for performance comparisons. Do you have any approximate runtime estimates for real-data applications? Any details would be helpful!

Thank you!

ghm17 commented 8 months ago

It took about 16 hours on the example dataset using 20 computation cores. The computation time on other real datasets with a similar number of SNPs will be roughly the same. To reduce the computational burden, one option is to subset the input GWAS summary statistics to HapMap3 SNPs yourself. This reduces the number of SNPs to ~1M and identifies slightly fewer but consistent results, while taking less than 1 hour.
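One possible way to do the subsetting (a rough sketch only; it assumes the SNP list ./X-Wing/1kg_hm3_snp.txt has rsIDs in its first column and that your summary statistics also carry the SNP ID in the first column with a single header line, so adjust the column indices to your actual file layout):

awk 'NR==FNR {keep[$1]; next} FNR==1 || ($1 in keep)' \
  ./X-Wing/1kg_hm3_snp.txt \
  ./data/sumstats/BMI_EUR.txt > ./data/sumstats/BMI_EUR_hm3.txt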

yd357 commented 8 months ago

Sorry to bother you again about the cross-population run! I have refined my GWAS summary statistics to include only HapMap3 SNPs, retaining about 0.9 million SNPs. After running LOGODetect with 20 cores, processing speed improved, but the computation still took significantly longer than expected.

In 24 hours the process produced 98 Qmax blocks, which suggests a total runtime of roughly 2 days to complete all 185 blocks in the example partition. That is still far longer than 1 hour, and I am not sure what the problem might be; I am worried something is wrong in my processing. Here is the code I use to run the example:

#!/bin/bash
#SBATCH --job-name=LOGODetect
#SBATCH --output=LOGODetectoutput%j.log
#SBATCH --partition=scavenge
#SBATCH --requeue
#SBATCH --mail-type=ALL
#SBATCH --mem=256G
#SBATCH -p day
#SBATCH -t 24:00:00
#SBATCH -c 20

module load R/4.3.0-foss-2020b

sumstats_file1="BMI_EUR_hm3.txt"   # Change this to your desired file
sumstats_file2="BMI_EAS_hm3.txt"   # Change this to your desired file
n_gwas1="359983"                   # Number of GWAS for first file
n_gwas2="158284"                   # Number of GWAS for second file

cd example
mkdir -p ./results/LOGODetect

Rscript ./X-Wing/LOGODetect.R \
  --sumstats $sumstats_file1,$sumstats_file2 \
  --n_gwas $n_gwas1,$n_gwas2 \
  --ref_dir ./data/LOGODetect_1kg_ref \
  --pop EUR,EAS \
  --block_partition ./X-Wing/block_partition.txt \
  --gc_snp ./X-Wing/1kg_hm3_snp.txt \
  --out_dir ./results/LOGODetect/example \
  --n_cores 20

Could there be identifiable reasons for this extended runtime? Any insights or suggestions you could provide would be immensely helpful.

Additionally, I have a question regarding the "block_partition.txt" file used for genome partitioning. Is it adaptable for analyses involving different populations or traits, or is it imperative to generate a new, task-specific partition file for each analysis?

Thank you for your time and assistance!

ghm17 commented 8 months ago

Since the script has produced the intermediate files, I think your processing should be fine. Can you check the time points at which the Qmax files were generated? If they were generated exactly in order, i.e., Qmax_block1, Qmax_block2, ..., up to Qmax_block98 where you stopped, then your Slurm job did not run in parallel, and that would significantly increase the computation time.
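For instance, a quick check (a sketch, assuming the Qmax files sit under your --out_dir and that you are on a Linux system with GNU coreutils) is to list them in modification-time order; with parallel execution the block numbers should be interleaved rather than strictly increasing:

ls -tr ./results/LOGODetect/example/Qmax_block*.txt                  # oldest first
stat -c '%y %n' ./results/LOGODetect/example/Qmax_block*.txt | sort  # full timestamp per file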

yd357 commented 8 months ago

Thanks for the response!

Parallelization works fine, as the Qmax blocks were not generated in order. For example, in my previous run on the 5M-SNP example dataset (512G + 20 cores, 48 hours), 7 Qmax blocks were produced in the order 112, 20, 75, 176, 94, 121, 158, i.e., out of order. I have checked the Qmax output from time to time, so I am confident that is not what is causing the problem.

yd357 commented 8 months ago

Hi expert!

Do you have any clues about our current computation-time issue?

Additionally, I have a question regarding the "block_partition.txt" file used for genome partitioning. I noticed that in your paper, LDetect was applied to get the block partition. Will you be able to share the files with us? It will be a great help!

Also, another bug I encountered recently: when I tried to use the example output files from step 1 (LOGODetect) to run the second step, PANTHER, to fit the PRS, the following error occurred:

Traceback (most recent call last):
  File "./X-Wing/PANTHER.py", line 237, in <module>
    main()
  File "./X-Wing/PANTHER.py", line 185, in main
    sst_dict[pp] = munge_data.munge_sumstats(ref_dict, vld_dict, anno_dict[pp], param_dict['sumstats'][pp], param_dict['pop'][pp], param_dict['n_gwas'][pp])
UnboundLocalError: local variable 'ref_dict' referenced before assignment

Someone using PRS-CS encountered a similar error: https://github.com/getian107/PRScs/issues/33. But unlike their case, I do not think we have that problem, because I only used the example data and example results:

# Define variables
sumstats_file1="BMI_EUR.txt"   # Change this to your desired file
sumstats_file2="BMI_EAS.txt"   # Change this to your desired file
n_gwas1="359983"               # Number of GWAS for first file
n_gwas2="158284"               # Number of GWAS for second file
target_pop="EAS"               # Target population
ref_dir="PANTHER_1kg_ref"
anno_file="LOGODetect/annot_EUR.txt,annot_EAS.txt"
bim_prefix="example/data/test"

module load miniconda
conda activate X_Wing

cd example

mkdir ./results/PANTHER; mkdir ./results/PANTHER/post; mkdir ./results/PANTHER/post_collect

for chr in {1..22};do

python \
./X-Wing/PANTHER.py \
--ref_dir ./data/PANTHER_1kg_ref \
--bim_prefix $bim_prefix \
--sumstats $sumstats_file1,$sumstats_file2 \
--n_gwas $n_gwas1,$n_gwas2 \
--anno_file $anno_file \
--chrom ${chr} \
--pop EUR,EAS \
--target_pop $target_pop \
--pst_pop "EUR" \
--out_name BMI \
--seed 3 \
--out_dir ./results/PANTHER_test/post

done

conda deactivate

Could you please assist me with these issues? Your help would be really appreciated!

Yikai

jmiao24 commented 7 months ago

Hi Yikai,

Thanks for your interest in X-Wing!

If you change anno_file="LOGODetect/annot_EUR.txt,annot_EAS.txt" in your script to anno_file="LOGODetect/annot_EUR.txt,LOGODetect/annot_EAS.txt", does X-Wing still report the error?

Best, Jiacheng

yd357 commented 7 months ago

Thanks for noticing, Jiacheng.

I think that was a typo introduced when I copied and pasted my code to GitHub here. Sorry for the confusion. So I am sure that is not the cause of my problem. Maybe there is another bug in PANTHER that causes the UnboundLocalError.

Another update I can share: for LOGODetect restricted to HapMap3 sites (~1M SNPs), the example data finally finished after about 49 hours using 25 cores and 80 GB of memory.

Looking forward to your reply.