When combining a large number of samples, the speed is very slow

zhongleishi commented 4 months ago

Hi, thank you for releasing such a nice tool. I have VCF files for 302 samples (30x coverage). When I use dysgu merge, it runs very slowly whether or not I split the input by chromosome. Is there a good way to speed it up? Looking forward to your reply.

kcleal commented 4 months ago

Hi @zhongleishi, would you mind sharing your merge command? I recommend setting the --procs option, which will partition the job and should make it a bit quicker.

Sometimes very high coverage regions can cause problems during merging, for example ALT chromosomes. It might be helpful to remove some of the very low quality variants before merging, for example using dysgu filter and setting --min-prob to 0.2.
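The pre-filtering step suggested above could be scripted roughly as follows. This is only a sketch: the function name filter_all, the DYSGU variable, and the directory names are illustrative, and it assumes that dysgu filter writes the filtered VCF to standard output.

```shell
#!/bin/sh
# Sketch: run `dysgu filter --min-prob 0.2` on every per-sample VCF before merging.
# DYSGU, filter_all and the directory names are hypothetical helpers, not part of dysgu.
DYSGU=${DYSGU:-dysgu}

filter_all() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for vcf in "$in_dir"/*.vcf; do
        [ -e "$vcf" ] || continue    # skip if the glob matched nothing
        # Assumption: the filtered VCF is written to stdout.
        "$DYSGU" filter --min-prob 0.2 "$vcf" > "$out_dir/$(basename "$vcf")"
    done
}

# Usage (directory names are placeholders):
#   filter_all raw_vcfs filtered_vcfs
```

The filtered files in the output directory can then be passed to dysgu merge in place of the originals.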

zhongleishi commented 4 months ago

Thanks for the reply. I tried two approaches. Beforehand, I filtered the files using the three filtering methods provided in the tutorial. [screenshot: WeChat screenshot 20240223171004]

dysgu merge -p24 --input-list list > merged.vcf
dysgu merge -v 2 *.vcf > merged.vcf
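For reference, the file passed to --input-list could be generated as sketched below. The helper name make_input_list is hypothetical, and the sketch assumes the list format is one VCF path per line.

```shell
#!/bin/sh
# Sketch: write the paths of all VCFs in a directory to a list file.
# make_input_list is a hypothetical helper; the one-path-per-line format is an assumption.
make_input_list() {
    vcf_dir=$1
    list_file=$2
    : > "$list_file"                 # create or truncate the list
    for vcf in "$vcf_dir"/*.vcf; do
        [ -e "$vcf" ] || continue    # skip if the glob matched nothing
        printf '%s\n' "$vcf" >> "$list_file"
    done
}

# Usage (directory name is a placeholder):
#   make_input_list filtered_vcfs list
#   dysgu merge -p24 --input-list list > merged.vcf
```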

kcleal commented 4 months ago

Would you mind sharing how long the merge took to run? I am planning some work on making the merging more efficient, so it would be useful for me to know. Thanks

zhongleishi commented 4 months ago

No problem. I just accidentally submitted a duplicate of this issue; you don't need to reply to it. The program has still not finished; it has been running for two hours.

kcleal commented 4 months ago

Hi @zhongleishi, did the merge complete successfully? I'm planning on doing some work on the merging pipeline next week to make it scale better. I can try to address any issues you are having.

zhongleishi commented 4 months ago

Unfortunately, it still has not finished. I tried dysgu merge -p24 --input-list samples.txt --wd wd > combined.vcf. The command produces a VCF file per chromosome and SV type in the wd folder, but it still fails to produce the combined file; after it had been running for about 20 hours, I found that the process had been killed (#41). I am currently trying to merge with jasmine and then genotype with dysgu. I hope I can get the result I want.

kcleal commented 4 months ago

In the working directory, the merge is partitioned into chromosome and SVTYPE mini jobs. For example, all VCFs with chr1 and DEL are merged together. The mini jobs are concatenated at the end. It is possible to run these batches separately on the command line, which might allow you to see whether a problem chromosome is causing the runtime issue. I can add some extra logging to the output (tomorrow hopefully); this should be useful for diagnosing some issues.

zhongleishi commented 4 months ago

Yes, you are right: the working directory is divided into many categories. However, only inversions have a population-level merged file for each chromosome, and the same goes for translocations; dup_ins and del only have the per-chromosome VCF files split out for each individual. The final combined file is also empty. [screenshots: WeChat screenshot 20240229214143, WeChat screenshot 20240229214202]

kcleal commented 4 months ago

Hi @zhongleishi, I have added an option to dysgu merge that will show you detailed information about the progress of the pipeline. Adding the --progress flag might help further identify the problem. To use the new version of dysgu, go here: https://github.com/kcleal/dysgu/actions/runs/8096049009 then download the artifact, unzip it, and use pip install to install the appropriate version. There should be wheel files for python >=3.8 and MacOS/Linux.

It looks like DEL-style events are probably taking up most of the run time. The output log should tell you whether this is due to one problem chromosome or not. If you could share the output log file, that would be very helpful. By the way, individual jobs in the working directory can be tested using dysgu merge, for example:

cd working_directory
dysgu merge *~DEL_chr1.vcf > chr1.merge.vcf

Thank you for your time!
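To look for a single slow chromosome, the per-job test above could be repeated over several chromosomes with a simple timer. This is a rough sketch: time_minijobs and the DYSGU variable are hypothetical names, the file pattern is copied from the example above, and the chromosome names are placeholders that must match those in the working directory.

```shell
#!/bin/sh
# Sketch: run each chromosome's mini job separately and report wall-clock seconds.
# time_minijobs and DYSGU are hypothetical helpers, not part of dysgu.
DYSGU=${DYSGU:-dysgu}

time_minijobs() {
    svtype=$1                        # e.g. DEL
    shift
    for chrom in "$@"; do
        start=$(date +%s)
        # Same pattern as the example above: *~DEL_chr1.vcf
        "$DYSGU" merge ./*~"${svtype}_${chrom}".vcf > "${chrom}.${svtype}.merge.vcf"
        end=$(date +%s)
        echo "${svtype} ${chrom}: $((end - start)) s"
    done
}

# Usage:
#   cd working_directory
#   time_minijobs DEL chr1 chr2 chr3
```

A chromosome whose job takes far longer than the others would be the first place to look.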

kcleal commented 3 months ago

v1.6.3 has been released and should be more efficient for merging large cohorts. I will close this for now, but please reopen if you are having issues.

zhongleishi commented 3 months ago

I am very sorry for the late reply; I was busy completing a project. I tried dysgu 1.6.3 and the merge succeeded, although I ran it per chromosome. I have one question: I tried three different commands, and their outputs and run times are inconsistent:

dysgu merge -p24 --input-list listChr01 --wd Chr01 > Chr01_combined.vcf
dysgu merge *.split.Chr01.vcf.gz --wd Chr01_2 > Chr01_combined2.vcf
dysgu merge *.Chr01.vcf.gz > Chr01_combined3.vcf

They take 7 hours, 10 hours, and 10 hours respectively, and yield 60868, 60869, and 61075 SVs. Which command do you recommend? Thanks again for your help.
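SV counts like these can be checked by counting the non-header lines in each merged VCF. A minimal sketch, assuming each non-header line is one SV record; count_records is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count variant records (lines not starting with '#') in a VCF or gzipped VCF.
# count_records is a hypothetical helper, not part of dysgu.
count_records() {
    case $1 in
        *.gz) gzip -dc "$1" ;;       # decompress gzipped input to stdout
        *)    cat "$1" ;;
    esac | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Usage:
#   for f in Chr01_combined.vcf Chr01_combined2.vcf Chr01_combined3.vcf; do
#       printf '%s\t%s\n' "$f" "$(count_records "$f")"
#   done
```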

kcleal commented 3 months ago

Hi @zhongleishi, I would recommend the first two approaches, although they should produce merges of comparable quality. Differences mostly occur in low-quality, ambiguous regions. I'm surprised that the merge took as long as it did. I will have to run some more experiments to see if these runtime issues can be avoided. For example, merging 150 identical samples on my machine takes around 15 mins using 12 cores.

zhongleishi commented 3 months ago

Thank you very much for your continued attention and help on this issue. The problem with the merge has now been resolved.

kcleal / dysgu

When combining a large number of samples, the speed is very slow #82