kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License

Kevlar novel multithreading #384

Open ghost opened 4 years ago

ghost commented 4 years ago

Hello, I launched kevlar as follows:

kevlar novel --out output.augfastq --case ../H5A3.cleanup.reads.fa --control ../ARC_ancestor.cleanup.reads.fa -t 48 --max-fpr 1.2 -M 10G

It has been running for 10 hours now, but I never see more than 1 thread in use. Is this normal? Is this step not multithreaded? Thank you :)

[kevlar::novel] Case samples loaded in 2203.54 sec
[kevlar::novel] All samples loaded in 4405.89 sec
[kevlar::novel] Iterating over reads from 1 case sample(s)
[kevlar::novel]     processed 1000000 reads (149.16 seconds elapsed)
[kevlar::novel]     processed 2000000 reads (155.06 seconds elapsed)
[kevlar::novel]     processed 3000000 reads (161.09 seconds elapsed)
[kevlar::novel]     processed 4000000 reads (167.57 seconds elapsed)
[kevlar::novel]     processed 5000000 reads (173.63 seconds elapsed)
[kevlar::novel]     processed 6000000 reads (377.72 seconds elapsed)
[kevlar::novel]     processed 7000000 reads (794.66 seconds elapsed)
[kevlar::novel]     processed 8000000 reads (1272.67 seconds elapsed)
[kevlar::novel]     processed 9000000 reads (1741.69 seconds elapsed)
[kevlar::novel]     processed 10000000 reads (2070.23 seconds elapsed)
[kevlar::novel]     processed 20000000 reads (7007.82 seconds elapsed)
[kevlar::novel]     processed 30000000 reads (13155.10 seconds elapsed)
[kevlar::novel]     processed 40000000 reads (19243.20 seconds elapsed)
[kevlar::novel]     processed 50000000 reads (24312.45 seconds elapsed)
[kevlar::novel]     processed 60000000 reads (28102.23 seconds elapsed)
[kevlar::novel]     processed 70000000 reads (34284.96 seconds elapsed)
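
To put numbers on the slowdown visible in the log above, the checkpoint lines can be turned into per-interval throughput. The following is a minimal sketch, not part of kevlar: the function names `checkpoints` and `interval_rates` are hypothetical, and the regular expression assumes the exact `processed N reads (S seconds elapsed)` format shown in this log. Only four of the logged checkpoints are included to keep it short.

```python
import re

# A few checkpoint lines copied from the log above.
LOG = """\
[kevlar::novel]     processed 1000000 reads (149.16 seconds elapsed)
[kevlar::novel]     processed 5000000 reads (173.63 seconds elapsed)
[kevlar::novel]     processed 10000000 reads (2070.23 seconds elapsed)
[kevlar::novel]     processed 70000000 reads (34284.96 seconds elapsed)
"""

# Assumes the progress-line format seen in this thread; other kevlar
# versions may log differently.
PATTERN = re.compile(r"processed (\d+) reads \(([\d.]+) seconds elapsed\)")


def checkpoints(log_text):
    """Return a list of (reads_processed, seconds_elapsed) tuples."""
    return [(int(n), float(s)) for n, s in PATTERN.findall(log_text)]


def interval_rates(points):
    """Return (reads_at_end_of_interval, reads_per_second) per interval."""
    rates = []
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        rates.append((n1, (n1 - n0) / (t1 - t0)))
    return rates


for reads, rate in interval_rates(checkpoints(LOG)):
    print(f"up to {reads:>11,} reads: {rate:10.0f} reads/sec")
```

On these checkpoints, throughput falls by roughly two orders of magnitude after the first 5 million reads, which is why the run is far slower than the first few progress lines suggest.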

standage commented 4 years ago

Hi @aderzelle! The initial k-mer counting steps should be multithreaded, but the main procedure to identify novel k-mers is not multithreaded.

mpinese commented 3 years ago

Hi @standage, thanks for a great tool, which we've been working to integrate into our human disease trio work. We've hit an issue with the kevlar novel step, though, which I think is similar to the issue @aderzelle raised. Basically, for some samples kevlar novel takes > 48h to complete. This is our HPC walltime limit, so effectively we can't process these samples with kevlar right now. This is for ~40X human WGS.

Is there some way to parallelise the novel step, perhaps by splitting the case k-mer input? Alternatively, can we tweak the config to improve speed without too much effect on sensitivity? Any ideas you have would be much appreciated, thanks!