Open replikation opened 2 years ago
Hello, this is actually a complex issue, so I will elaborate a bit.
First, I want to point out that multiple samples are processed concurrently. I assume the test you ran used a single sample, which is why it didn't use all 24 available cores.
Second, Mutect2 in particular is not multithreaded. The usual way to parallelize Mutect2 is to split the genome into multiple parts and run Mutect2 on each part concurrently. This is straightforward for e.g. the human genome, where each chromosome can be run separately, but the SARS-CoV-2 genome is a single contig. We could split this contig into multiple intervals, but this could result in calling issues around the edges between one interval and the next.
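To make the interval idea concrete, here is a minimal Python sketch of how a single contig could be cut into overlapping pieces. This is not something the pipeline does today: the function name `split_contig` and the `pad` parameter are hypothetical, and padding each piece is only one assumed way to soften the edge effects mentioned above (calls in the overlapping regions would still have to be de-duplicated when the per-interval results are merged). The intervals are written as 1-based, inclusive `contig:start-end` strings, the form GATK accepts for `-L`.

```python
def split_contig(contig, length, n_parts, pad=0):
    """Split positions 1..length of `contig` into n_parts intervals.

    Each interval is extended by `pad` bases on both sides (clipped to the
    contig boundaries) so that neighbouring intervals overlap.
    Returns strings of the form 'contig:start-end' (1-based, inclusive).
    """
    if n_parts < 1 or length < n_parts:
        raise ValueError("n_parts must be between 1 and length")
    if pad < 0:
        raise ValueError("pad must be non-negative")

    # Spread the remainder over the first `extra` intervals.
    base, extra = divmod(length, n_parts)
    intervals = []
    start = 1
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        padded_start = max(1, start - pad)
        padded_end = min(length, end + pad)
        intervals.append(f"{contig}:{padded_start}-{padded_end}")
        start = end + 1
    return intervals


if __name__ == "__main__":
    # Assumption: the reference is NC_045512.2 (29,903 bp); 4 parts and a
    # 100 bp overlap are arbitrary illustration values.
    for interval in split_contig("NC_045512.2", 29903, n_parts=4, pad=100):
        print(interval)
```

Each printed interval could then be passed to a separate Mutect2 job, but whether the calls near the boundaries stay reliable would need to be checked.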
Finally, there is a different way to speed up the process: down-sampling the reads. Currently the pipeline has this option set for the Mutect2 call: `--max-reads-per-alignment-start 200`. Reducing this number will speed up Mutect2 at the cost of less sensitive calls. How much less sensitive? That remains to be investigated, but if you're curious I would recommend halving this value and comparing with the previous results.
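If you do try the comparison, a rough way to see what changed is to diff the variant sites of the two runs. The sketch below is only an assumption about how one might do that, not part of the pipeline: the helper names `variant_keys` and `compare_calls` are made up, it reads plain uncompressed VCF text, and it compares only CHROM, POS, REF and ALT, ignoring allele frequencies and filters.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) tuples from VCF text lines."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines and header lines ('##' metadata and '#CHROM').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic records list several ALT alleles separated by commas.
        for alt in alts.split(","):
            keys.add((chrom, pos, ref, alt))
    return keys


def compare_calls(baseline_lines, downsampled_lines):
    """Compare two call sets; `lost` = in the baseline only, `gained` = new."""
    base = variant_keys(baseline_lines)
    down = variant_keys(downsampled_lines)
    return {
        "shared": base & down,
        "lost": base - down,
        "gained": down - base,
    }


if __name__ == "__main__":
    # Hypothetical file names: one run with the current value (200) and one
    # with the halved value (100).
    with open("calls_200.vcf") as a, open("calls_100.vcf") as b:
        result = compare_calls(a, b)
    for name, sites in result.items():
        print(name, len(sites))
```

Variants that end up in `lost` would be the ones to inspect first, since they indicate where the lower read cap reduced sensitivity.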
Hi,
thanks for developing this. I would really like to use this tool, but it takes quite some time, which might be due to missing thread handling. Is it possible to use multithreading in the rule `pool_mutect`? This rule currently runs for approx. 3 h, and the whole workflow takes 3:20 h. I noticed that only one core was assigned to this rule. Also, it seems that multithreading is not optimally set up, as it is not using all the provided cores: e.g., why are the rules scaling down to 8?
thank you