Closed TCLamnidis closed 2 years ago
@aidaanva I believe this was originally implemented in eager for pathogenomics work, right? Do you think adding these flags might cause issues for such applications?
@aidaanva is sitting next to me and says she sees no reason why it would affect anything on the pathogen side either.
Description of the bug
Using bedtools coverage within eager currently reserves a large amount of memory, which can be inefficient and prohibitive. With an input file of 1.1 GB, the process required 30 GB(!) of memory to complete, causing multiple retries.
Expected behaviour
Get coverage calculations with a smaller memory footprint.
Additional context
It seems that passing a genome file (-g) and the -sorted flag to bedtools coverage can cut down on the computational resources massively.
In my limited test set, the output file contents are identical, though the gzipped versions of them have different checksums.
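One likely explanation for the differing checksums (an assumption, not something verified on the files above): a gzip header records a modification time, so two archives with identical contents can still hash differently. A minimal Python sketch for comparing the decompressed contents instead; the function names and the sample row are illustrative and not part of eager:

```python
import gzip
import hashlib
import io


def gzip_bytes(payload: bytes, mtime: int) -> bytes:
    """Compress payload with an explicit timestamp in the gzip header."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as handle:
        handle.write(payload)
    return buf.getvalue()


def content_digest(compressed: bytes) -> str:
    """Hash the decompressed payload, ignoring gzip header metadata."""
    return hashlib.sha256(gzip.decompress(compressed)).hexdigest()


# Hypothetical coverage row, compressed twice with different timestamps.
rows = b"chr1\t0\t100\t5\t80\t100\t0.8\n"
first = gzip_bytes(rows, mtime=1)
second = gzip_bytes(rows, mtime=2)

# The archives differ byte-for-byte, but their contents hash identically.
print(first == second)
print(content_digest(first) == content_digest(second))
```

If the content digests of the old and new outputs match, the checksum mismatch comes from the compression step rather than from bedtools.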
The genome files are generated on the spot, and I believe the BAM files that reach this process within eager will always be sorted, so it seems like an obvious optimisation step.
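As a rough sketch of what this could look like, assuming the genome file is derived from the @SQ lines of the BAM header (how eager actually generates it is not shown here), the Python below builds the two-column "name, length" genome file and the corresponding bedtools coverage argument list. Both helper names and the file names are hypothetical:

```python
from __future__ import annotations


def genome_lines_from_sam_header(header: str) -> list[str]:
    """Turn @SQ header lines into 'name<TAB>length' genome-file lines."""
    lines = []
    for record in header.splitlines():
        if not record.startswith("@SQ"):
            continue
        # Each field after the record type has the form TAG:VALUE.
        fields = dict(f.split(":", 1) for f in record.split("\t")[1:])
        lines.append(f"{fields['SN']}\t{fields['LN']}")
    return lines


def coverage_command(annotation: str, bam: str, genome: str) -> list[str]:
    """Argument list for bedtools coverage using the sorted-input algorithm."""
    return ["bedtools", "coverage", "-a", annotation, "-b", bam,
            "-sorted", "-g", genome]


# Example header text, e.g. as printed by `samtools view -H sample.bam`.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chrM\tLN:16569\n"
)
genome = genome_lines_from_sam_header(header)
command = coverage_command("features.bed", "sample.bam", "sample.genome")
```

Note that -sorted expects both inputs to be position-sorted with the same chromosome order, and -g is what tells bedtools that order. The annotation file would therefore need to follow the BAM's ordering as well, which is worth checking before relying on this.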
Is there a reason to avoid implementing this?