nf-core / eager

A fully reproducible and state-of-the-art ancient DNA analysis pipeline
https://nf-co.re/eager
MIT License

Large memory footprint in bedtools coverage process #824

Closed · TCLamnidis closed this issue 2 years ago

TCLamnidis commented 2 years ago

Description of the bug

Using bedtools coverage within eager currently requires a large amount of memory, which is inefficient and can be prohibitive. With an input BAM of 1.1 GB, the process needed ~30 GB(!) of memory to complete, causing multiple retries.

$ cat .command.sh
#!/bin/bash -euo pipefail
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -b PMR002.A0101.SG1_rmdup.bam | pigz -p 1 > "PMR002.A0101.SG1_rmdup".breadth.gz
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -b PMR002.A0101.SG1_rmdup.bam -mean | pigz -p 1 > "PMR002.A0101.SG1_rmdup".depth.gz

$ /bin/time -v bash .command.sh
    Command being timed: "bash .command.sh"
    User time (seconds): 157.60
    System time (seconds): 42.88
    Percent of CPU this job got: 102%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:14.69
[...]
    Maximum resident set size (kbytes): 30595900
[...]
    Exit status: 0

Expected behaviour

Get coverage calculations with a smaller memory footprint.

Additional context

It seems that passing a genome file (-g) and the -sorted flag to bedtools coverage can cut down the computational resources massively. With -sorted, bedtools switches to its memory-efficient "chromsweep" algorithm, which streams both sorted inputs instead of holding the -b file in memory.

$ cat .command.sh_edited
#!/bin/bash -euo pipefail
samtools view -H PMR002.A0101.SG1_rmdup.bam | grep '@SQ' | sed 's/@SQ\tSN:\|LN://g' > genome.txt
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -g genome.txt -sorted -b PMR002.A0101.SG1_rmdup.bam | pigz -p 1 > "PMR002.A0101.SG1_rmdup".breadth.new.gz
bedtools coverage -nonamecheck -a 1240K.pos.list_hs37d5.0based.bed -g genome.txt -sorted -b PMR002.A0101.SG1_rmdup.bam -mean | pigz -p 1 > "PMR002.A0101.SG1_rmdup".depth.new.gz

$ /bin/time -v bash .command.sh_edited
    Command being timed: "bash .command.sh_edited"
    User time (seconds): 80.94
    System time (seconds): 1.03
    Percent of CPU this job got: 110%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:14.05
[...]
    Maximum resident set size (kbytes): 23764
[...]
    Exit status: 0
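
As an aside, the sed call above assumes the SN and LN tags appear in that order on each @SQ line (they do in practice, though I don't believe the SAM spec strictly requires it). A tag-order-independent way to build the same genome file would be something like this untested sketch:

$ samtools view -H PMR002.A0101.SG1_rmdup.bam \
    | awk -F'\t' '$1 == "@SQ" {
        sn = ln = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) sn = substr($i, 4)   # strip "SN:" prefix
            if ($i ~ /^LN:/) ln = substr($i, 4)   # strip "LN:" prefix
        }
        print sn "\t" ln
    }' > genome.txt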

In my limited test set, the output file contents are identical, though the gzipped versions of them have different checksums. The latter is expected, since the gzip header embeds a modification time.

$ md5sum PMR002.A0101.SG1_rmdup.{dept,bread}*[!gz]
b1958389b5520a5f31e084eca5458610  PMR002.A0101.SG1_rmdup.depth
b1958389b5520a5f31e084eca5458610  PMR002.A0101.SG1_rmdup.depth.new
32c1422d4cd05f84616057c736105b6b  PMR002.A0101.SG1_rmdup.breadth
32c1422d4cd05f84616057c736105b6b  PMR002.A0101.SG1_rmdup.breadth.new

$ md5sum PMR002.A0101.SG1_rmdup.{dept,bread}*gz
707a260525c99e11e7711f6d14488e68  PMR002.A0101.SG1_rmdup.depth.gz
0b3ebac66221ce100ab55fab1208f7ea  PMR002.A0101.SG1_rmdup.depth.new.gz
87f8c0042a20e79600d11356bfd4d811  PMR002.A0101.SG1_rmdup.breadth.gz
e75df82f6254968a1d09f8a902ab1b64  PMR002.A0101.SG1_rmdup.breadth.new.gz
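
To double-check that the compressed outputs really carry identical content despite the differing archive checksums, one can checksum the decompressed streams directly; a quick sketch using the same file names:

$ for f in PMR002.A0101.SG1_rmdup.{depth,breadth}{,.new}.gz; do zcat "$f" | md5sum; done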

The genome file is generated on the spot, and I believe that within eager the BAM files that make it to this process will always be coordinate-sorted, so this seems like an obvious optimisation step.
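
If we wanted to be defensive about the sortedness assumption, the sort order is recorded in the @HD line of the BAM header, so the process could guard on SO:coordinate before adding -sorted. A minimal sketch (hypothetical, not something eager currently does):

$ samtools view -H PMR002.A0101.SG1_rmdup.bam | grep -q 'SO:coordinate' && echo "coordinate-sorted"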

Is there a reason to avoid implementing this?

TCLamnidis commented 2 years ago

@aidaanva I believe this was originally implemented in eager for pathogenomics work, right? Do you think adding these flags might cause issues for such applications?

jfy133 commented 2 years ago

@aidaanva is sitting next to me and says she sees no reason why it would affect anything on the pathogen side either.