jpuritz / dDocent

a bash pipeline for RAD sequencing
ddocent.com
MIT License

Use bedops instead of bedtools merge for memory efficiency #68

Closed: ne1s0n closed 3 years ago

ne1s0n commented 3 years ago

I found that the command

bedtools merge -i cat-RRG.bam -bed > mapped.bed

uses a lot of memory and often gets the pipeline killed. An efficient alternative is to use the bedops suite, and in particular substitute the previous code with:

bedtools bamtobed -i cat-RRG.bam > cat-RRG.bed
bedops --merge cat-RRG.bed > mapped.bed

This approach uses a trivial amount of memory. It requires the extra bam -> bed conversion, but it allowed me to run the pipeline on a system where it previously could not run. Since bedops is an extra tool, the installation instructions would need an update. Fortunately bedops is in conda, so

conda install bedops

is enough. The problematic command is present in two places in the dDocent code:

https://github.com/jpuritz/dDocent/blob/9718247b7f533a71057787d77c5232b6b97065c5/dDocent#L407
https://github.com/jpuritz/dDocent/blob/9718247b7f533a71057787d77c5232b6b97065c5/dDocent#L1197
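
Since the same command appears in two places, one option is a small shell function that both call sites could share. The sketch below is only an illustration: `merge_mapped_bed` is a hypothetical name, not something in dDocent, and the fallback to the original command when bedops is missing is an assumption about how one might keep the pipeline working on existing installs. Note that bedops expects its input sorted the way its own sort-bed tool sorts; no sort step is added here because the thread reports identical output without one.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dDocent): build a merged BED file from a BAM.
# Usage: merge_mapped_bed cat-RRG.bam mapped.bed
merge_mapped_bed() {
    local bam="$1" out="$2"
    local tmp_bed="${bam%.bam}.bed"   # e.g. cat-RRG.bam -> cat-RRG.bed

    if command -v bedops >/dev/null 2>&1; then
        # Low-memory route from this issue: convert to BED, then merge with bedops.
        bedtools bamtobed -i "$bam" > "$tmp_bed" || return 1
        bedops --merge "$tmp_bed" > "$out"
    else
        # Fallback (assumption): the original command, which can need a lot of RAM.
        bedtools merge -i "$bam" -bed > "$out"
    fi
}
```

The intermediate BED file is left on disk, as in the two-step command above; it can be large, so removing it afterwards may be worthwhile.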

ne1s0n commented 3 years ago

I wanted to add that I've tested the command and the resulting mapped.bed file is identical to the one obtained via bedtools.
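
For anyone wanting to repeat that check, a byte-for-byte comparison is probably enough. This is a minimal sketch; `same_bed` is a hypothetical helper name, and the file names in the usage comment are placeholders for the two outputs.

```shell
# Hypothetical check: succeed only if two BED files are byte-for-byte identical.
same_bed() {
    cmp -s "$1" "$2"
}

# Usage (file names are placeholders):
#   same_bed mapped.bedtools.bed mapped.bedops.bed && echo identical || echo different
```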

jpuritz commented 3 years ago

Thanks for this. Would you confirm if this is still true with the newest version of bedtools? Also, if you could provide just a couple of benchmarks that would be useful.

Thanks!
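
One way to collect such benchmarks is to run each command under GNU time and read the peak memory from its verbose report. The sketch below assumes GNU time is installed at /usr/bin/time (the shell's built-in `time` does not report memory this way); `bench` and `peak_rss_kb` are hypothetical helper names.

```shell
# Run a command under GNU time, writing the verbose report to a log file.
# Assumes GNU time at /usr/bin/time; adjust the path if it lives elsewhere.
bench() {
    local log="$1"; shift
    /usr/bin/time -v -o "$log" "$@"
}

# Print the peak resident set size (in kB) recorded in a GNU time -v log.
peak_rss_kb() {
    awk -F': ' '/Maximum resident set size/ { print $2 }' "$1"
}

# Usage:
#   bench merge.log bedops --merge cat-RRG.bed > mapped.bed
#   peak_rss_kb merge.log
```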

pdimens commented 3 years ago

@jpuritz @ne1s0n I can try to benchmark this over the weekend

pdimens commented 3 years ago

@jpuritz I'll still run the benchmarks, but here are the benchmarks provided in the bedops docs comparing bedtools and bedops.

pdimens commented 3 years ago

@jpuritz Here is a real-world benchmark:

bedtools merge

I had to cut this run off early because it was using 100% of RAM and 100% of swap.

> bedtools merge -i cat-RRG.bam -bed > tmp.mapped.bed
3671.80s user
1484.30s system
2:06:45.18 total time elapsed
67% cpu
254330 kb memory
141880280 file input operations
31296 file output operations

bamtobed + bedops

The conversion to bed

> time bedtools bamtobed -i cat-RRG.bam > cat-RRG.bed

4443.79s user
241.44s system
1:35:28.61 total time elapsed
81% cpu
47 kb memory
161421928 file input operations
313462296 file output operations

The merge

> time bedops --merge cat-RRG.bed > mapped.bed

742.32s user
76.32s system
15:33.65 total time elapsed
87% cpu
5 kb memory
214871480 file input operations
35216 file output operations

So by comparison:

tool        time       peak ram
bedtools    2hrs+      254gigs +
bedops      1hr 50m    52kb
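
The bedops time in the comparison is the sum of the two elapsed times reported above (1:35:28.61 for bamtobed plus 15:33.65 for the merge). A small sketch of that arithmetic, with hypothetical helper names `to_seconds` and `sum_elapsed`:

```shell
# Convert an elapsed time of the form [h:]mm:ss[.ff] to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

# Add up any number of elapsed times and print the total as h:mm:ss.ff.
sum_elapsed() {
    local t total=0
    for t in "$@"; do
        total=$(awk -v a="$total" -v b="$(to_seconds "$t")" 'BEGIN { printf "%.2f", a + b }')
    done
    awk -v s="$total" 'BEGIN { printf "%d:%02d:%05.2f\n", int(s / 3600), int((s % 3600) / 60), s % 60 }'
}

# Usage:
#   sum_elapsed 1:35:28.61 15:33.65
```

With the two times from this thread the total comes to about 1 hour 51 minutes, in line with the "1hr 50m" figure.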