High memory usage for bamsormadup on inputs with many reference contigs

chapmanb commented 8 years ago

German, we're using bamcat/bamsormadup within bcbio to merge 27 input BAMs with:

bamcat level=0 `cat files.txt` | bamsormadup threads=8

We expect a ~130Gb output file from the merge.

The merge process has very high memory usage, filling up a machine with 330Gb of memory:

[V] 442783933   03:37:23:67015100 MemUsage(size=279998,rss=198773,peak=280062)
AutoArrayMemUsage(memusage=32544.6,peakmemusage=33442.9,maxmem=1.75922e+13)
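For tracking how memory grows over a long run, the `MemUsage(...)` lines can be pulled out of the saved stderr log. This is only a minimal sketch: the helper name `max_peak` is hypothetical, it assumes the log format matches the line shown above, and it reports the `peak=` value in whatever units the tool prints.

```shell
# Hypothetical helper: read a log on stdin and print the largest
# peak= value found in MemUsage(size=...,rss=...,peak=...) entries.
# Assumes the log format matches the line quoted above.
max_peak() {
  grep -o 'MemUsage(size=[^)]*)' \
    | sed -n 's/.*peak=\([0-9][0-9]*\).*/\1/p' \
    | sort -n \
    | tail -n 1
}
```

It could be used as `max_peak < bamsormadup.stderr.log`, where the log file name is an assumption.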

We've been trying to understand why we have such poor memory usage for this analysis and wonder if it has to do with the number of reference contigs. This is run on the monkey genome with ~7500 contigs. Would that have any relationship to memory usage? Any other tips as to why we see such high memory usage for this merge? Thanks much for any suggestions or thoughts.
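To confirm the contig count for a given input, the `@SQ` lines in the BAM header can be counted. A rough sketch, assuming samtools is installed (it is not part of the pipeline described above); the helper name `count_contigs` and the file name `input.bam` are hypothetical.

```shell
# Hypothetical helper: count reference sequences (@SQ header lines)
# in SAM header text supplied on stdin.
count_contigs() {
  grep -c '^@SQ'
}

# Usage, assuming samtools is available and input.bam is one of the inputs:
#   samtools view -H input.bam | count_contigs
```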

gt1 commented 8 years ago

Hello Brad,

So far I have no idea where the memory could be going. Would it be possible to run the debug version (binaries at https://github.com/gt1/biobambam2/releases/download/2.0.57-release-20160918185932/biobambam2-2.0.57-release-20160918185932-x86_64-etch-linux-gnu-debug.tar.gz ) using the command line

LIBMAUS2_AUTOARRAY_AUTOARRAYMAXMEM=4g bamcat level=0 `cat files.txt` >/dev/null

This should produce a lot of messages on standard error detailing where and how much memory is allocated during the bamcat run.
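Since the debug messages go to standard error, it may help to save them to a file so they can be attached here. A small sketch of one way to do that; the wrapper name `run_with_stderr_log` and the log file name are hypothetical, and only the bamcat invocation itself comes from the command above.

```shell
# Hypothetical wrapper: run a command, discard its stdout, and save
# its stderr to the log file given as the first argument.
run_with_stderr_log() {
  log_file=$1
  shift
  "$@" >/dev/null 2>"$log_file"
}

# Usage with the debug binaries (log file name is an assumption):
#   run_with_stderr_log bamcat_debug.log \
#     env LIBMAUS2_AUTOARRAY_AUTOARRAYMAXMEM=4g bamcat level=0 $(cat files.txt)
```

Passing the variable through `env` keeps the setting scoped to the bamcat process rather than the calling shell.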

mjafin commented 8 years ago

Hi @gt1, cheers for the quick reply. The colleague running this analysis just nuked the work folder and went for aligning each sample in one go (instead of splitting and merging). It'll probably take a few days, but once that's done I can continue debugging this.

We're using the Macaca fascicularis mfa5 genome, which has a large number of small contigs.
