gt1 / biobambam2

Tools for early stage alignment file processing

"Too many open files" #50

Open map2085 opened 7 years ago

map2085 commented 7 years ago

I am working with a very large dataset: the gzipped FASTQ file is 250 GB. I split it into ~1,200 smaller FASTQ files and aligned each of them with BWA using standard parameters.

Now I am trying to merge the 1,200 small BAM files (~350 MB each) with biobambam2.

Immediately upon calling biobambam2 bammerge, it fails with the error message "Too many open files".
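
"Too many open files" is the message for the per-process limit on open file descriptors being exceeded; the soft limit is often 1024 on Linux. A minimal sketch for checking whether a given number of inputs could fit under that limit follows. It assumes the merge holds one descriptor per input file at the same time, and the `reserved` margin of 16 is a hypothetical allowance for stdin/stdout/stderr and temporary files, not a value taken from biobambam2.

```python
import resource  # Unix-only standard library module


def fits_in_fd_limit(n_inputs, soft_limit, reserved=16):
    """Return True if n_inputs files plus a reserved margin fit under soft_limit.

    ASSUMPTION: one descriptor per input file, all open simultaneously.
    `reserved` is a hypothetical safety margin.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return n_inputs + reserved <= soft_limit


# Query the limits of the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
print("1200 inputs fit:", fits_in_fd_limit(1200, soft))
```

If the soft limit is below the hard limit, it can usually be raised for the current shell session (for example with `ulimit -n`) before starting the merge; whether that is enough depends on the hard limit set on the system.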

gt1 commented 7 years ago

Try using bamcat instead; it does not open all input files at the same time. If you want the output to be sorted, use:

bamcat level=0 in1.bam in2.bam ... | bamsort

map2085 commented 7 years ago

I understand. This workaround would be very inefficient, though: it would have to re-sort all of the data after concatenation, even though the input files were already sorted, right?

gt1 commented 7 years ago

You can test whether a multi-stage merge is faster, i.e. use bammerge to merge subsets, then merge the pre-merged files. bammerge currently has no support for doing multi-stage merges directly.
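
The multi-stage merge described above could be planned along these lines. This is only a sketch: `plan_merge`, `run_step`, the `stage` file prefix and the batch size of 100 are hypothetical names and values chosen for illustration, and the `bammerge I=<file>` invocation writing to stdout is an assumption that should be checked against the bammerge documentation before use.

```python
import subprocess


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_merge(inputs, batch_size, prefix="stage"):
    """Return merge steps as (input_files, output_file) pairs, in execution order.

    Each stage merges batches of at most batch_size files; the outputs of one
    stage become the inputs of the next until a single file remains.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    steps = []
    current = list(inputs)
    stage = 0
    while len(current) > 1:
        next_files = []
        for idx, batch in enumerate(chunked(current, batch_size)):
            if len(batch) == 1:
                # A lone leftover file needs no merge; carry it forward.
                next_files.append(batch[0])
                continue
            out = f"{prefix}{stage}_{idx:04d}.bam"
            steps.append((batch, out))
            next_files.append(out)
        current = next_files
        stage += 1
    return steps


def run_step(batch, out):
    """Run one merge step.

    ASSUMPTION: bammerge accepts repeated I=<file> arguments and writes the
    merged BAM to standard output.
    """
    cmd = ["bammerge"] + [f"I={path}" for path in batch]
    with open(out, "wb") as fh:
        subprocess.run(cmd, stdout=fh, check=True)


# Plan only; nothing is executed here.
inputs = [f"in{i}.bam" for i in range(1200)]
steps = plan_merge(inputs, batch_size=100)
print(len(steps), "merge steps; final output:", steps[-1][1])
```

With 1,200 inputs and a batch size of 100, the plan has 12 first-stage merges followed by one final merge of the 12 intermediate files, so no single step opens more than 100 inputs. Calling `run_step(batch, out)` for each planned step in order would execute the plan; steps within the same stage are independent and could be run in parallel.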

map2085 commented 7 years ago

Yeah, I have implemented the multi-stage merge workaround with intermediate files. It's not difficult, but it is cumbersome. I just wanted to post this here to alert everyone.

biobambam2 works great though!