Large files and method of parsing BAM

cerebis commented 9 years ago

BAM files based on real experimental data have shown that large files are parsed very inefficiently by the method currently employed.

As of now, sequences are processed in a specific order, which makes determining the offset within the contact matrix trivially easy (a running sum per outer loop iteration). Unfortunately, the pysam method fetch(seq_name) appears to require a lot of up-front I/O that scales with BAM file size, resulting in a significant delay for each invocation. With many reference sequences (possibly due to a fragmented WGS assembly) this becomes a huge penalty.

Therefore, we will require that a method be implemented which determines contact matrix offset from the predetermined sequence order and the fields available in the BAM file. This should not pose a problem.
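
A rough sketch of one way this could work, not the current code. It assumes fixed-width bins per sequence and paired-end reads whose mate fields are set. The helper names (build_offsets, matrix_index, iter_pairs, count_contacts) and the bin_size parameter are made up for illustration. The idea is to compute every sequence's offset once from the predetermined order, then make a single sequential pass over the BAM instead of calling fetch(seq_name) once per sequence.

```python
import math
from collections import Counter


def build_offsets(seq_order, seq_lengths, bin_size):
    """Return ({name: first bin index}, total bins) for the given sequence order."""
    offsets = {}
    total = 0
    for name in seq_order:
        offsets[name] = total
        # Each sequence gets enough bins to cover its full length.
        total += math.ceil(seq_lengths[name] / bin_size)
    return offsets, total


def matrix_index(offsets, ref_name, pos, bin_size):
    """Map a 0-based position on a reference to a row/column of the matrix."""
    return offsets[ref_name] + pos // bin_size


def iter_pairs(bam_path):
    """Yield (ref1, pos1, ref2, pos2) in one sequential pass over the file."""
    import pysam  # imported here so the helpers above work without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # until_eof=True reads the records in file order; no index is needed.
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped or read.mate_is_unmapped:
                continue
            if read.is_secondary or read.is_supplementary:
                continue
            if not read.is_read1:
                continue  # count each pair once, from read 1
            yield (read.reference_name, read.reference_start,
                   read.next_reference_name, read.next_reference_start)


def count_contacts(pairs, offsets, bin_size):
    """Accumulate pair counts keyed by (i, j) with i <= j (upper triangle)."""
    counts = Counter()
    for ref1, pos1, ref2, pos2 in pairs:
        if ref1 not in offsets or ref2 not in offsets:
            continue  # sequence not part of the requested order
        i = matrix_index(offsets, ref1, pos1, bin_size)
        j = matrix_index(offsets, ref2, pos2, bin_size)
        counts[(min(i, j), max(i, j))] += 1
    return counts
```

Usage would be something like count_contacts(iter_pairs("reads.bam"), offsets, bin_size), where the file name is a placeholder and offsets comes from build_offsets with lengths taken from the BAM header. Sparse counts are used here only to keep the sketch short; a dense array would do just as well.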

koadman commented 9 years ago

any value in BamM for this? https://github.com/ecogenomics/BamM

cerebis commented 9 years ago

It might be an easy fix. The thing is, I wrote the code in a way that made the problem easy at the time. There are a few "ass backward" approaches there when the data size scales upward. You found the memory issues too. :-)

koadman / proxigenomics

Large files and method of parsing BAM #35