Open cerebis opened 9 years ago
any value in BamM for this? https://github.com/ecogenomics/BamM
It might be an easy fix. The thing is, I wrote the code in a way that made the problem easy at the time. There are a few "ass backward" approaches there when the data size scales upward. You found the memory issues too. :-)
Tests with BAM files from real experimental data have shown that the current parsing method is very inefficient for large files.
Currently, sequences are processed in a fixed order, which makes determining the offset within the contact matrix trivial (summed per outer loop). Unfortunately, the pysam method
fetch(seq_name)
appears to require a lot of up-front I/O that scales with BAM file size, resulting in a significant delay for each invocation. With many reference sequences (for example, from a fragmented WGS assembly) this becomes a huge penalty. Therefore, we need a method that determines the contact matrix offset from the predetermined sequence order and the fields available in the BAM file. This should not pose a problem.
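A rough sketch of what that could look like, assuming the matrix is laid out as consecutive blocks of fixed-size bins, one block per sequence in the predetermined order. The names `build_offsets`, `matrix_index`, `count_contacts` and the `bin_size` parameter are hypothetical and not taken from the existing code. The offsets are computed once up front, so each read pair can be placed during a single pass over the file instead of one `fetch(seq_name)` call per sequence.

```python
from itertools import accumulate


def build_offsets(seq_order, seq_lengths, bin_size):
    """Return ({name: first bin index}, total number of bins).

    seq_order   -- sequence names in the predetermined order
    seq_lengths -- {name: length in bp}, e.g. taken from the BAM header
    bin_size    -- bin width in bp (assumed matrix layout, see lead-in)
    """
    # Ceiling division: a partial trailing bin still occupies one slot.
    bins_per_seq = [-(-seq_lengths[name] // bin_size) for name in seq_order]
    starts = [0] + list(accumulate(bins_per_seq))
    return dict(zip(seq_order, starts)), starts[-1]


def matrix_index(offsets, name, pos, bin_size):
    """Translate (sequence name, 0-based position) to a matrix row/column."""
    return offsets[name] + pos // bin_size


def count_contacts(pairs, offsets, bin_size):
    """Accumulate contacts from (name_a, pos_a, name_b, pos_b) tuples.

    With pysam these tuples could come from one pass over the file, e.g.
    reference_name / reference_start and next_reference_name /
    next_reference_start of each read (not done here to keep this runnable
    without a BAM file).
    """
    counts = {}
    for name_a, pos_a, name_b, pos_b in pairs:
        i = matrix_index(offsets, name_a, pos_a, bin_size)
        j = matrix_index(offsets, name_b, pos_b, bin_size)
        key = (min(i, j), max(i, j))  # store the upper triangle only
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Since lookups are a dict access plus an integer division, the per-read cost no longer depends on how many reference sequences there are or on the order in which reads appear in the file.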