joshuak94 / BAMIntervalTree

BSD 3-Clause "New" or "Revised" License

Reduce memory consumption when constructing the tree. #23

Closed by joshuak94 3 years ago

joshuak94 commented 3 years ago

Currently, we store start (32 bits), end (32 bits) and file offset (64 bits) for each read, so 128 bits (16 bytes) per read. For larger files (e.g. HG002.hs37d5.2x250_sorted.bam from GIAB, which has 837,504,748 reads), that comes to ~107 Gbit, or about 13.4 GB of RAM, just to store the list of records.
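A quick check of the arithmetic, as a sketch only: it assumes the record is a plain struct with exactly these three fields and ignores any container overhead. `ReadRecord` and `payload_bytes` are made-up names for illustration, not identifiers from this repository. The fields add up to 128 bits (16 bytes), and 837,504,748 such records come to about 13.4 GB of raw payload.

```cpp
#include <cstdint>

// One stored entry per read, as described above (hypothetical layout).
struct ReadRecord
{
    std::uint32_t start;        // alignment start, 32 bits
    std::uint32_t end;          // alignment end, 32 bits
    std::uint64_t file_offset;  // offset into the BAM file, 64 bits
};

// 32 + 32 + 64 = 128 bits, i.e. 16 bytes with no padding on common ABIs.
static_assert(sizeof(ReadRecord) == 16, "expected a 16-byte record");

// Raw payload size in bytes for n_reads records (no container overhead).
constexpr std::uint64_t payload_bytes(std::uint64_t n_reads)
{
    return n_reads * sizeof(ReadRecord);
}

// Read count quoted above for HG002.hs37d5.2x250_sorted.bam.
constexpr std::uint64_t hg002_reads = 837504748ULL;

// 837,504,748 * 16 bytes = 13,400,075,968 bytes (about 13.4 GB).
static_assert(payload_bytes(hg002_reads) == 13400075968ULL,
              "payload for the HG002 read count");
```

Actual memory use can be higher than this figure, for example if the list is grown incrementally and reallocates along the way.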

There has to be a way to construct the tree without having to store this entire list. We need the start/end/offset for the interval nodes (start and offset of left-most read, end of right-most read), and the start/end of each record for the median calculations...
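To make the two requirements concrete, here is a rough sketch of what each one needs from the records. All names (`ReadRecord`, `NodeSummary`, `summarize`, `median_endpoint`) are hypothetical, "right-most" is taken to mean the largest end coordinate, and the median is assumed to be taken over all start and end coordinates, which may differ from what the tree actually does.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ReadRecord
{
    std::uint32_t start;
    std::uint32_t end;
    std::uint64_t file_offset;
};

// What an interval node keeps: start and offset of the left-most read,
// end of the right-most read.
struct NodeSummary
{
    std::uint32_t start;
    std::uint32_t end;
    std::uint64_t file_offset;
};

// Assumes `reads` is non-empty and sorted by start (coordinate-sorted BAM).
// Only a running maximum is needed, so this part could be streamed.
NodeSummary summarize(const std::vector<ReadRecord>& reads)
{
    NodeSummary s{reads.front().start, reads.front().end,
                  reads.front().file_offset};
    for (const ReadRecord& r : reads)
        s.end = std::max(s.end, r.end);
    return s;
}

// Upper median over all start/end coordinates. Assumes `reads` is non-empty.
// This is the part that needs every record's endpoints at once.
std::uint32_t median_endpoint(const std::vector<ReadRecord>& reads)
{
    std::vector<std::uint32_t> points;
    points.reserve(reads.size() * 2);
    for (const ReadRecord& r : reads)
    {
        points.push_back(r.start);
        points.push_back(r.end);
    }
    auto mid = points.begin() + static_cast<std::ptrdiff_t>(points.size() / 2);
    std::nth_element(points.begin(), mid, points.end());
    return *mid;
}
```

The node summary only needs the first record plus a running maximum, so it is the median step that forces the full list of starts and ends to be held.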

joshuak94 commented 3 years ago

htslib::hts_idx_get_stat returns the number of mapped and unmapped reads for a chromosome if you give it a BAI index.

joshuak94 commented 3 years ago

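htslib::hts_idx_get_n_no_coor looks like it returns the total number of unplaced reads (reads without coordinates), i.e. the unmapped reads not assigned to any chromosome.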

joshuak94 commented 3 years ago

Resolved by #25: we now construct one interval tree per chromosome.
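For anyone reading later, a minimal sketch of why per-chromosome construction helps, assuming the input is coordinate-sorted so that reads from one chromosome are contiguous. This is not necessarily how #25 implements it; `Read`, `for_each_chromosome`, `next_read` and `build` are hypothetical names. Only one chromosome's records are buffered at a time, so peak memory is bounded by the largest chromosome rather than the whole file.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct Read
{
    std::int32_t  ref_id;       // chromosome / reference index
    std::uint32_t start;
    std::uint32_t end;
    std::uint64_t file_offset;
};

// Pulls reads from `next_read` (returns false when the input is exhausted)
// and calls `build(ref_id, reads)` once per run of reads sharing a ref_id.
// Returns the largest number of reads buffered at any one time.
std::size_t for_each_chromosome(
    const std::function<bool(Read&)>& next_read,
    const std::function<void(std::int32_t, const std::vector<Read>&)>& build)
{
    std::vector<Read> buffer;
    std::size_t peak = 0;
    Read r{};
    while (next_read(r))
    {
        // A new ref_id means the previous chromosome is complete.
        if (!buffer.empty() && buffer.front().ref_id != r.ref_id)
        {
            build(buffer.front().ref_id, buffer);
            buffer.clear();  // drop the previous chromosome's records
        }
        buffer.push_back(r);
        peak = std::max(peak, buffer.size());
    }
    if (!buffer.empty())
        build(buffer.front().ref_id, buffer);  // flush the last chromosome
    return peak;
}
```

Here `build` stands in for whatever constructs the interval tree from one chromosome's records.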