marcelm / dnaio

Efficiently read and write sequencing data from Python
https://dnaio.readthedocs.io/
MIT License

Chunking for uBAM #140

Open rhpvorderman opened 2 weeks ago

rhpvorderman commented 2 weeks ago

Currently chunking only works for FASTQ. See https://github.com/marcelm/cutadapt/issues/811

marcelm commented 2 weeks ago

Oh, interesting. I guess this needs to be done at the bgzip level?

rhpvorderman commented 2 weeks ago

No, not really. A bgzip file is just a series of concatenated gzip members (blocks). Nothing requires the block boundaries to line up with BAM record boundaries: a record can start in one block and end in the next, even if it would fit entirely in a block of its own. Nanopore records will often exceed the maximum size of a bgzip block (64 KiB) anyway.
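To illustrate the point about block boundaries, here is a minimal sketch using only the Python standard library. It assumes plain gzip members as a stand-in for real bgzip blocks (which additionally carry a `BC` extra field), and the record payload is made up for the example; it is not how dnaio reads BAM.

```python
import gzip
import struct

# A fake "record": a 4-byte little-endian length followed by the payload.
# The payload is illustrative only, not a valid BAM alignment.
payload = b"ABCDEF"
record = struct.pack("<I", len(payload)) + payload

# Split the record at an arbitrary byte and compress each half as its own
# gzip member, mimicking a record that straddles two blocks.
member_1 = gzip.compress(record[:5])
member_2 = gzip.compress(record[5:])

# Decompressing the concatenated members yields one continuous stream in
# which the record is whole again.
stream = gzip.decompress(member_1 + member_2)
```

Because the decompressed stream is continuous, a record parser never needs to know where one compressed block ended and the next began.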

So we can just decompress the whole thing as one big stream and parse the records out of it. We already do this for single-end data. For chunking, we can make use of the fact that each BAM record stores its own size (the block_size field) at the beginning, so there is no need to parse the entire record to find where it ends. Chunking should be much faster than for FASTQ.
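A rough sketch of what that could look like, assuming the buffer holds already-decompressed data that starts at a record boundary (the BAM header has been consumed separately). The function name `complete_records_end` is hypothetical and not part of dnaio's API.

```python
import struct


def complete_records_end(buf: bytes, start: int = 0) -> int:
    """Return the offset just past the last complete BAM record in buf.

    Each record begins with a little-endian uint32 block_size that gives
    the length of the rest of the record, so we can hop from record to
    record without looking at their contents.
    """
    pos = start
    while pos + 4 <= len(buf):
        (block_size,) = struct.unpack_from("<I", buf, pos)
        next_pos = pos + 4 + block_size
        if next_pos > len(buf):
            # The record is only partially in the buffer; stop before it.
            break
        pos = next_pos
    return pos
```

Everything up to the returned offset can be handed off as one chunk, and the remaining bytes are carried over to the front of the next buffer.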