rhpvorderman opened 1 month ago
> Oh, interesting, I guess this needs to be done on the bgzip-level?
No, not really. Bgzip is just a series of concatenated gzip blocks. There is no requirement for the blocks to be split at the BAM record level: a BAM record can start in one block and end in another, even if it would fit entirely in a block of its own. Nanopore records often exceed the maximum size of a bgzip block (64 KiB of uncompressed data).
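As a small illustration of the "concatenated gzips" point: Python's `gzip` module already reads across gzip member boundaries transparently, which is essentially why a BGZF file can be treated as one continuous decompressed stream. This is only a toy sketch with two hand-made members, not real BGZF data.

```python
import gzip
import io

# Two independently compressed gzip members, concatenated back to back,
# mimicking how bgzip stores a file as a series of gzip blocks.
blob = gzip.compress(b"first block ") + gzip.compress(b"second block")

# GzipFile transparently continues into the next member, so the
# payload comes back as one continuous stream.
with gzip.open(io.BytesIO(blob), "rb") as fh:
    data = fh.read()

# data == b"first block second block"
```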
So we can just decompress the whole thing as one big file stream and parse the records out, as we already do for single-end data. For chunking we can exploit the fact that each BAM record stores its block size at the start, so there is no need to parse the entire record. Chunking should be much faster than for FASTQ.
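A minimal sketch of that chunking idea: each BAM alignment record begins with a little-endian `uint32` `block_size` giving the length of the rest of the record, so record boundaries can be found by reading 4 bytes and seeking, without parsing any fields. The helper name `iter_record_spans` and the fake in-memory records are illustrative only; a real implementation would first consume the BAM header.

```python
import io
import struct


def iter_record_spans(stream):
    """Yield (offset, length) for each BAM-style record in an
    uncompressed stream, using only the leading block_size field.
    Assumes the BAM header has already been consumed."""
    offset = 0
    while True:
        size_bytes = stream.read(4)
        if not size_bytes:
            return
        (block_size,) = struct.unpack("<I", size_bytes)
        stream.seek(block_size, io.SEEK_CUR)  # skip the body unparsed
        total = 4 + block_size
        yield offset, total
        offset += total


# Toy demonstration with fake record payloads (not a real BAM file):
payloads = [b"a" * 10, b"b" * 3, b"c" * 7]
data = b"".join(struct.pack("<I", len(p)) + p for p in payloads)
spans = list(iter_record_spans(io.BytesIO(data)))
# spans == [(0, 14), (14, 7), (21, 11)]
```

Grouping consecutive spans until a chunk reaches a target byte size would then give record-aligned chunks for the workers.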
Currently chunking only works for FASTQ. See https://github.com/marcelm/cutadapt/issues/811