Allow chunking uBAM - Githubissues

rhpvorderman commented 1 month ago

This allows reading raw BAM records with no header present. This format cannot be detected so must be provided beforehand. "bam_no_header" was chosen as the format name.

A chunker is also provided that can provide chunks of BAM records. These can be read using the "bam_no_header" format in dnaio.open.

With these additions cutadapt should be able to set up multithreaded reading of BAM records.

rhpvorderman commented 1 month ago

@marcelm I did a proof concept here: https://github.com/marcelm/cutadapt/pull/812/files

It is possible to get multithreaded BAM reading going with just the changes in this PR.

In the future we need to think how to handle the header. Ideally the header is copied from the input file with an additional PG line added from cutadapt.

EDIT: I propose delaying releasing 1.3.0 until BAM writing and tag support is implemented. Some of the tags, mainly the methylation tag, also need to be cut properly if cutadapt is going to be useful for Nanopore data. It is not going to be "fun" to implement this, but on the other hand this will elevate dnaio and cutadapt over the competition, making it the only viable tool.

marcelm commented 1 month ago

Awesome, thanks! Will hopefully be able to review this towards the end of the week, no time today.

rhpvorderman commented 1 month ago

@marcelm . Friendly reminder ping. I am currently watching cutadapt blazing through some nanopore files at only one core, which reminded me of this PR. Please do not take this as pressure. I am happy to wait for a few weeks more.

rhpvorderman commented 3 weeks ago

Awesome, this looks good! And sorry for the delay. It did not actually take that long to review, should have done that earlier.

There is no haste at all. I can run code from git if needed.

So one thought I had is that this way of chunking only works for single-end reads. So that is one more issue that would need to be solved in order to enable BAM paired-end input.

Let's think about that when the use case arises. In the event, only support for name-sorted reads seems viable. Indexing support is best done through htslib and in that case pysam is a much more viable option.

With Python’s support for free threading coming, I’d actually hope that we could get rid of all this chunking business in the future. Splitting the input up into chunks of bytes is only necessary because sending already parsed records from one process to another using multiprocessing is expensive. With threads, we could have one thread that is dedicated to parsing the input and supplying worker threads with lists of records. The parser thread can then make decisions based on the content of the records, that is, it can ensure that a worker always gets both paired-end reads.

Yes that would be great! Looks like we still need to wait a bit on that. I am curious what the impact of that change will be.

marcelm commented 3 weeks ago

Feel free to merge when ready.

rhpvorderman commented 3 weeks ago

Done. Thanks for the review!

marcelm / dnaio

Allow chunking uBAM #141