Closed rhpvorderman closed 1 year ago
I contributed back the encoded sequence parsing to htslib: https://github.com/samtools/htslib/pull/1677 . It turns out that the code is already faster even without the SSSE3 routine due to the while loop. QED while loops are better than for loops for walking over bits of memory. It makes sense because a loop control variable not needed, so that is one less operation. On the other hand, integer operations are lightning fast so it is a bit surprising.
BTW, it’s super great that this gets better performance than samtools fastq
. However – and sorry for dampening your enthusiasm – at this stage, it’s more important for me to get the semantics correct (single-end vs paired-end, what to do if certain flags are set in the input).
SSSE3 was introduced 17 years ago, so I doubt anyone is still using CPUs that can't handle it.
The Steam hardware survey agrees with you. I’m not sure I mentioned this before, but I’m fine with requiring x86-64-v2 (which goes up to SSE4.2).
However – and sorry for dampening your enthusiasm – at this stage, it’s more important for me to get the semantics correct (single-end vs paired-end, what to do if certain flags are set in the input).
Completely agree. I just wanted to make sure that the fundamentals (performance-wise) are solid.
The Steam hardware survey agrees with you. I’m not sure I mentioned this before, but I’m fine with requiring x86-64-v2 (which goes up to SSE4.2).
You didn't mention it before, but I also agree on this part. I think minimap2 goes up to SSE4.1 and I think the tradeoff in cost vs benefits is very clear in that case. Same here with the SSSE3 routine.
I am currently working on other things, but I will come back later (hopefully this week) to address the BAM parser concerns.
I also don’t think it’s super obvious what a BAM reader does. It could just return unmapped reads, for example, or have certain requirements on the BAM file. What happens to supplementary alignments, secondary alignments etc. So some decisions need to be made and those need to be documented.
When I wrote the parser I only had unaligned BAM for nanopore reads in mind. That is going to be the primary use case. These reads are single-end (allthough duplex reads are possible but that works differently). So how to best handle this? I think it is going to be quite some effort to handle all the alignment cases when cutadapt will be dealing with a 100% uBAM in practice. Maybe it is best to only allow to read BAM in single-end format, no paired files, and with no knowledge of mates. It can check the alignment flags and throw an error that it works on unaligned BAM only. Alternatively some filter could be made that catches everything but secondary alignments, but I am not quite sure if the semantics can be completely right.
Thanks! I think what also needs to be addressed is that BAM files can contain mixed single-end and paired-end data. IIUC, this version of the PR just returns all reads as single-end reads, no matter whether the READ1 or READ2 flag is set.
It is going to be next to impossible to get the correct pairs unless they are ordered in interleaved fashion. (In which case dnaio already has code to handle it.) It is possible using BAM indexing, but that is difficult to implement, and it relies on alignment which is not great in this use case.
What do you think about this?
Maybe one could return tuples of SequenceRecords like the MultipleFileReader
already does. The tuples would have one or two elements depending on whether it’s a single-end or paired-end read.
And yes, I think requiring name-sorted BAMs (that is, same order as interleaved FASTQs would be) is totally fine. I think the focus should be supporting uBAM (and if it happens to work for a file with mapped reads, that’s also fine).
These are the bits that can be found in the BAM sequence. Most of them have difficult implications to how the reads should be interpreted in the terms of dnaio. So I checked what actual flags are used in dorado ubam.
samtools view HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex.bam | cut -f 2 | sort| uniq -c
7646392 4
So that is easy then. How about:
if flags != 4:
raise NotImplementedError("dnaio was implemented with unmapped BAM files in mind to support nanopore sequencer input. Mapped BAM files are not supported. Please use samtools fastq.")
In that way there is a good guarantee that it does not let any weird garbage through.
And yes, I think requiring name-sorted BAMs (that is, same order as interleaved FASTQs would be) is totally fine. I think the focus should be supporting uBAM (and if it happens to work for a file with mapped reads, that’s also fine).
Won't that work automatically already if the BAM is passed with interleaved=True
? It will check then if the nth and the n+1th sequence have the same name right? It will throw an error otherwise. So I don't think there is a specific need to implement this separately for the BAM reader.
How do you think about this? Can just the one check be implemented and this be done?
BTW, did you know you can run samtools flags
to get the table with SAM flags? You can also give it a flag value and it will explain what that means (for example, samtools flags 99
outputs PAIRED,PROPER_PAIR,MREVERSE,READ1
).
So that is easy then. How about:
if flags != 4: raise NotImplementedError("dnaio was implemented with unmapped BAM files in mind to support nanopore sequencer input. Mapped BAM files are not supported. Please use samtools fastq.")
Yes, sounds good to be quite restrictive about the type of BAM that is accepted. We can later relax those restrictions when we have an actual use case.
Won't that work automatically already if the BAM is passed with
interleaved=True
? It will check then if the nth and the n+1th sequence have the same name right? It will throw an error otherwise. So I don't think there is a specific need to implement this separately for the BAM reader.
That should mostly work, except that for an unmapped BAM with paired-end reads, we would need to require the flag values to be 0x45 (PAIRED,UNMAP,READ1) and 0x85 (PAIRED,UNMAP,READ2). I think for this to be usable in Cutadapt, that would need to be implemented, but I’d be happy to do that myself when the time comes.
Let’s just add single-end support for now, which is what you need.
BTW, you can also work in branches in this repo directly instead of in your own fork (makes it also sligthly easier for me to check out branches).
All the review comments are addressed. I also made sure the SSSE3 routine is checked in the CI. The CFLAGS variable is inherited by tox when building the package, no extra configuration is needed. (I have verified this using a print statement in the SSSE3 routine). It is only tested on linux for one version of python but I suppose that is enough as there is no interaction with cpython api for that particular function.
BTW, you can also work in branches in this repo directly instead of in your own fork (makes it also sligthly easier for me to check out branches).
I will do that on the next PR. For this PR that would mean I would need to open a new PR. It seems I cannot choose a different base branch on the fly.
@marcelm Have your concerns be adequately addressed?
Please give me two more days
Sorry, it was not my intention to rush you. Take all the days you need. We are expecting a lot of nanopore requests in the feature, yet I have had only one request yet. And that has already been handled.
Let’s do this. I’ll add a changelog entry after merging.
Hi, sorry for my late response to your review. I am currently at sick leave (again unfortunately). I expect to be working again tomorrow. As regards tot the BAM parser. I intend to follow up with tag parsing, putting the tags in the header in the same fashion as samtools fastq does this, so this information is retained for use in/after cutadapt.
Thanks and no worries, hope you get better soon!
I want to do tag parsing at a later date so the code review (and my efforts) can be split into two sessions. Getting name, sequence and qualities out is by far the most important so I decided to do that first.
As for performance, I enabled SSSE3 (slightly later than SSE3) support by default on linux. This makes the parsing of the encoded sequences 5 times faster. This does not result in a great speedup since that was already quite fast, but still, it is a great fit for the PSHUFB instruction, so it would be bad not to use it. Users can still easily compile from source setting
DNAIO_CPU_BASIC=1
in their environment. That will disable the use of the-mssse3
compile flag. Linux users will always have a compiler installed, so this should not be a problem. SSSE3 was introduced 17 years ago, so I doubt anyone is still using CPUs that can't handle it.As for the performance, even though I did not do anything special, it is much faster than samtools fastq. I benchmarked with this script on python 3.11
Samtools fastq
Dnaio with SSSE3
dnaio without SSSE3
EDIT: So the speed difference is quite good, but even more so when you realize that python-isal is doing the bgzip parsing, which takes approximately 800ms. bgzip however only takes 250 ms to completely decompress this (uncompressed) file (basically it will probably optimize by just straight out copying the uncompressed data, that trick is very fast). However that means that samtools is taking 2 seconds to create the fastq while dnaio only takes 200ms. Samtools is really doing something weird here. The SSE2 instructions on the quality copy will make a difference, sure, but not this much.