marcelm / dnaio

Efficiently read and write sequencing data from Python
https://dnaio.readthedocs.io/
MIT License
62 stars 9 forks

check quality score integrity #111

Closed rhpvorderman closed 1 year ago

rhpvorderman commented 1 year ago

In quite a lot of routines concerning quality scores I use the following pattern.
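The snippet itself did not survive in this copy of the thread; a minimal C sketch of such a per-base check-and-convert loop, pieced together from the description below (subtract 33, reject anything above 93) — function and variable names are mine, not dnaio's:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert ASCII-encoded FASTQ qualities to phred scores, bounds-checking
   each byte. Valid phred scores are 0-93, i.e. ASCII 33 ('!') to 126 ('~'). */
int qualities_to_phred(const uint8_t *quals, size_t n, uint8_t *phred_out)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t q = quals[i] - 33;  /* FASTQ stores phred + 33 */
        if (q > 93) {               /* also catches bytes below 33: the
                                       unsigned subtraction wraps around */
            return -1;              /* out-of-range quality character */
        }
        phred_out[i] = q;
    }
    return 0;
}
```

The `q > 93` comparison is the bounds check the rest of this issue is about: it is executed once per base, in every consumer of the qualities.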

This is also how it is done in the cutadapt maximum expected errors code.

How nice would it be if the bounds check could be removed. That would make things faster. It would be great if dnaio could guarantee this.

I think it can do so quite cost-effectively. Simply do the above pattern without the lookup and instead store each quality after checking it. That is a poor-performance version of memcpy which can be used when the qualities PyObject is created. It can then be further sped up using SSE2 vector instructions: _mm_loadu_si128, _mm_cmpgt_epi8, _mm_cmplt_epi8, _mm_storeu_si128 and _mm_or_si128. A quite simple loop: data is loaded, both bounds are checked (there is no unsigned comparison for 8-bit integers in SSE, but signed comparisons suffice here), the results of both bounds checks are ORed into a register that accumulates violations, and the vector is then stored at the destination. At the end, inspect the accumulator vector; if any bound has been crossed, return an error.
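A sketch of that SSE2 loop under my own naming (not dnaio's actual code): it copies and range-checks 16 bytes at a time, only ORing comparison results into an accumulator inside the loop, and inspects the accumulated mask once at the end.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Copy qualities to dest while range-checking them, 16 bytes at a time.
   Valid bytes are ASCII 33 ('!') through 126 ('~'). Signed byte
   comparisons suffice: any byte >= 128 is negative as a signed value
   and is caught by the lower-bound check. Returns 0 on success, -1 if
   any byte was out of range. */
int copy_and_check_qualities(uint8_t *dest, const uint8_t *src, size_t n)
{
    const __m128i lower = _mm_set1_epi8(33);
    const __m128i upper = _mm_set1_epi8(126);
    __m128i violations = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        /* OR both bounds-check results into the accumulator register;
           no branch inside the hot loop. */
        violations = _mm_or_si128(violations, _mm_cmplt_epi8(v, lower));
        violations = _mm_or_si128(violations, _mm_cmpgt_epi8(v, upper));
        _mm_storeu_si128((__m128i *)(dest + i), v);
    }
    int bad = _mm_movemask_epi8(violations) != 0;
    for (; i < n; i++) {  /* scalar tail for the last < 16 bytes */
        if (src[i] < 33 || src[i] > 126) bad = 1;
        dest[i] = src[i];
    }
    return bad ? -1 : 0;
}
```

Note this only checks that the raw ASCII bytes fall in the printable FASTQ range; the subtract-33 conversion can still happen later, but without a per-base branch.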

Because dnaio does the bounds check, it can be eliminated from any program using dnaio. That is quite handy for performance.

rhpvorderman commented 1 year ago

While in theory this is a good idea, in practice you are retrieving the quality score from memory anyway. Subtracting 33 and the greater-than-93 check are trivial (this is pipelined, so it has very low latency). The branch is never taken for millions and millions of bases on end. As a result, the cost of checking while reading is effectively free. An extra check in the parser, however, is not, as the whole array of values is read again.

I tested this on sequali, where I added the check in the parser and removed it from the two loops that were reading the qualities. It turned out to be slower, as I am now reading the qualities three times instead of two. Too bad. Easy-to-predict branches really are very cheap.

On the upside: this means that defensive programming costs very little in performance. When errors are in the "almost non-existent" category, the branch predictor really helps.