marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Make expected errors use doubles again for increased precision #727

Closed rhpvorderman closed 9 months ago

rhpvorderman commented 9 months ago

Hi Marcel,

It seems I made an error of judgement in #726. In that thread I stated that single precision is enough. That statement is incorrect, or at best a half-truth.

Since phred scores are between 0 and 93, they have only two significant decimal digits. Two decimal digits means 100 distinct values, which a floating point number with a significand of at least 7 bits can represent. Single precision has a 23-bit significand, so that is ample.

However, this does not properly account for the calculation required to add two phred units. A 23-bit significand means that single precision numbers can only resolve values that are within a factor of 2^23 ~= 8 million of each other. Adding the error probability of a phred of 70 to that of a phred of 0 (a factor of 10 million) is therefore barely possible. Adding a phred of 93 to a phred of 0 simply does not register. That means single precision can never be sufficient as an intermediate when summing large numbers of phred probabilities, because some phreds simply get lost.
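
To illustrate (a standalone toy example, not cutadapt code):

```c
#include <math.h>
#include <stdio.h>

/* Toy illustration: adding the error probability of a phred-93 base
   (10^-9.3 ~= 5e-10) to that of a phred-0 base (1.0) does not change
   the sum in single precision, but does in double precision. */
int main(void)
{
    float  p0_f = 1.0f, p93_f = powf(10.0f, -9.3f);
    double p0_d = 1.0,  p93_d = pow(10.0, -9.3);

    printf("float : %.12g\n", (double)(p0_f + p93_f));  /* prints 1 */
    printf("double: %.12g\n", p0_d + p93_d);            /* prints 1.0000000005... */
    return 0;
}
```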

This is still true when looking at more practical use cases. In practice the highest Illumina phred is 37, so 10**-3.7 versus 1.0 is roughly 13 bits. Combined with a read length of 151 (9 bits), that makes for 22 bits of precision needed to account for all probabilities. Float has that. Going beyond Illumina to nanopore, however, it immediately becomes clear that there aren't enough bits, as nanopore read lengths can reach 2 million (21 bits) and the phred range is between 0 and 50ish (17 bits).

A double precision number, with its 53-bit significand, is however ample for most practical use cases. I have sharpened the test case again so that it is correct even when using the extremes of the phred range.

Speed does not mean anything when the result is not correct. Luckily the vectorization can be kept, so the expected errors calculation is still vectorized and still faster than 4.4.

rhpvorderman commented 9 months ago

Never mind what I said about the speed. I was looking at loop unrolling, specifically at how to write a loop such that the compiler unrolls it. Then I found one of the best answers on stackoverflow ever: https://stackoverflow.com/a/36591776/16437839

This answer notes that when there is a dependency across loop iterations, GCC does not unroll the loop. The example given was an accumulating addition, which is exactly the calculation here. In order to leverage out-of-order execution it is best to accumulate into several different variables, computing intermediate results that are summed afterwards, roughly like the sketch below.
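
A simplified sketch of that multi-accumulator pattern (with a hypothetical sum_probabilities function, not the actual cutadapt code):

```c
#include <stddef.h>

/* Sketch of the multi-accumulator pattern: four independent partial sums
   break the loop-carried dependency chain, so the compiler can unroll the
   loop and the CPU can overlap the additions. */
static double sum_probabilities(const double *probs, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += probs[i];
        acc1 += probs[i + 1];
        acc2 += probs[i + 2];
        acc3 += probs[i + 3];
    }
    for (; i < n; i++) {   /* leftover elements */
        acc0 += probs[i];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```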

So I did, and the result is faster than my previous float implementation. So it is both faster and more accurate. How awesome is that? I am elated!

marcelm commented 9 months ago

Cool, it’s even faster! Interesting SO answer, I need to keep that in mind.

Accurately computing the sum of an array of floating point numbers is a topic in itself. To get even more accurate results, one could use Kahan summation; see also this blog article. But that is probably not necessary.
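
For reference, the idea looks roughly like this (illustrative sketch only):

```c
#include <stddef.h>

/* Kahan (compensated) summation: a running compensation term recovers
   the low-order bits that are lost when a small probability is added
   to a much larger running sum. */
static double kahan_sum(const double *probs, size_t n)
{
    double sum = 0.0;
    double c = 0.0;                  /* running compensation */
    for (size_t i = 0; i < n; i++) {
        double y = probs[i] - c;     /* corrected next term */
        double t = sum + y;          /* low-order bits of y may be lost here */
        c = (t - sum) - y;           /* (t - sum) is what actually got added */
        sum = t;
    }
    return sum;
}
```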

> In practice the highest illumina phred is 37. So 10**-37 [...]

It’s not that bad! You forgot to divide by 10: p=10**(-q/10). So it’s $10^{-3.7}$.

rhpvorderman commented 9 months ago

You truly are a fountain of knowledge. Thanks again! I think better precision than double is not yet required, unless maybe once nanopore starts sequencing entire chromosomes. Let's do the thought experiment:

250 million basepairs. Roughly 2 ** 28. 28 bits. A quality range of 0 to 50ish: 17 bits.

That's 45 bits, we have 53. That should still be ample. Seems this is futureproof!
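
Spelling that out with the same back-of-the-envelope assumptions (250 million bases, probabilities spanning the full 0 to 50 quality range):

$$\log_2(2.5 \times 10^8) \approx 27.9, \qquad \log_2\!\left(\frac{10^{0}}{10^{-5}}\right) \approx 16.6, \qquad 27.9 + 16.6 \approx 44.5 < 53$$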

> It’s not that bad! You forgot to divide by 10: p=10**(-q/10). So it’s $10^{-3.7}$.

Oops. That was a typo. I have corrected it. The 13 bits of precision needed is accurate: $10^{-3.7}$ is roughly one in 5000, so that needs 13 bits (2**13 == 8192).

rhpvorderman commented 9 months ago

Creating this PR was massively helpful too. The lessons learned were applied in two PRs for sequali: one unleashing the power of SSE2 for adding doubles (https://github.com/rhpvorderman/sequali/pull/31), and another achieving 40%+ execution time reductions through loop unrolling (https://github.com/rhpvorderman/sequali/pull/32).
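
For illustration, the SSE2 idea is roughly the following (a simplified sketch, not the actual sequali code):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: sum doubles two at a time with SSE2; two vector accumulators
   keep the dependency chain short, mirroring the scalar multi-accumulator
   trick above. */
static double sse2_sum(const double *probs, size_t n)
{
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(probs + i));
        acc1 = _mm_add_pd(acc1, _mm_loadu_pd(probs + i + 2));
    }
    double pair[2];
    _mm_storeu_pd(pair, _mm_add_pd(acc0, acc1));
    double sum = pair[0] + pair[1];
    for (; i < n; i++) {   /* leftover elements */
        sum += probs[i];
    }
    return sum;
}
```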

I knew that understanding the hardware was useful, but at this point it starts to feel as if I am wielding magic. I am sure this feeling will wear off, but for now I am enjoying the surge of "programmer hubris" hormones ;-).

marcelm commented 9 months ago

Forgot to merge, LGTM!