samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

Change bounds checking in probaln_glocal #1616

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

In 3 places when filling out forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3".

Understanding this code is tricky however! My hypothesis that the upper bounds check here is because we use u, u+1 and u+2 in array indices, and we iterate with "k <= l_ref" so we can access one beyond the end of the array.

However the arrays are allocated to be dimension (l_query+1)*i_dim, so (assuming correctness of l_ref vs l_query in bw/i_dim calculation) we have compensated for this over-step already. This has been validated with address sanitiser.

The effect of the i_dim-3 limit is that having band width equal to query length causes the final state element to be incorrectly labelled as an insertion.

This hypothesis may however be incorrect, as the lower bound "u < 3" also seems redundant, yet changing this to "u < 0" does give different quality scores in about 1 in 4000 sequences (tested on 10 million illumina short read BAQ calculations). Hence for now this is left unchanged. In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output.

Fixes #1605

jkbonfield commented 1 year ago

If @pd3 or @lh3 care to review this change it would help. Trying to understand the indexing here is non-trivial, so someone more familier with the code should give it a glance over.

daviesrob commented 1 year ago

I think the fix may be correct, but the comment isn't. The loop being altered looks like it's trying to sum over the values set in the previous loop when i == l_query. The confusing part is that instead of working out which values were set in the same way (from beg to end inclusive), it instead iterates over the entire range 1 ... l_ref and then rejects the ones that weren't set by doing a check on the calculated value of u. So it's not trying to prevent an out-of-bounds access, but instead working out which values it's supposed to be summing over.

The tricky part of this is trying to work out if the restriction on u is actually doing the same job as iterating from beg to end in the earlier loop. It looks like it does from experimentation, but it's a bit difficult to prove. However, it probably doesn't matter if it goes over as anything unset should have been zeroed by calloc() and so shouldn't affect the final sum. Hopefully.

jkbonfield commented 1 year ago

Feel free to correct the comment and push back a change. The code is convoluted so any improvement to documentation is most welcome!

daviesrob commented 1 year ago

Comments adjusted. Hopefully the new ones make what's going on a bit clearer.

This code needs more units tests, and refactoring.

jkbonfield commented 1 year ago

Refactoring is out of scope for this PR IMO and would just muddy the evaluation and dramatically slow down acceptance. I'd prefer one PR to fix bugs, and a completely different one to refactor the code which may take much longer to get done.

Testing is more interesting. I'd extend the tests if I could find any, but it seems to be entirely outside of our testing framework bar user-tests in things like "samtools mpileup". That however has no way of setting things like band widths, so again I think full unit tests are probably best punted to a future PR, along side refactoring.

As for the comments you added, the first is correct, but you're stating the bug in 1605 only existed between 1.8 and 1.17. Have you explicitly checked this?

For the "normal" case of bw2 < l_ref, u >= i_dim - 3 is identical to the 1.7 code of u >= bw2*3+3. It was changed in 5d7a7823a to prevent excessive memory usage. So there was no bug before or after 1.8 in that situation. For bw2 >= l_ref, as in this bug, it previously still used i_dim = bw2*3+6 too as the ?: conditional only appeared with your memory-reduction PR, so the change of u check has no impact there either. Is it the case that the loop termination (k <= l_ref) coupled to the over-sized memory means the off-by-one error is basically avoided through luck?

I think the comment is correct therefore, but can't be fully sure.

It also makes me wonder whether the bug was actually the ?: initialisatio of i_dim, and whether it should have just been l_ref*3+9 instead so it accounted for additional termination space. Although I think what we have is equivalent anyway.