torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
121 stars 23 forks source link

risk of silent 8-bit unsigned integer overflow in score matrix #177

Closed frederic-mahe closed 10 months ago

frederic-mahe commented 10 months ago

The score tables produced by matrix.cc are good candidates for a conversion to std::array. While looking at the code in score_matrix_read(), I've noticed that there is no check that the penalty_mismatch value actually fits in 16-bit and 8-bit unsigned integer types.

The way penalty_mismatch is computed is a bit complicated, but when all other pairwise alignment options are left untouched (default values), predictable penalty_mismatch values can be obtained by setting the option --mismatch-penalty to 7 + 12a, where a is a positive integer.

For instance:

That last value is bigger than 255 and will overflow a 8-bit unsigned integer.

To check that, let's compile swarm without optimizations:

cd ./swarm/
git pull
make clean && make -j PROFILE=1

In Emacs, run gdb:

gdb -i=mi ./bin/swarm
gdb-many-windows

# set up
watch parameters.penalty_mismatch
display parameters.penalty_mismatch
break matrix.cc:66

# run
run -d 2 <(printf ">s1_10\nAAA\n>s2_1\nAAT\n") --mismatch-penalty 127
continue
continue

# print the first 128 positions of score_matrix_8:
print *score_matrix_8@128

quit
$1 = '\b' <repeats 16 times>, '\000' <repeats 16 times>, "\b\000", '\b' <repeats 14 times>, '\000' <repeats 16 times>, "\b\b\000", '\b' <repeats 13 times>, '\000' <repeats 16 times>, "\b\b\b\000", '\b' <repeats 12 times>, '\000' <repeats 15 times>

As expected, score values are set to character \b, which has the decimal value 8 (264 mod 256). The penalty for a mismatch is underestimated.

A test could be added in swarm.cc, function args_check(), at least to check if parameters.penalty_mismatch is lower than 256. @torognes Should I do that?


Secondary question

score_matrix_read() produces 32 by 32 tables looking like this 8 by 8 mock-up:

1 1 1 1 0 0 0 0
1 0 1 1 0 0 0 0
1 1 0 1 0 0 0 0
1 1 1 0 0 0 0 0
...

@torognes do you confirm that this is the expected outcome?

torognes commented 10 months ago

It's good that you spotted this potential problem. I'll have a look at it next week.

frederic-mahe commented 10 months ago

Regarding the secondary question, I had fun this week trying to refactor matrix.cc. I ended up with a single header file matrix.h (on the dev branch) containing a template function that creates std::array score matrices of three different types (unsigned char, unsigned short and int64_t).

Most of the code is checked and optimized away at compile time, the remaining loop is vectorized, the creation-destruction of the arrays is managed automatically by the compiler (no more new and delete), and the code passes all clang-tidy checks (zero warning) :-)

matrix.h contains an example of a matrix as produced by the function. Could you please confirm that this is the expected outcome?

torognes commented 10 months ago

Regarding question 1: Yes, please go ahead and add a check for overflow.

torognes commented 10 months ago

Regarding question 2:

Here is some reasoning behind the score matrices:

Swarm only allows the nucleotide symbols A, C, G and T/U, and no ambiguous symbols. It generally uses a simple encoding where A=1, C=2, G=3, and T/U=4. This is shown in db.cc. I do not think zero (0) is actually used, but it may used to indicate a gap (-) in alignments.

When the sequences are stored in compressed form in memory using only 2 bits per nucleotide, the encoding A=0, C=1, G=2, and T/U=3 is used. It easy to convert between these two encodings by adding or subtracting 1.

The code for performing the SIMD alignments is similar in Swarm, VSEARCH, and SWIPE, and is based on score matrices that have 32 x 32 elements. This is done to accommodate other potential encodings, including amino acids, where the number of symbols can be 20 or more. Alternative nucleotide encodings use 16 different symbols, including - (no residue) or N (any residue) and all ambiguous symbols, for a total of 16 symbols. Additionally, it is best to keep the size to a power of 2. To be able to use more or less the same SIMD code (which is hard to modify), the size of 32 x 32 is used.

In Swarm, where global alignments are performed, we found it easier to just look at the differences between sequences, measuring the distance between the sequences, in contrast to local alignments where the amount of similarity is measured. The score matrices therefore contain scores that indicate the mismatches or differences. The given gap penalties and match and mismatches scores are therefore converted to a system were only penalties for mismatches and gaps are used, in order to find the same actual alignments. Matches therefore have zero (0) score, while gaps and mismatches indicate penalties. The penalties are positive values, indicating a distance or dissimilarity.

In Swarm I think only a small part of the score matrices are used, the "upper left" corner of 5 x 5 values. I do not think what is outside matters. Even the first (0) row and column is probably not used.

When the diagonal is set to zero, it is because it indicates a match. The remaining elements indicate the penalty for a mismatch. I am not sure why element (0, 0) is set, but I do not think it is used. It would indicate the penalty for aligning to gap symbols.

To avoid compiler warnings, the rest of the matrix should probably be filled with zeros.

I can confirm the the matrix seems correct.

frederic-mahe commented 10 months ago

Thanks! It is much clearer now.

I have a busy week ahead, but if time allows I'll try to:

Once everything is correct, I think we should make a new swarm release (version 3.1.4). I've already started to work on release notes on the dev branch.