Closed frederic-mahe closed 1 year ago
It's good that you spotted this potential problem. I'll have a look at it next week.
Regarding the secondary question, I had fun this week trying to refactor matrix.cc
. I ended up with a single header file matrix.h (on the dev
branch) containing a template function that creates std::array
score matrices of three different types (unsigned char, unsigned short and int64_t).
Most of the code is checked and optimized away at compile time, the remaining loop is vectorized, the creation-destruction of the arrays is managed automatically by the compiler (no more new
and delete
), and the code passes all clang-tidy checks (zero warning) :-)
matrix.h contains an example of a matrix as produced by the function. Could you please confirm that this is the expected outcome?
Regarding question 1: Yes, please go ahead and add a check for overflow.
Regarding question 2:
Here is some reasoning behind the score matrices:
Swarm only allows the nucleotide symbols A, C, G and T/U, and no ambiguous symbols. It generally uses a simple encoding where A=1, C=2, G=3, and T/U=4. This is shown in db.cc
. I do not think zero (0) is actually used, but it may used to indicate a gap (-
) in alignments.
When the sequences are stored in compressed form in memory using only 2 bits per nucleotide, the encoding A=0, C=1, G=2, and T/U=3 is used. It easy to convert between these two encodings by adding or subtracting 1.
The code for performing the SIMD alignments is similar in Swarm, VSEARCH, and SWIPE, and is based on score matrices that have 32 x 32 elements. This is done to accommodate other potential encodings, including amino acids, where the number of symbols can be 20 or more. Alternative nucleotide encodings use 16 different symbols, including -
(no residue) or N
(any residue) and all ambiguous symbols, for a total of 16 symbols. Additionally, it is best to keep the size to a power of 2. To be able to use more or less the same SIMD code (which is hard to modify), the size of 32 x 32 is used.
In Swarm, where global alignments are performed, we found it easier to just look at the differences between sequences, measuring the distance between the sequences, in contrast to local alignments where the amount of similarity is measured. The score matrices therefore contain scores that indicate the mismatches or differences. The given gap penalties and match and mismatches scores are therefore converted to a system were only penalties for mismatches and gaps are used, in order to find the same actual alignments. Matches therefore have zero (0) score, while gaps and mismatches indicate penalties. The penalties are positive values, indicating a distance or dissimilarity.
In Swarm I think only a small part of the score matrices are used, the "upper left" corner of 5 x 5 values. I do not think what is outside matters. Even the first (0) row and column is probably not used.
When the diagonal is set to zero, it is because it indicates a match. The remaining elements indicate the penalty for a mismatch. I am not sure why element (0, 0) is set, but I do not think it is used. It would indicate the penalty for aligning to gap symbols.
To avoid compiler warnings, the rest of the matrix should probably be filled with zeros.
I can confirm the the matrix seems correct.
Thanks! It is much clearer now.
I have a busy week ahead, but if time allows I'll try to:
Once everything is correct, I think we should make a new swarm
release (version 3.1.4). I've already started to work on release notes on the dev
branch.
The score tables produced by
matrix.cc
are good candidates for a conversion tostd::array
. While looking at the code inscore_matrix_read()
, I've noticed that there is no check that thepenalty_mismatch
value actually fits in 16-bit and 8-bit unsigned integer types.The way
penalty_mismatch
is computed is a bit complicated, but when all other pairwise alignment options are left untouched (default values), predictablepenalty_mismatch
values can be obtained by setting the option--mismatch-penalty
to 7 + 12a, wherea
is a positive integer.For instance:
--mismatch-penalty
= 7,penalty_mismatch
= 10 + 2 * 7 = 24--mismatch-penalty
= 19,penalty_mismatch
= 10 + 2 * 19 = 48--mismatch-penalty
= 31,penalty_mismatch
= 10 + 2 * 31 = 72 …--mismatch-penalty
= 127,penalty_mismatch
= 10 + 2 * 127 = 264That last value is bigger than 255 and will overflow a 8-bit unsigned integer.
To check that, let's compile
swarm
without optimizations:In Emacs, run
gdb
:As expected, score values are set to character
\b
, which has the decimal value 8 (264 mod 256). The penalty for a mismatch is underestimated.A test could be added in
swarm.cc
, functionargs_check()
, at least to check ifparameters.penalty_mismatch
is lower than 256. @torognes Should I do that?Secondary question
score_matrix_read()
produces 32 by 32 tables looking like this 8 by 8 mock-up:score
,@torognes do you confirm that this is the expected outcome?