torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
123 stars 23 forks source link

Incorrect results with long sequences, very high gap penalties, and d>1 #160

Closed torognes closed 3 years ago

torognes commented 3 years ago

Tests have revealed that swarm 3.0.0 produces incorrect results on the ARM64 architecture (both Linux and Mac) with long sequences and d>2. For example, 2 sequences of 2515 bp (or longer) with a single substitution forms 2 instead of 1 cluster when d=2 or higher. Other architectures are not affected, and d=1 works well.

frederic-mahe commented 3 years ago

good catch! I've just pushed a new test for that particular case (https://github.com/frederic-mahe/swarm-tests/commit/7fbd5feef4466aeb6355bd5c4d13475775e09e1a). If you create more toy examples, I'd be happy to add them to our test suite.

torognes commented 3 years ago

Thanks. The bug was actually detected by the existing test that analyses 5000 bp sequences and d=2, the penultimate test in test_input.sh

There was a typo in my previous comment. The bug appears when d>1.

frederic-mahe commented 3 years ago

There was a typo in my test too :-) I'll keep the new test, even if it is redundant. It is written slightly differently and it can potentially trigger another kind of bug.

torognes commented 3 years ago

After further investigations, it seems like this bug is unfortunately present also on other architectures, including x86_64 at least, but only appear when 16-bit alignments are used (in search16.cc) on sequences of at least 2515 bp. Swarm on ARM64 always uses 16-bit alignments and the bug was therefore more visible there, otherwise 16-bit alignments are only used with very high gap penalties. The 16 bits refer to the width of the register used for computing the alignment scores. Still only applies to d>1.

torognes commented 3 years ago

The error seems to have been introduced in version 2.1.11 (commit 9db1684) from January 16th, 2017. Version 2.1.10 appears ok.

After further investigations, it seems like older versions have problems too, but perhaps with even longer sequences.

torognes commented 3 years ago

The bug seems to be triggered by this example command line with rather high gap penalties:

swarm test.fasta -d 2 -g 50 -e 20

with the input file below consisting of two sequences of 1122 bp each with only a single difference:

>a_1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa
>c_2
caaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa

The result is 2 clusters instead of 1.

torognes commented 3 years ago

I think the error must be caused by undetected overflow in 16 bit alignment scores in the SIMD code.

torognes commented 3 years ago

Should be fixed in commit 729c50054b28cf87961ab5efd50de167eca855b8. Also cleaned up the use of intrinsics and separated code using ssse3, sse41 and popcnt into separated files for compilation on x86_64.

frederic-mahe commented 3 years ago

All tests are green, nice work! I think that issue can be closed.