torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Alignments use a slightly too large gap extension penalty when d>1 #95

Closed torognes closed 7 years ago

torognes commented 7 years ago

There is a bug in Swarm that affects the gap extension penalty in alignments. It comes into effect when d>1. An error in the conversion of the scoring system leads Swarm to use, in effect, a slightly higher gap extension penalty than specified. With the default scoring system (match score 5, mismatch score -4, gap open penalty 12, gap extension penalty 4), the effective gap extension penalty is 4.5. We are working on a solution.

torognes commented 7 years ago

This issue has been resolved in Swarm version 2.1.10, which has just been released.

frederic-mahe commented 7 years ago

Hi @torognes, would you happen to have a toy example I could use to make a regression test for that particular issue?

torognes commented 7 years ago

This bug was described by Robert Mueller in his email of 8 December 2016. He wrote:

The second issue concerns the way in which you transform the scoring function. [...], it looks like the formula for penalty_gapextend is slightly wrong. Instead of (2 * matchscore + gapextend) / penalty_factor it should rather be (matchscore + 2 * gapextend) / penalty_factor. All other parts of the transformation seem fine. I can also support my point with an example. Consider the sequences 'ctattgttgtc' and 'tctatgtgtct'. According to your transformation, swarm's default scoring function (5,4,12,4) is transformed into (0,9,12,7), while my transformation leads to (0,18,24,13) (which can be obtained with swarm by changing gapextend to 3). Only swarm using (0,9,12,7) after transformation states that the optimal alignment involves 8 differences (it simply (mis)matches all symbols, no indels). My own algorithms using (5,4,12,4) resp. (0,18,24,13), as well as swarm using (0,18,24,13) after transformation, all agree on an optimal alignment with only 5 differences, such as

    -ctattgttgtc-
    tctat--gtgtct
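The numbers in the quoted email can be reproduced with a short Python sketch. This is not Swarm's code: the function names are hypothetical, and it assumes that all penalties are first doubled and that penalty_factor is the greatest common divisor of the three doubled penalties, which is consistent with the figures quoted above but is not stated in the email.

```python
from fractions import Fraction
from math import gcd


def transform(match, mismatch, gap_open, gap_extend, buggy=False):
    """Convert (match, mismatch, gap open, gap extend) into a
    minimisation scheme (0, mismatch', gap open', gap extend').

    Assumption: penalties are doubled, then divided by their gcd
    (the "penalty_factor" of the email).
    """
    mm = 2 * (match + mismatch)
    go = 2 * gap_open
    if buggy:
        ge = 2 * match + gap_extend      # formula reported as wrong
    else:
        ge = match + 2 * gap_extend      # formula proposed in the email
    factor = gcd(gcd(mm, go), ge)
    return (0, mm // factor, go // factor, ge // factor)


def effective_gap_extend(match, gap_extend):
    """Gap extension penalty x that the buggy formula really applies:
    solve match + 2*x = 2*match + gap_extend for x."""
    return Fraction(match + gap_extend, 2)


if __name__ == "__main__":
    print(transform(5, 4, 12, 4, buggy=True))   # (0, 9, 12, 7)
    print(transform(5, 4, 12, 4))               # (0, 18, 24, 13)
    print(transform(5, 4, 12, 3, buggy=True))   # (0, 18, 24, 13)
    print(float(effective_gap_extend(5, 4)))    # 4.5
```

Under these assumptions, the buggy formula with gapextend set to 3 gives the same transformed scheme as the corrected formula with gapextend 4, which matches the workaround mentioned in the email, and the effective penalty of 4.5 matches the figure in the original report.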

The sequences 'ctattgttgtc' and 'tctatgtgtct' should work as a toy example. The resulting alignment and score should indicate whether the bug is present or not.
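As a starting point for such a regression test, here is a minimal Python sketch of a global alignment with affine gap penalties that reports the optimal cost and the number of differences. It is an independent illustration, not Swarm's implementation. It assumes that a gap of length k costs gap_open + k * gap_extend and that end gaps are penalised, which is consistent with the figures in the email. The function name `align` is hypothetical.

```python
INF = (float("inf"), 0)


def _add(a, b):
    """Add two (cost, differences) pairs component-wise."""
    return (a[0] + b[0], a[1] + b[1])


def align(a, b, mismatch, gap_open, gap_extend):
    """Return (cost, differences) of an optimal global alignment.

    Matches cost 0. Ties in cost are broken by fewer differences.
    Three-state dynamic programming in the style of Gotoh:
      M: last column is an aligned pair,
      X: last column is a symbol of `a` against a gap,
      Y: last column is a symbol of `b` against a gap.
    """
    n, m = len(a), len(b)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = (0, 0)
    for i in range(1, n + 1):
        X[i][0] = (gap_open + i * gap_extend, i)
    for j in range(1, m + 1):
        Y[0][j] = (gap_open + j * gap_extend, j)

    opening = (gap_open + gap_extend, 1)   # first column of a new gap
    extending = (gap_extend, 1)            # further column of a gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = (0, 0) if a[i - 1] == b[j - 1] else (mismatch, 1)
            diag = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = _add(diag, sub)
            X[i][j] = min(_add(M[i - 1][j], opening),
                          _add(Y[i - 1][j], opening),
                          _add(X[i - 1][j], extending))
            Y[i][j] = min(_add(M[i][j - 1], opening),
                          _add(X[i][j - 1], opening),
                          _add(Y[i][j - 1], extending))
    return min(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    s1, s2 = "ctattgttgtc", "tctatgtgtct"
    # Transformed penalties taken from the email quoted above.
    print("buggy:", align(s1, s2, 9, 12, 7))     # 8 differences expected
    print("fixed:", align(s1, s2, 18, 24, 13))   # 5 differences expected
```

With the buggy penalties the ungapped alignment (8 mismatches) is optimal, while the corrected penalties favour the gapped alignment with 5 differences, so the difference count alone distinguishes the two behaviours.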