Closed yceh closed 1 year ago
Hi @yceh,
Thank you for submitting the bug report.
I could reduce the sequences to "GGCAAGAA" and "CGAAGC".
Alignment:
0 . :
GGCAAGAA---
| | |
--C--G-AAGC
Score Matrix:
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;0 ;-110;-168;-226;-284;-342;-400;-458;-516;
C;-110;-145;-255;-24 ;-134;-192;-250;-308;-366;
G;-168;34 ;-1 ;-111;-169;-227;-48 ;-158;-216;
A;-226;-76 ;-111;-153;34 ;-24 ;-134;97 ;-13 ;
A;-284;-134;-169;-250;-8 ;179 ;69 ;11 ;242 ;
G;-342;-140;10 ;-100;-118;69 ;323 ;213 ;155 ;
C;-400;-250;-100;154 ;44 ;11 ;213 ;171 ;74 ;
Trace Matrix:
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;N ;L ;l ;l ; ; ; ; ; ;
C;U ;DUL ;DUL ;DUl ;L ;l ;l ;l ;l ;
G;u ;DUL ;DuL ;L ;l ;l ;DUl ;L ;l ;
A;u ;UL ;UL ;DuL ;DUL ;DUL ;l ;DUl ;DUL ;
A;u ;uL ;uL ;uL ;DUl ;DUL ;L ;DUl ;DUl ;
G;u ;DuL ;DuL ;L ;Ul ;Ul ;DUL ;L ;l ;
C;u ;uL ;UL ;DUL ;L ;ul ;Ul ;DUL ;uL ;
Path:
u;u;u;d;l;d;l;l;d;l;l;
Score Matrix (Just Path):
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;0 ;-110;-168; ; ; ; ; ; ;
C; ; ; ;-24 ;-134;-192; ; ; ;
G; ; ; ; ; ; ;-48 ;-158; ;
A; ; ; ; ; ; ; ; ;-13 ;
A; ; ; ; ; ; ; ; ;242 ;
G; ; ; ; ; ; ; ; ;155 ;
C; ; ; ; ; ; ; ; ;74 ;
Trace Matrix (Just Path):
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;N ;L ;l ; ; ; ; ; ; ;
C; ; ; ;D ;L ;l ; ; ; ;
G; ; ; ; ; ; ;D ;L ; ;
A; ; ; ; ; ; ; ; ;D ;
A; ; ; ; ; ; ; ; ;U ;
G; ; ; ; ; ; ; ; ;u ;
C; ; ; ; ; ; ; ; ;u ;
seq1 seq2 state q_pos r_pos score
G - INS 0 1 -110 SAME SCORE
G - INS 0 2 -168 SAME SCORE
C C MAT 1 3 -24 SAME SCORE
A - INS 1 4 -134 SAME SCORE
A - INS 1 5 -192 SAME SCORE
G G MAT 2 6 -48 SAME SCORE
A - INS 2 7 -158 SAME SCORE
A A MAT 3 8 -13 SAME SCORE
- A DEL 4 8 -123 DIFFERENT SCORE
- G DEL 5 8 -181 DIFFERENT SCORE
- C DEL 6 8 -239 DIFFERENT SCORE
If you add
#include <seqan3/alignment/aligned_sequence/debug_stream_alignment.hpp>
you can print the alignment:
auto & res = *results.begin();
seqan3::debug_stream << "alignment:\n" << res << "\n";
I found that changing the scoring scheme type to int8_t
(default) results in the same score.
seqan3::nucleotide_scoring_scheme<int8_t>
Mind that with int8_t
the scores underflow. Haven't yet tried changing the scores to something in [-128,127]
Something goes wrong with the traceback. The score is actually correct.
0 . :
GGCAAGAA---
| | |
--C--G-AAGC
The score for this alignment is -239.
Is not optimal, because you could align both A:
0 . :
GGCAAGAA--
| | |
--C--GAAGC
The score for this alignment is 74
.
https://godbolt.org/z/qrePrW67Y
OK, I did some digging.
The Gotoh paper states (section 3 is Traceback
):
The meaning of the parameters is not that explicitly stated in the paper, but my interpretation is:
u
is the gap cost L
seems to refer to the gap weighing function, we have gap open and extension, hence L=2
max d(x,y)
is the maximum score, a mismatch has a positive scoreThe equation seems to be equivalent to gap_open + 2 * gap_extension >= mismatch_score
in our implementation.
E.g.,
nu_scheme.set_simple_scheme(seqan3::match_score{145}, seqan3::mismatch_score{-144});
int const gap_open_cost = -100;
int const gap_extend_cost = -22; // works
int const gap_extend_cost = -23; // does not work
nu_scheme.set_simple_scheme(seqan3::match_score{145}, seqan3::mismatch_score{-146});
int const gap_open_cost = -100;
int const gap_extend_cost = -22; // works
int const gap_extend_cost = -23; // works
In the example, the scores are more sophisticated, I think the most safe would be to go with the most positive mismatch
score. However, it works for a gap_extension of 46
, but not 47
, so only certain costs are involved for this specific alignment.
All in all, it seems that the algorithm does not work because we are in violation of this assumption made by Gotoh.
Edit:
2 * gap_open < max_mismatch
, i.e. 2 gaps are more expensive than a mismatch Hey, so I looked at it today and the problem was that on very few occasions the original up open
signal was overwritten by a better score coming from the left
.
This happened here in the cell [c8:r5], where extending the gap in horizontal direction outscores the diagonal and vertical direction.
However, the optimal alignment comping from up
ended in a gap open which was not captured before.
I fixed this in #3098 by silently tracking the up open direction in case the left direction dominates.
We merged https://github.com/seqan/seqan3/pull/3098 which should fix this issue. Please re-open if the issue has not been resolved for you.
Does this problem persist on the current master?
Is there an existing issue for this?
Current Behavior
For the example attached, seqan report an alignment score of 34240, but the score of the alignment returned is 34202.
Expected Behavior
The reported alignment score and the score calculated from the returned alignment should match.
The attached code prints the alignment score at each column, for a majority of other alignments, the alignment score at the last column matches the alignment score reported by seqan, but it is still entirely possible that I didn't calculate the score correctly or I misunderstood the doc. (and for this particular example, they will match if I trim away the last nucleotide of both reference and query)
Steps To Reproduce
The following code should produce this issue:
Environment
Anything else?
No response