seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
408 stars 83 forks source link

[Alignment] Wrong alignment score #3043

Closed yceh closed 1 year ago

yceh commented 2 years ago

Does this problem persist on the current master?

Is there an existing issue for this?

Current Behavior

For the example attached, seqan report an alignment score of 34240, but the score of the alignment returned is 34202.

Expected Behavior

The reported alignment score and the score calculated from the returned alignment should match.

The attached code prints the alignment score at each column, for a majority of other alignments, the alignment score at the last column matches the alignment score reported by seqan, but it is still entirely possible that I didn't calculate the score correctly or I misunderstood the doc. (and for this particular example, they will match if I trim away the last nucleotide of both reference and query)

Steps To Reproduce

The following code should produce this issue:

#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <utility>

#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/alignment/configuration/all.hpp>
#include <vector>

int main(int argc,char** argv){
    using namespace seqan3::literals;
    seqan3::dna4_vector ref="CCCTATCGTAGGTTGTCAACCACCAGGTATACTTGAGGATCCTTCGTCGGACATTTATCACGTATCTCATATAGGTTGCTACTCATGGGACAGTATGGCTACTGGCAGGGTTGCAAAGTTATTCCATTCTCTACGAGTTCACGTAAGGTTTGGATGCTACTCACCAGATATCTGTTAAATGGCGCCCCGCGTATACCGTGGTTCCCAACTTAGCAAGTTTATTTAATGGCAAGCATGGAGCCTCGGACCCTCCCGGAGGACTTAAGTAGGATGGTAGCGCTGCGGTTGCGTGACGACCCGTCGACGTAAAAACATTTTCTTTACGGGAAGCTAGACTACGGACGCTGTCACAAAATATCGGAAGTGACGCGCCTGGCATATATCCTTGGCGTTAACGGAACCCCCAGTGTGACGTTATTGCGGGATCTTATAATGGAGTTGTGAGCAGACCGCTTGTTTGTTCGACCGGGCAAGGGAAAAAAAAAAAAAAAAAAAAAAAA"_dna4;;
    seqan3::dna4_vector query="GCCCCACTCGTAGAGTGTCCACCAGGTACTAGGAAGACGCTTCATGCGCTTTTCCTACATGAATTGGCAGAATTCATGGACAGTTCAATGGCACTGGACAGTGTTCTTTGGCTTAAAGCTTGATATCCGAATTTCTCTACTGAGTCTGTCCTGTAAGGTTGAGTGTCGGATGCTCTCATCATATCTGTACATGGAACGCGTCCCCCGGTATCCTGCGGTCCCCAACTGTACAGTTATTTTATGTGCAACAGCATTCGGAGGCCCTCGACCTCCTGCGACTTAAGTAGAGGCGCCATGCGTGTGAATCGACACCTACCGCGACTAAAAAACTTTACTCTGGGCACGGGATACACTACGGAACCTGTCACTAAATGACGGAGAAGGCTGATGAATCCGCGGCAACCTACTAGTTAAGGCGGAGAAAGGAGACAGGTTTGCGGGATTTCATATTGGGAAGAGTTAGTGACAGCCCTTTTTGTTCAATATCCCCGCGAAAGGCT"_dna4;

    seqan3::dna4 letter;

    seqan3::nucleotide_scoring_scheme<int> nu_scheme;
    nu_scheme.score('A'_dna4,'A'_dna4)=145;
    nu_scheme.score('A'_dna4,'T'_dna4)=-144;
    nu_scheme.score('A'_dna4,'G'_dna4)=-153;
    nu_scheme.score('A'_dna4,'C'_dna4)=-152;

    nu_scheme.score('T'_dna4,'A'_dna4)=-144;
    nu_scheme.score('T'_dna4,'T'_dna4)=144;
    nu_scheme.score('T'_dna4,'G'_dna4)=-143;
    nu_scheme.score('T'_dna4,'C'_dna4)=-140;

    nu_scheme.score('G'_dna4,'A'_dna4)=-153;
    nu_scheme.score('G'_dna4,'T'_dna4)=-143;
    nu_scheme.score('G'_dna4,'G'_dna4)=144;
    nu_scheme.score('G'_dna4,'C'_dna4)=-145;

    nu_scheme.score('C'_dna4,'A'_dna4)=-152;
    nu_scheme.score('C'_dna4,'T'_dna4)=-140;
    nu_scheme.score('C'_dna4,'G'_dna4)=-145;
    nu_scheme.score('C'_dna4,'C'_dna4)=144;

    int gap_open_cost=-52;
    int gap_extend_cost=-58;
    auto new_gap_cost=gap_extend_cost+gap_open_cost;
    // Configure the alignment kernel.
    auto config =
        seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{nu_scheme}|
        seqan3::align_cfg::gap_cost_affine{
            seqan3::align_cfg::open_score{gap_open_cost},
            seqan3::align_cfg::extension_score{gap_extend_cost}};

    auto results = seqan3::align_pairwise(std::tie(ref, query), config);
    auto & res = *results.begin();
    seqan3::debug_stream<<"aligned_ref:\n"<<std::get<0>(res.alignment())<<"\n";
    seqan3::debug_stream<<"aligned_query:\n"<<std::get<1>(res.alignment())<<"\n";

      auto seq1 = std::get<0>(res.alignment());
      auto seq2_iter = std::get<1>(res.alignment()).begin();

      int score_so_far = 0;
      int q_pos = 0;
      int r_pos = 0;
      enum {M,I,D} state=M;
      seqan3::debug_stream<<"seq1\tseq2\tq_pos\tr_pos\tscore\n";
      for (auto seq1_iter = seq1.begin(); seq1_iter < seq1.end(); seq1_iter++) {
        if (*seq1_iter == seqan3::gap{}) {
          q_pos++;
          if (state==D) {
            score_so_far += gap_extend_cost;
          } else {
            score_so_far += new_gap_cost;
          }
          state=D;
        } else if (*seq2_iter == seqan3::gap{}) {
          r_pos++;
          if (state==I) {
            score_so_far += gap_extend_cost;
          } else {
            score_so_far += new_gap_cost;
          }
          state=I;
        } else {
          r_pos++;
          q_pos++;
          seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq1_ele =
              *seq1_iter;
          seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq2_ele =
              *seq2_iter;
          score_so_far += nu_scheme.score(seq1_ele.convert_to<seqan3::dna4>(),
                                          seq2_ele.convert_to<seqan3::dna4>());
          state=M;
        }
        seqan3::debug_stream<<*seq1_iter<<'\t'<<*seq2_iter<<'\t'<<q_pos<<'\t'<<r_pos<<'\t'<<score_so_far<<'\n';
        seq2_iter++;
      }
      seqan3::debug_stream<<"alignemnt score in result:"<<res.score()<<"\n";
      return 0;
}

Environment

- Operating system:Ubuntu 20.04
- SeqAn version:e51658dfa5ae7ee932cecbc41eb4d16c710d1f07
- Compiler:gcc 10.3.0

Anything else?

No response

marehr commented 2 years ago

Hi @yceh,

Thank you for submitting the bug report.

I could reduce the sequences to "GGCAAGAA" and "CGAAGC".

Alignment:

    0     .    :
      GGCAAGAA---
        |  | |
      --C--G-AAGC

Score Matrix:

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;0   ;-110;-168;-226;-284;-342;-400;-458;-516;
C;-110;-145;-255;-24 ;-134;-192;-250;-308;-366;
G;-168;34  ;-1  ;-111;-169;-227;-48 ;-158;-216;
A;-226;-76 ;-111;-153;34  ;-24 ;-134;97  ;-13 ;
A;-284;-134;-169;-250;-8  ;179 ;69  ;11  ;242 ;
G;-342;-140;10  ;-100;-118;69  ;323 ;213 ;155 ;
C;-400;-250;-100;154 ;44  ;11  ;213 ;171 ;74  ;

Trace Matrix:

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;N   ;L   ;l   ;l   ;    ;    ;    ;    ;    ;
C;U   ;DUL ;DUL ;DUl ;L   ;l   ;l   ;l   ;l   ;
G;u   ;DUL ;DuL ;L   ;l   ;l   ;DUl ;L   ;l   ;
A;u   ;UL  ;UL  ;DuL ;DUL ;DUL ;l   ;DUl ;DUL ;
A;u   ;uL  ;uL  ;uL  ;DUl ;DUL ;L   ;DUl ;DUl ;
G;u   ;DuL ;DuL ;L   ;Ul  ;Ul  ;DUL ;L   ;l   ;
C;u   ;uL  ;UL  ;DUL ;L   ;ul  ;Ul  ;DUL ;uL  ;

Path:

u;u;u;d;l;d;l;l;d;l;l;

Score Matrix (Just Path):

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;0   ;-110;-168;    ;    ;    ;    ;    ;    ;
C;    ;    ;    ;-24 ;-134;-192;    ;    ;    ;
G;    ;    ;    ;    ;    ;    ;-48 ;-158;    ;
A;    ;    ;    ;    ;    ;    ;    ;    ;-13 ;
A;    ;    ;    ;    ;    ;    ;    ;    ;242 ;
G;    ;    ;    ;    ;    ;    ;    ;    ;155 ;
C;    ;    ;    ;    ;    ;    ;    ;    ;74  ;

Trace Matrix (Just Path):

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;N   ;L   ;l   ;    ;    ;    ;    ;    ;    ;
C;    ;    ;    ;D   ;L   ;l   ;    ;    ;    ;
G;    ;    ;    ;    ;    ;    ;D   ;L   ;    ;
A;    ;    ;    ;    ;    ;    ;    ;    ;D   ;
A;    ;    ;    ;    ;    ;    ;    ;    ;U   ;
G;    ;    ;    ;    ;    ;    ;    ;    ;u   ;
C;    ;    ;    ;    ;    ;    ;    ;    ;u   ;
seq1    seq2    state   q_pos   r_pos   score
G   -   INS 0   1   -110    SAME SCORE
G   -   INS 0   2   -168    SAME SCORE
C   C   MAT 1   3   -24 SAME SCORE
A   -   INS 1   4   -134    SAME SCORE
A   -   INS 1   5   -192    SAME SCORE
G   G   MAT 2   6   -48 SAME SCORE
A   -   INS 2   7   -158    SAME SCORE
A   A   MAT 3   8   -13 SAME SCORE
-   A   DEL 4   8   -123    DIFFERENT SCORE
-   G   DEL 5   8   -181    DIFFERENT SCORE
-   C   DEL 6   8   -239    DIFFERENT SCORE
eseiler commented 2 years ago

If you add

#include <seqan3/alignment/aligned_sequence/debug_stream_alignment.hpp>

you can print the alignment:

auto & res = *results.begin();
seqan3::debug_stream << "alignment:\n" << res << "\n";
eseiler commented 2 years ago

I found that changing the scoring scheme type to int8_t (default) results in the same score.

seqan3::nucleotide_scoring_scheme<int8_t>

Mind that with int8_t the scores underflow. Haven't yet tried changing the scores to something in [-128,127]

eseiler commented 2 years ago

Something goes wrong with the traceback. The score is actually correct.

    0     .    :
      GGCAAGAA---
        |  | |
      --C--G-AAGC

The score for this alignment is -239.

Is not optimal, because you could align both A:

    0     .    :
      GGCAAGAA--
        |  | |
      --C--GAAGC

The score for this alignment is 74.

eseiler commented 2 years ago

https://godbolt.org/z/qrePrW67Y

Some debug code ```cpp #include #include #include #include #include #include #include #include #include #include int main(int argc, char ** argv) { using namespace seqan3::literals; seqan3::dna4_vector ref = "GGCAAGAA"_dna4; seqan3::dna4_vector query = "CGAAGC"_dna4; seqan3::dna4 letter; seqan3::nucleotide_scoring_scheme nu_scheme; nu_scheme.score('A'_dna4, 'A'_dna4) = 145; nu_scheme.score('A'_dna4, 'T'_dna4) = -144; nu_scheme.score('A'_dna4, 'G'_dna4) = -153; nu_scheme.score('A'_dna4, 'C'_dna4) = -152; nu_scheme.score('T'_dna4, 'A'_dna4) = -144; nu_scheme.score('T'_dna4, 'T'_dna4) = 144; nu_scheme.score('T'_dna4, 'G'_dna4) = -143; nu_scheme.score('T'_dna4, 'C'_dna4) = -140; nu_scheme.score('G'_dna4, 'A'_dna4) = -153; nu_scheme.score('G'_dna4, 'T'_dna4) = -143; nu_scheme.score('G'_dna4, 'G'_dna4) = 144; nu_scheme.score('G'_dna4, 'C'_dna4) = -145; nu_scheme.score('C'_dna4, 'A'_dna4) = -152; nu_scheme.score('C'_dna4, 'T'_dna4) = -140; nu_scheme.score('C'_dna4, 'G'_dna4) = -145; nu_scheme.score('C'_dna4, 'C'_dna4) = 144; int gap_open_cost = -52; int gap_extend_cost = -58; auto new_gap_cost = gap_extend_cost + gap_open_cost; // Configure the alignment kernel. auto config = seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{nu_scheme} | seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{gap_open_cost}, seqan3::align_cfg::extension_score{gap_extend_cost}} | seqan3::align_cfg::detail::debug{}; auto results = seqan3::align_pairwise(std::tie(ref, query), config); auto & res = *results.begin(); seqan3::debug_stream << res.alignment() << std::endl; seqan3::debug_stream << seqan3::detail::debug_matrix{res.score_matrix()} << std::endl; seqan3::debug_stream << seqan3::detail::debug_matrix{res.trace_matrix()} << std::endl; // seqan3::debug_stream<<"aligned_ref:\n"<(res.alignment())<<"\n"; // seqan3::debug_stream<<"aligned_query:\n"<(res.alignment())<<"\n"; auto seq1 = std::get<0>(res.alignment()); auto seq2_iter = std::get<1>(res.alignment()).begin(); int score_so_far = 0; int q_pos = 0; int r_pos = 0; enum { M, I, D } state = M; seqan3::debug_stream << "seq1\tseq2\tstate\tq_pos\tr_pos\tscore\n"; for (auto seq1_iter = seq1.begin(); seq1_iter < seq1.end(); seq1_iter++) { if (*seq1_iter == seqan3::gap{}) { q_pos++; if (state == D) { score_so_far += gap_extend_cost; } else { score_so_far += new_gap_cost; } state = D; } else if (*seq2_iter == seqan3::gap{}) { r_pos++; if (state == I) { score_so_far += gap_extend_cost; } else { score_so_far += new_gap_cost; } state = I; } else { r_pos++; q_pos++; seqan3::alphabet_variant seq1_ele = *seq1_iter; seqan3::alphabet_variant seq2_ele = *seq2_iter; score_so_far += nu_scheme.score(seq1_ele.convert_to(), seq2_ele.convert_to()); state = M; } seqan3::debug_stream << *seq1_iter << '\t' << *seq2_iter << '\t' << (state == D ? "DEL" : (state == I ? "INS" : "MAT")) << '\t' << q_pos << '\t' << r_pos << '\t' << score_so_far << '\n'; seq2_iter++; } seqan3::debug_stream << "alignemnt score in result:" << res.score() << ", " << score_so_far << "\n"; return res.score() == score_so_far ? 0 : -1; } ``` Alignment: ``` 0 . : GGCAAGAA--- | | | --C--G-AAGC ``` Score Matrix: ``` ; ;G ;G ;C ;A ;A ;G ;A ;A ; ;0 ;-110;-168;-226;-284;-342;-400;-458;-516; C;-110;-145;-255;-24 ;-134;-192;-250;-308;-366; G;-168;34 ;-1 ;-111;-169;-227;-48 ;-158;-216; A;-226;-76 ;-111;-153;34 ;-24 ;-134;97 ;-13 ; A;-284;-134;-169;-250;-8 ;179 ;69 ;11 ;242 ; G;-342;-140;10 ;-100;-118;69 ;323 ;213 ;155 ; C;-400;-250;-100;154 ;44 ;11 ;213 ;171 ;74 ; ``` Trace Matrix: ``` ; ;G ;G ;C ;A ;A ;G ;A ;A ; ;N ;L ;l ;l ; ; ; ; ; ; C;U ;DUL ;DUL ;DUl ;L ;l ;l ;l ;l ; G;u ;DUL ;DuL ;L ;l ;l ;DUl ;L ;l ; A;u ;UL ;UL ;DuL ;DUL ;DUL ;l ;DUl ;DUL ; A;u ;uL ;uL ;uL ;DUl ;DUL ;L ;DUl ;DUl ; G;u ;DuL ;DuL ;L ;Ul ;Ul ;DUL ;L ;l ; C;u ;uL ;UL ;DUL ;L ;ul ;Ul ;DUL ;uL ; ``` Path: ``` u;u;u;d;l;d;l;l;d;l;l; ``` Score Matrix (Just Path): ``` ; ;G ;G ;C ;A ;A ;G ;A ;A ; ;0 ;-110;-168; ; ; ; ; ; ; C; ; ; ;-24 ;-134;-192; ; ; ; G; ; ; ; ; ; ;-48 ;-158; ; A; ; ; ; ; ; ; ; ;-13 ; A; ; ; ; ; ; ; ; ;242 ; G; ; ; ; ; ; ; ; ;155 ; C; ; ; ; ; ; ; ; ;74 ; ``` Trace Matrix (Just Path): ``` ; ;G ;G ;C ;A ;A ;G ;A ;A ; ;N ;L ;l ; ; ; ; ; ; ; C; ; ; ;D ;L ;l ; ; ; ; G; ; ; ; ; ; ;D ;L ; ; A; ; ; ; ; ; ; ; ;D ; A; ; ; ; ; ; ; ; ;U ; G; ; ; ; ; ; ; ; ;U ; C; ; ; ; ; ; ; ; ;u ; ``` ``` seq1 seq2 state q_pos r_pos score G - INS 0 1 -110 SAME SCORE G - INS 0 2 -168 SAME SCORE C C MAT 1 3 -24 SAME SCORE A - INS 1 4 -134 SAME SCORE A - INS 1 5 -192 SAME SCORE G G MAT 2 6 -48 SAME SCORE A - INS 2 7 -158 SAME SCORE A A MAT 3 8 -13 SAME SCORE - A DEL 4 8 -123 DIFFERENT SCORE - G DEL 5 8 -181 DIFFERENT SCORE - C DEL 6 8 -239 DIFFERENT SCORE ```
eseiler commented 1 year ago

OK, I did some digging. The Gotoh paper states (section 3 is Traceback):

gotoh

The meaning of the parameters is not that explicitly stated in the paper, but my interpretation is:

The equation seems to be equivalent to gap_open + 2 * gap_extension >= mismatch_score in our implementation.

E.g.,


nu_scheme.set_simple_scheme(seqan3::match_score{145}, seqan3::mismatch_score{-144});
int const gap_open_cost = -100;
int const gap_extend_cost = -22; // works
int const gap_extend_cost = -23; // does not work

nu_scheme.set_simple_scheme(seqan3::match_score{145}, seqan3::mismatch_score{-146});
int const gap_open_cost = -100;
int const gap_extend_cost = -22; // works
int const gap_extend_cost = -23; // works

In the example, the scores are more sophisticated, I think the most safe would be to go with the most positive mismatch score. However, it works for a gap_extension of 46, but not 47, so only certain costs are involved for this specific alignment.

All in all, it seems that the algorithm does not work because we are in violation of this assumption made by Gotoh.

Edit:

rrahn commented 1 year ago

Hey, so I looked at it today and the problem was that on very few occasions the original up open signal was overwritten by a better score coming from the left. This happened here in the cell [c8:r5], where extending the gap in horizontal direction outscores the diagonal and vertical direction. However, the optimal alignment comping from up ended in a gap open which was not captured before. I fixed this in #3098 by silently tracking the up open direction in case the left direction dominates.

eseiler commented 1 year ago

We merged https://github.com/seqan/seqan3/pull/3098 which should fix this issue. Please re-open if the issue has not been resolved for you.