rvaser / spoa

SIMD partial order alignment tool/library
MIT License
158 stars 32 forks source link

Unexpected alignment results #63

Closed paoloshasta closed 1 year ago

paoloshasta commented 1 year ago

The attached program uses spoa to compute an MSA of 6 sequences. The sequences only differ in the length of a homopolymer run consisting of a number of T's between 14 and 19. Each sequence is entered in the alignment with a weight. The sequences are entered in order of decreasing weight.

The input sequences and weights are as follows:

    {"GACAACCTGTTTTTTTTTTTTTTTTGAGA", 6},    // T16, weight 6
    {"GACAACCTGTTTTTTTTTTTTTTTGAGA", 5},     // T15, weight 5
    {"GACAACCTGTTTTTTTTTTTTTTGAGA", 4},      // T14, weight 4
    {"GACAACCTGTTTTTTTTTTTTTTTTTTGAGA", 3},  // T18, weight 3
    {"GACAACCTGTTTTTTTTTTTTTTTTTTTGAGA", 2}, // T19, weight 2
    {"GACAACCTGTTTTTTTTTTTTTTTTTGAGA", 2},   // T17, weight 2

The computed consensus is

GACAACCTGTTTTTTTTTTTTTTTTTTGAGA

This has T18. This is unexpected, as the sequence with T18 has weight 3, while sequences shorter than T18 have total weight 17, and sequences longer than T18 have weight just 2. The weighted average of the input lengths of the homopolymer runs is 16.05. So I would have expected the consensus to have T16, or possibly T17, but certainly not T18.

Any suggestion on how to improve on this result? The reason for entering the sequences in order of decreasing weight is the result of some discussion we had in another issue a long time ago (when I was operating as GitHub user @paoloczi).

See the attached code for more details. It is a modified version of the test program on your README page. This was built with spoa from tag 4.0.8 and the following command line:

g++ -std=c++17 spoaTest.cpp -lspoa

spoaTest.cpp.gz

ksahlin commented 1 year ago

I am interested in how this works too. From what I recall (long time ago, I may be wrong) SPOA simply traversed the path with the highest sum of weights.

Based on this I would assume (if alignments are left shifted in the homopolymer region) that the consensus would be 19T, since the path for 19T has T-2->T-2->G (sum of weight 4) in the end of the region, but the path for 18T has a single edge of weight 3.

My assumption that alignments of the homopolymers are all left shifted may not be the case here, based on the order they are added.

rvaser commented 1 year ago

I have attached the constructed graph (consensus in gold) and below is the MSA with node identifiers (red) for the Ts in question. The graph structure and edge weights correspond to the MSA (edge weight is the sum of both node weights). The heaviest bundle consensus algorithm goes from left to right over a topologically sorted graph, picking the best scoring ingoing edge without looking at the predecessor score (the predecessor score is only considered if there is a tie).

GACAACCTG ---TTT TTTTTTTTTTTTTGAGA 6
GACAACCTG ----TT TTTTTTTTTTTTTGAGA 5
GACAACCTG -----T TTTTTTTTTTTTTGAGA 4
GACAACCTG -TTTTT TTTTTTTTTTTTTGAGA 3
GACAACCTG TTTTTT TTTTTTTTTTTTTGAGA 2
GACAACCTG --TTTT TTTTTTTTTTTTTGAGA 2
          123456
GACAACCTG -TTTTT TTTTTTTTTTTTTGAGA Consensus

graph

For this particular case, right aligning gaps (or doing the consensus traversal in the opposite direction) should get the T16 consensus. I am not sure if there is a counter example.

GACAACCTGTTTTTTTTTTTTT TTT--- GAGA 6
GACAACCTGTTTTTTTTTTTTT TT---- GAGA 5
GACAACCTGTTTTTTTTTTTTT T----- GAGA 4
GACAACCTGTTTTTTTTTTTTT TTTTT- GAGA 3
GACAACCTGTTTTTTTTTTTTT TTTTTT GAGA 2
GACAACCTGTTTTTTTTTTTTT TTTT-- GAGA 2
                       123456
GACAACCTGTTTTTTTTTTTTT TTT--- GAGA Consensus

rgraph

rvaser commented 1 year ago

If the weight of the 3rd sequence is increased to 7 and we shift it to 1st place (avg |T| = 15.8), neither approach gets the T16 answer.

GACAACCTG -----T TTTTTTTTTTTTTGAGA 7
GACAACCTG ---TTT TTTTTTTTTTTTTGAGA 6
GACAACCTG ----TT TTTTTTTTTTTTTGAGA 5
GACAACCTG -TTTTT TTTTTTTTTTTTTGAGA 3
GACAACCTG TTTTTT TTTTTTTTTTTTTGAGA 2
GACAACCTG --TTTT TTTTTTTTTTTTTGAGA 2
          123456
GACAACCTG -TTTTT TTTTTTTTTTTTTGAGA Consensus

graph

GACAACCTGTTTTTTTTTTTTT T----- GAGA 7
GACAACCTGTTTTTTTTTTTTT TTT--- GAGA 6
GACAACCTGTTTTTTTTTTTTT TT---- GAGA 5
GACAACCTGTTTTTTTTTTTTT TTTTT- GAGA 3
GACAACCTGTTTTTTTTTTTTT TTTTTT GAGA 2
GACAACCTGTTTTTTTTTTTTT TTTT-- GAGA 2
                       123456
GACAACCTGTTTTTTTTTTTTT T----- GAGA Consensus

rgraph

paoloshasta commented 1 year ago

Thank you for the detailed analysis - this is very interesting and illustrates the complexity of what is going on here.

Is it possible to translate this into a general recipe for a more robust consensus computation?

rvaser commented 1 year ago

Not sure how to tweak the current algorithm for all the edge cases, maybe a different algorithm is better.

paoloshasta commented 1 year ago

Yes, I could use the details of the spoa alignment to compute a consensus myself, instead of using spoa::Graph::GenerateConsensus(). Thank you for your insight.