smarco / WFA2-lib

WFA-lib: Wavefront alignment algorithm library v2
Other
159 stars 36 forks source link

suboptimal alignment in endsfree mode when match score != 0 #102

Open joyeuxnoel8 opened 2 hours ago

joyeuxnoel8 commented 2 hours ago

Hi,

I notice that the program can return a suboptimal alignment in endsfree mode when match score != 0. Given a pattern (sequencing read) GGGGCGCGTCGGGCTCCGGGTGTGGGGGGGGTGTGGGGGGGGGGGGTGGTGTGTGGGGGTGTGGCTGGTGAATGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAGTGGGGTGAGGGTGGTGAATGGGGTGAGGGTGGTGAGTGGGGT, and text (repetitive DNA in tandem repeat region) CCTGAGGCCCCGGGTGTGGAGCGGAGGTGGACCAGAGGTGGACACAGACCCACGGGCCGCCAAGGCCCACCCAGGATCCCCCGGGGGCCATCCACATCTGGTAAAGCCGAGGTGTGGGCGGACCCCAGGAAGCAGCCCCCACCCCTGCCCCCAGTGGCTCAGGCCTGGGCAGAGAAAACAGGCCCAGCAGGGCGGCAGGGTGGGATCCCCACGATTCACCGAGGATGCGTCTTCCACAGGGAGAGTTTGGGGGAGCTGTGTGTGAAAATGTGAGTAACGTACATAAATCAGTATCACAGGAATCCAGGCGGGCGGAGGATGCATGACTGAACTTGGAGGACGCTCATCAGGGAGGTCAGTGCTCCCCTCCGGGGACAGGATCCTGCCTTCGCCTGGCCTGCGGGACAGGGCTCCCCTTGCCGGCCAGGGGCTACTGGCCACTGATGCTCACTTTGGGCTTCCGCCCCCCAGGGGAAGGGGTGCTGAGAGCCCCGTGTCCGGAGGGCTGGTGAGTGGGGCTGAGGCTGGTGGAGTGGGGGTGAGGCTGGTGAATGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGTGGGTGAGTGGTGAGTGGAGTGAGGGTGGTGAGTGGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGTGAGGGTGGTGAATGGGGTGAGGCTGGTGAATGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGGGTGAGGCTGGTGAGTGGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGGGTGAGGCTGGTGAGTGGGGGTGAGGGTGGTGAATGGAGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGCTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAATGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGTGAGGGTGGTGAATGGGGTGAGGGTGATGAGTGGAGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAATGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGTGAGGGTGGTGAATGGGGTGAGGCTGGTGAATGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGGGTGAGGCTGGTGAGTGGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGGGTGAGGCTGGTGAGTGGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAGTGGGGGTGAGGCTGGTGAATGGGGTGAGGGTGGTGAGTGGGGTGAGGCTGGTGAGTGGGGGTGAGGCTGGTGAATGGGGTGAGGGTGGTGAGTGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGGGTGAGGCTGGTGAGTGGGGGTGAGGGTGGTGAGTGGAGTGAGGGTGGTGAGTGGGGGTGAGGGTGGTGAGTGGGGGTGAGGCTGGTGAGTGGGGTGAGGGTAGTGGGTGGGGCTGAGGTTATTCCAGCCTCGGGCACTGGATCTTCTCGGGGTGGGGGGGTTTGTGAGCGCTGACCCCCTGGGCTGTCTCCACCTTGTCCTGGGGCTGGGTCCCCGGACGACGCGGCCACAGCTCCTGGGAGAGTGGCCAGCCCTCGGACAGCTGTGAGCCCCCACGGGGGTGTCTGGGTTCGAGGCCACGTTGCAGACCCGCTGGCTGCTGGGGCTCAGGGAGGAAATGACCTGGCCTCCTGGAGCTTCAGATTCCTCATCTGTGTGCTGAGGGAAGGGGCACATCTCGGAGCCTGGGGACTCCCGGCGTGTGGGCTGCTTGCCTGGCACCCGCTCACCCAGGAGTTGTCCTTGCTGTGGGCTCTGAGCCTCCGGGATGGAGTGGGGCTGAGAGCGTGTCCACCACCTCCACCACATCAGCCTGTCCCTGGTCCTGCTCCGCCAGATGACAAATCTCTGGGAAATCTTCTTTAATTTTGTTCTCTGGGAAGTGGTAGGTTTTGGAGA, the output has an alignment score of 136 with the CIGAR string being DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDMMMMXMXMXMXMMXXXXXMMXMMMXMMMXMMMMXMXMMMMXMXMMMMMMMMXMMMMMMMMMXMMXMMXMMMXMMMMMMMMMMMMMMMMMMMMMXMMMMMMXMMMMMMMMMMMMDMMMMMMMMMMMMXMMMXMMMMMMMMMMMMMMMMMMMXDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD.

The optimal alignment should have a score of 160 against this substring in text GGGGTGAGGCTGGTGAATGGGGTGAGGGTGGTGAGTGGGGTGAGGCTGGTGAGTGGGGGTGAGGCTGGTGAATGGGGTGAGGGTGGTGAGTGGGGTGAGGGTGGTGAGTGGGGGTGAGGGTGATGAGTGGGGTGAGGGTGGTGAGTGGGGT with the following CIGAR string MMMMXMXMXMXMMXXXXXMMXMMMXMMMXMMMMXMXMMMMXMXMMXMMMMMXMMMMMMMMMXMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMXMMMMMMMMMMMMMMMDMMMMMMMMMMMMXMMMXMMMMMMMMMMMMMMMMMMMMMMMM.

Upon inspection, the suboptimal alignment has a better prefix compared to the optimal alignment as shown below (replaced M with = for visualization): ====X=X=X=X==XXXXX==X===X===X====X=X====X=X==X=====X=========X===============================X===============D============X===X======================== optimal cigar ====X=X=X=X==XXXXX==X===X===X====X=X====X=X========X=========X==X==X===X=====================X======X============D========M===X===X===================X suboptimal cigar It seems like a nonzero match score plus a better prefix in the suboptimal substring causes the program to keep extending the suboptimal wavefront as long as the alignment score doesn't drop below the second best wavefront. Not sure how easily this can be fixed. My initial thought is maybe we need to consider the potential of each wavefront to surpass the best wavefront under nonzero match scoring scheme.

This test was performed with the following c++ template

#include "bindings/cpp/WFAligner.hpp"
WFAlignerGapAffine aligner(-2,4,6,2,WFAligner::Alignment,WFAligner::MemoryHigh);
int freelen = text.size() - pattern.size();
aligner.alignEndsFree(text, freelen, freelen, pattern, 0, 0);
cigar = aligner.getAlignment();
score = aligner.getAlignmentScore();

Thanks, Tony

joyeuxnoel8 commented 2 hours ago

Another note, I tried using match score=0 with WFAlignerGapAffine aligner(0,4,6,2,WFAligner::Alignment,WFAligner::MemoryHigh); for the above example, but the program didn't return any reasonable alignment and just aligned pattern to the end of the text. The output alignment score is -310 with CIGAR DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDXMMMMXMIMXXMXMMMMMMMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMMMMXMMMXMMXMMIIIIIIIIIIXMMMXMMMIIIIIIIIIIMXXXXXXXXMXXMXXXXXMXMMXXMMIMXXXMMMIIMDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD while the optimal alignment CIGAR would have a score of -91 under this scoring scheme.

Not sure if I missed anything. Any help would be appreciated. Thanks!