rvaser / spoa

SIMD partial order alignment tool/library
MIT License
158 stars 32 forks source link

Unexpected multiple sequence alignment #7

Closed nh13 closed 3 years ago

nh13 commented 6 years ago

I am seeing an unexpected MSA and consensus sequence. By eye, I can see a more parsimonious result. See below for details. Any insight would be appreciated.

Actual Output ``` Consensus (142) CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT Multiple sequence alignment CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC------------------------------------ CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGG------------------C----------------------------------- CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGG------------------CCTG-------------------------------- CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAAC----AGCCAGGCTATCCCCAGCCCTTACCGGCGTGT CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAAC----AGCCAGGCTATCCCCAGCCCTTACCGGCGTGT ```
Expected Output ``` Consensus (142) CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGGC----TATCCCCAGCCCTTACCGGCGTGT Multiple sequence alignment CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC-------------------------------- CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGC------------------------------------------------- CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTG---------------------------------------------- CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT ```
Source: example.cpp ``` #include "spoa/spoa.hpp" int main(int argc, char** argv) { std::vector sequences = { "CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTGCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC", "CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGC", "CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTG", "CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT" }; auto alignment_engine = spoa::createAlignmentEngine(static_cast(atoi(argv[1])), atoi(argv[2]), atoi(argv[3]), atoi(argv[4])); auto graph = spoa::createGraph(); for (const auto& it: sequences) { auto alignment = alignment_engine->align_sequence_with_graph(it, graph); graph->add_alignment(alignment, it); } std::string consensus = graph->generate_consensus(); fprintf(stderr, "Consensus (%zu)\n", consensus.size()); fprintf(stderr, "%s\n", consensus.c_str()); std::vector msa; graph->generate_multiple_sequence_alignment(msa, true); fprintf(stderr, "Multiple sequence alignment\n"); for (const auto& it: msa) { fprintf(stderr, "%s\n", it.c_str()); } return 0; } ``` Spoa commit : 783d7b6925375c8fe486c03d2f271bf65b7dc6f5 Compiled with `g++ example.cpp -std=c++11 -Iinclude/ -Lbuild/lib/ -lspoa -o example` Run with `./example 0 5 -4 -8`
rvaser commented 6 years ago

Hi Nils, the MSA is correct because you are running local alignment. You might get your expected output if the order of sequences added to spoa is different (1-4-2-3) or try global alignment.

Best regards, Robert

nh13 commented 6 years ago

Ah, I missed that the leading zero on the command line was specifying local alignment!

nh13 commented 6 years ago

@rvaser where's the best place to understand the difference between local, global, and semi-global (glocal)? I understand them for pairwise alignment having written many short-read mappers myself, but obviously not for MSA.

nh13 commented 6 years ago

Also, I tried with a different order and it gave me my expected output. Using global or glocal did not. I guess I’ll have to go back to the Lee at al papers to understand why. For my purposes, it is clearly better to have my expected output. Perhaps I need to post process by doing pairwise alignment to the consensus.

nh13 commented 6 years ago

A good overview of POA: http://simpsonlab.github.io/2015/05/01/understanding-poa/ but o still want to understand local/global/glocal for POA. Likely the issue is the topological sort and the order of input sequences.

rvaser commented 6 years ago

The simpsonlab tutorial is great! Although topological sort can have impact on the resulting consensus, the order of input sequences is the main culprit. Try the Lee at al papers :)

nh13 commented 6 years ago

@rvaser I still don't get why my expected output isn't preferred.

Not counting the trailing dashes ('-'), the end of the alignments all match the consensus sequence, with only the first sequence having two mismatches at the end (ACAG instead of CCTG). The actual output not only has mismatches in the first sequence but long deletions (dashes/'-') in second and third sequence. Should these internal deletions be penalized when the consensus doesn't contain them? Then they have in insertion (relative to the consensus) which should also be penalized. Would you not agree the expected output I show is a more parsimonious MSA?

Take a look at global alignment:

./example 2 5 -4 -8
Consensus (136)
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
Multiple sequence alignment
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC---------------------------------
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGT-----G-G-----------C---------------------------------
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGT-----G-G-----------CCTG------------------------------
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCA-ACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGT-----G-GCTCCTTACCA-ACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT

From this discussion, what I realize I want is to call the consensus sequence using the local algorithm, and then go back and perform pairwise alignments of the input sequences back to the consensus in a glocal manner (align the full sequences to a sub-sequence of the consensus). The partial order graph would be constructed from those pairwise alignments (I think).

nh13 commented 6 years ago

@rvaser thank-you kindly for bearing with me in this discussion. I implemented my idea of re-doing the MSA after calling the consensus, but performed local alignment each time.

example2.cpp ```c++ #include "spoa/spoa.hpp" std::string create_consensus(std::unique_ptr &alignment_engine, const std::vector &sequences) { auto graph = spoa::createGraph(); for (const auto& it: sequences) { auto alignment = alignment_engine->align_sequence_with_graph(it, graph); graph->add_alignment(alignment, it); } return graph->generate_consensus(); } void redo_msa(std::unique_ptr &alignment_engine, std::vector &sequences, const std::string &consensus) { auto graph = spoa::createGraph(); sequences.insert(sequences.begin(), consensus); for (const auto& it: sequences) { auto alignment = alignment_engine->align_sequence_with_graph(it, graph); graph->add_alignment(alignment, it); } fprintf(stderr, "Consensus (%zu)\n", consensus.size()); fprintf(stderr, "%s\n", consensus.c_str()); std::vector msa; graph->generate_multiple_sequence_alignment(msa, false); fprintf(stderr, "Multiple sequence alignment\n"); for (const auto& it: msa) { fprintf(stderr, "%s\n", it.c_str()); } } int main(int argc, char** argv) { std::vector sequences = { "CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTGCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC", "CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGC", "CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTG", "CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT" }; auto alignment_engine = spoa::createAlignmentEngine(static_cast(atoi(argv[1])), atoi(argv[2]), atoi(argv[3]), atoi(argv[4])); std::string consensus = create_consensus(alignment_engine, sequences); redo_msa(alignment_engine, sequences, consensus); return 0; } ``` `g++ testing.cpp -std=c++11 -lspoa -o testing && ./testing 0 5 -4 -8` `./testing 0 5 -4 -8`

Output:

Consensus (142)
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
Multiple sequence alignment
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC--------------------------------
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGC-------------------------------------------------
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTG----------------------------------------------
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
rvaser commented 6 years ago

The problem with global alignment on this particular example is that my backtrack prefers matches over indels (in all alignment modes) and picks the last C character aligned to the end of first sequence (your expected alignment is suboptimal, although there is a "better" alignment with the same score). See bellow:

Printed output:

CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGT-----G-G-----------C

Your expected output with a smaller score due to the mismatch A-C on last character:

CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGC-----------------

"Better" alignment with the same score as the outputed:

CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGG-C----------------

You can also do something like this (faster approach): use a fast mapper (minimap) to find begin/end locations of each individual sequence to the backbone (first sequence) and then run global alignment on the subgraph determined by the begin/end locations. I am not sure if that is suitable for your use case, but that is what we are using in racon.

rvaser commented 6 years ago

Also, Gotoh alignment might work better on this example but I removed it from spoa in the last big update. You can use commit 79c8f1db59e04829c3ebb79b35baa575b33933b8 and try it out (the old version is slower though).

nh13 commented 6 years ago

@rvaser yes, affine gap penalties make a lot more sense from my perspective. Was it just a speed issue?

rvaser commented 6 years ago

It uses more memory due to more matrices that are needed.

nh13 commented 6 years ago

Do you find that in nanopore data that affine gap penalties aren't that useful for modeling error?

rvaser commented 6 years ago

I am not sure. Given enough amount of data, affine gap penalties and simple gaps had approximately the same accuracy in our consensus tool so we decided for the latter.

ekg commented 5 years ago

It uses more memory due to more matrices that are needed.

Have you considered adding affine gaps?

The memory use is an issue, but it's at least a constant factor (+2x the memory if you store only the main score matrix).

@nh13 we've implemented this in https://github.com/vgteam/gssw, but that is only the alignment component, not the POA generator (which is given by vg msga).

rvaser commented 5 years ago

Yes, affine gaps are now implemented. Convex (piece-wise affine) as well!

Best regards, Robert

ekg commented 5 years ago

@rvaser awesome!