Closed nh13 closed 3 years ago
Hi Nils,
the MSA is correct because you are running local alignment. You might get your expected output if the order of sequences added to spoa
is different (1-4-2-3) or try global alignment.
Best regards, Robert
Ah, I missed that the leading zero on the command line was specifying local alignment!
@rvaser where's the best place to understand the difference between local, global, and semi-global (glocal)? I understand them for pairwise alignment having written many short-read mappers myself, but obviously not for MSA.
Also, I tried with a different order and it gave me my expected output. Using global or glocal did not. I guess I’ll have to go back to the Lee at al papers to understand why. For my purposes, it is clearly better to have my expected output. Perhaps I need to post process by doing pairwise alignment to the consensus.
A good overview of POA: http://simpsonlab.github.io/2015/05/01/understanding-poa/ but o still want to understand local/global/glocal for POA. Likely the issue is the topological sort and the order of input sequences.
The simpsonlab tutorial is great! Although topological sort can have impact on the resulting consensus, the order of input sequences is the main culprit. Try the Lee at al papers :)
@rvaser I still don't get why my expected output isn't preferred.
Not counting the trailing dashes ('-'
), the end of the alignments all match the consensus sequence, with only the first sequence having two mismatches at the end (ACAG
instead of CCTG
). The actual output not only has mismatches in the first sequence but long deletions (dashes/'-'
) in second and third sequence. Should these internal deletions be penalized when the consensus doesn't contain them? Then they have in insertion (relative to the consensus) which should also be penalized. Would you not agree the expected output I show is a more parsimonious MSA?
Take a look at global alignment:
./example 2 5 -4 -8
Consensus (136)
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
Multiple sequence alignment
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC---------------------------------
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGT-----G-G-----------C---------------------------------
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGT-----G-G-----------CCTG------------------------------
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCA-ACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGT-----G-GCTCCTTACCA-ACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
From this discussion, what I realize I want is to call the consensus sequence using the local algorithm, and then go back and perform pairwise alignments of the input sequences back to the consensus in a glocal manner (align the full sequences to a sub-sequence of the consensus). The partial order graph would be constructed from those pairwise alignments (I think).
@rvaser thank-you kindly for bearing with me in this discussion. I implemented my idea of re-doing the MSA after calling the consensus, but performed local alignment each time.
Output:
Consensus (142)
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
Multiple sequence alignment
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC--------------------------------
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGC-------------------------------------------------
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTG----------------------------------------------
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT
The problem with global alignment on this particular example is that my backtrack prefers matches over indels (in all alignment modes) and picks the last C character aligned to the end of first sequence (your expected alignment is suboptimal, although there is a "better" alignment with the same score). See bellow:
Printed output:
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGT-----G-G-----------C
Your expected output with a smaller score due to the mismatch A-C on last character:
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGC-----------------
"Better" alignment with the same score as the outputed:
CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC
CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGG-C----------------
You can also do something like this (faster approach): use a fast mapper (minimap) to find begin/end locations of each individual sequence to the backbone (first sequence) and then run global alignment on the subgraph determined by the begin/end locations. I am not sure if that is suitable for your use case, but that is what we are using in racon.
Also, Gotoh alignment might work better on this example but I removed it from spoa in the last big update. You can use commit 79c8f1db59e04829c3ebb79b35baa575b33933b8 and try it out (the old version is slower though).
@rvaser yes, affine gap penalties make a lot more sense from my perspective. Was it just a speed issue?
It uses more memory due to more matrices that are needed.
Do you find that in nanopore data that affine gap penalties aren't that useful for modeling error?
I am not sure. Given enough amount of data, affine gap penalties and simple gaps had approximately the same accuracy in our consensus tool so we decided for the latter.
It uses more memory due to more matrices that are needed.
Have you considered adding affine gaps?
The memory use is an issue, but it's at least a constant factor (+2x the memory if you store only the main score matrix).
@nh13 we've implemented this in https://github.com/vgteam/gssw, but that is only the alignment component, not the POA generator (which is given by vg msga).
Yes, affine gaps are now implemented. Convex (piece-wise affine) as well!
Best regards, Robert
@rvaser awesome!
I am seeing an unexpected MSA and consensus sequence. By eye, I can see a more parsimonious result. See below for details. Any insight would be appreciated.
Actual Output
``` Consensus (142) CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT Multiple sequence alignment CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC------------------------------------ CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGG------------------C----------------------------------- CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGG------------------CCTG-------------------------------- CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAAC----AGCCAGGCTATCCCCAGCCCTTACCGGCGTGT CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAAC----AGCCAGGCTATCCCCAGCCCTTACCGGCGTGT ```Expected Output
``` Consensus (142) CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGGC----TATCCCCAGCCCTTACCGGCGTGT Multiple sequence alignment CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCATCCACCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGCGGGCGCTGTGGACAGCGCTCCTTACCACC-------------------------------- CCCGCCCCTGAAAGCCTTCGCGCCCGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGC------------------------------------------------- CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCGGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTG---------------------------------------------- CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTG-CCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT CCCGCCCCTGAAAGCCTTCGCGCACGCTGCCCCTTCCTCCAGGCTGCCCCCCCTGGATGGACGGAGCAGCAGGGCCAGCGAGGGCGCTGTGGCCTGAGCTCCTTACCAACAGCCAGGCTATCCCCAGCCCTTACCGGCGTGT ```Source: example.cpp
``` #include "spoa/spoa.hpp" int main(int argc, char** argv) { std::vector