vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

banded global aligner seems to prefer huge gaps at the ends of reads over matches #531

Open ekg opened 7 years ago

ekg commented 7 years ago

@jeizenga maybe you have some insight into this. I've spent the past day trying to figure it out from the side of the mapper itself and I'm now convinced that the problem is in the realm of the banded global aligner.

Most of the discrepancy between the new and old alignment algorithms is now down to this strange "soft clip avoidance" artifact. It looks like the banded global aligner prefers long gaps at the ends of reads over SNPs or even matches (as in this case). Hopefully this example will explain what I mean.

Get the graph:

echo '{"node": [{"sequence": "T", "id": 8296}, {"sequence": "G", "id": 8298}, {"sequence": "A", "id": 8299}, {"sequence": "CCTCC", "id": 8300}, {"sequence": "C", "id": 8301}, {"sequence": "T", "id": 8302}, {"sequence": "A", "id": 8303}, {"sequence": "G", "id": 8304}, {"sequence": "GGTT", "id": 8305}, {"sequence": "G", "id": 8306}, {"sequence": "C", "id": 8307}, {"sequence": "A", "id": 8308}, {"sequence": "A", "id": 8309}, {"sequence": "T", "id": 8310}, {"sequence": "G", "id": 8311}, {"sequence": "C", "id": 8312}, {"sequence": "T", "id": 8313}, {"sequence": "GATTCTCCTGCCTCAGCCTCC", "id": 8314}, {"sequence": "CAAGTAGCTGAGATTACAGGT", "id": 8315}, {"sequence": "G", "id": 8316}, {"sequence": "C", "id": 8317}, {"sequence": "T", "id": 8318}, {"sequence": "CCGCCACCACACCCAGCTAATTTTT", "id": 8319}], "edge": [{"from": 8296, "to": 8298}, {"from": 8296, "to": 8299}, {"from": 8298, "to": 8300}, {"from": 8299, "to": 8300}, {"from": 8300, "to": 8301}, {"from": 8300, "to": 8302}, {"from": 8301, "to": 8303}, {"from": 8302, "to": 8303}, {"from": 8301, "to": 8304}, {"from": 8302, "to": 8304}, {"from": 8303, "to": 8305}, {"from": 8304, "to": 8305}, {"from": 8305, "to": 8306}, {"from": 8305, "to": 8307}, {"from": 8306, "to": 8308}, {"from": 8307, "to": 8308}, {"from": 8308, "to": 8309}, {"from": 8308, "to": 8310}, {"from": 8309, "to": 8311}, {"from": 8310, "to": 8311}, {"from": 8311, "to": 8312}, {"from": 8311, "to": 8313}, {"from": 8312, "to": 8314}, {"from": 8313, "to": 8314}, {"from": 8314, "to": 8315}, {"from": 8315, "to": 8316}, {"from": 8316, "to": 8317}, {"from": 8316, "to": 8318}, {"from": 8317, "to": 8319}, {"from": 8318, "to": 8319}]}' | vg view -Jv - >k3.vg

Align with banded alignment (you'll need the changes from the map-good branch to make this work):

vg view -dA <(vg align -s ACCTCCCAGGTTCATGTGATTCTCCT -b k3.vg ) k3.vg | dot -Tpng -o x.png

image

Here's the regular GSSW alignment algorithm:

vg view -dA <(vg align -s ACCTCCCAGGTTCATGTGATTCTCCT k3.vg ) k3.vg | dot -Tpng -o y.png

image

Usually the global aligner seems to be helping. But it has this odd quirk. Any ideas?

jeizenga commented 7 years ago

@ekg Can you also get the list of MEMs it's threading? I have a feeling that this is a short MEM that randomly matched something a little ways outside of the correct alignment. If so, the banded aligner will still try to connect the MEMs at whatever cost is necessary. It doesn't really make any attempt to decided if the MEM is reasonable.

ekg commented 7 years ago

Initially I thought this too, but it isn't what's happening. This example it isn't using indexes or MEMs. It is just aligning the sequence against the graph with the pairwise alignment algorithms. Note the use of vg align in the examples. The -b flag currently exists only on my branch. I'll try to pull even with master today so it'll be easier for you to test.

ekg commented 7 years ago

I get that the global mapper will produce a full-length alignment through the graph (which is why it's so useful!) but here it looks like there is some edge condition that's causing a single-base match to be registered at the end of the graph. Is this clear from the first example rendering? It may be hard to see so you'll need to check out the image and run the example.

jeizenga commented 7 years ago

Ah, I see. Well, I think this is fundamentally the same sort of problem. The banded global aligner will stretch the alignment to a sink node and a source node, so there really isn't such thing as a soft clip. Because of that, it's indifferent to the alignment produced by gssw and the one you don't like. The global aligner pays for the loooong gap in either scenario, so that last T match could just as easily come after it as before it.

Right now the traceback is written to break tied scores arbitrarily in favor of matches. In situations like this one where a match can occur in two places with equal score, this tie-breaking strategy will place the match in the position farthest to the right. I'm pretty sure that's what's causing this pattern.

ekg commented 7 years ago

There is no difference in the number of mismatches for the two solutions--- the local alignment finds a perfect match for that last base without a gap.

The global banded alignment is basically opening two "gaps" (one for the long gap and the other at the end of the read) while the local alignment only opens one (the end of the read). That looks like unintended behavior. Is the banded alignment forced to pay a gap penalty at the ends? Could that harmonize this result?

I can post-process to detect and modify this case into a soft clip. I have tried to change the way that the graph is collected to make sure there isn't a long unalignable chunk at the end but it is not trivial to do.

I can provide you with about a hundred examples of this issue from 10k reads. It isn't always in cases where the alignment without the gap has more mismatches.

On Fri, Nov 4, 2016, 04:39 Jordan Eizenga notifications@github.com wrote:

Ah, I see. Well, I think this is fundamentally the same sort of problem. The banded global aligner will stretch the alignment to a sink node and a source node, so there really isn't such thing as a soft clip. Because of that, it's indifferent to the alignment produced by gssw and the one you don't like. The global aligner pays for the loooong gap in either scenario, so that last T match could just as easily come after it as before it.

Right now the traceback is written to break tied scores arbitrarily in favor of matches. In situations like this one where a match can occur in two places with equal score, this tie-breaking strategy will place the match in the position farthest to the right. I'm pretty sure that's what's causing this pattern.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/531#issuecomment-258337862, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4ERpBrvhgSp1vZVa_0wRgEqc8Fam4ks5q6qjagaJpZM4KnedN .

jeizenga commented 7 years ago

The banded aligner is required to pay for the two gaps, but that behavior is absolutely intended. If we allow it to soft clip for free then it won't accurately connect MEMs in the threader, which is the use case it was designed for. It has to be able to take negative scores on either end of the alignment (in fact, this will be the normal case since the bases at the start and end of the subgraph should have broken a MEM). The gap penalty at the ends is necessary.

I'm not sure what your goals are for the vg align interface, but post-processing would be an option. That said, I'm tempted to say that if you want soft clips you shouldn't be using a global aligner in the first place. If you want soft clips on one end, that's what the pinned aligner is for.

It might help me to see how this is actually happening in a read to see why this is a problem. Right now this looks like it's working how it's supposed to.

ekg commented 7 years ago

@jeizenga sorry, I think I'm confusing things.

The problem here is really particular and isn't exactly related to soft clipping. GSSW finds an exact match for the entire pattern against the graph which is completely contiguous.

This last mapping from the local alignment is what I'm referring to:

vg align -s ACCTCCCAGGTTCATGTGATTCTCCT k3.vg -j | jq . 
...
      {
        "position": {
          "node_id": 8314
        },
        "edit": [
          {
            "from_length": 9,
            "to_length": 9
          }
        ],
        "rank": 11
      }
...

It has no mismatches. Just 9 matches and it's done.

image

If we map with the banded aligner we get this:

image

And our last mapping doesn't include the possible match.

vg align -s ACCTCCCAGGTTCATGTGATTCTCCT k3.vg -b -j | jq .
...
      {
        "position": {
          "node_id": 8314
        },
        "edit": [
          {
            "from_length": 8,
            "to_length": 8
          },
          {
            "from_length": 13
          }
        ],
        "rank": 12
      },
....

The deletion goes on for a while and then we get a 1bp match.

I understand that the ends of the alignments will be deletions given that we're working with a global alignment mode and my read is shorter than the path through the reference graph from source to sink. That's cool. I also (happily) accept that we don't have soft clipping, and see that this could be added in post processing if it seems sensible.

However, what's confusing me is that it seems the global aligner is preferring an alignment with an embedded gap when there is a perfect match through the read that's available. Intuitively, it seems this is because the gap must be opened and extended anyway.

If there were a penalty for closing the gap, the alignment with the sequence matched contiguously might outweigh the one found by the current implementation.

Encouraging a contiguous mapping (especially where there are no mismatches or gaps in a contiguous mapping) is really critical to the application of vg to short reads. Not having this is going to require some workaround hacks in the mapper itself. Unless we can exactly guess where in the graph a read end will map, many of them are going to tell a very confusing story--- and that's going to be true for all sequences, not just short ones.

I think I see how the banded aligner would have to change to fix this. Is the gap close penalty a sensible idea?

jeizenga commented 7 years ago

What I'm trying to say is that from the perspective of global alignment, those alignments are actually truly correctly equivalent. They have the same number of matches, they have the same number of gaps, and their gaps are the same lengths, so they have the same score. This isn't a mistake. You can think of this as the same problem as where to put a gap in a homopolymer (in this case two T's), except that global alignments allow the ambiguous placement to play out over relatively large distances.

The banded global aligner left-justifies gaps consistently, which means that it will always choose to move gaps further to the left into the read when there are ties. In the traceback loop, you can only left-justify gaps or right-justify gaps. You seem to want to center-justify the gaps. If that's really something you want, you will need to post-process the alignments. There's no way to do it in the traceback before you know what the whole alignment looks like.

If anything, I think this highlights that the global aligner isn't really meant to align full-length reads to parts of the graph. You need to know that the read aligns to the full length of the graph as well. If you're expecting it to act like a local aligner, you probably shouldn't be using the global aligner.

I think it might help to put this test in the context of its intended use case. To do that, you have to imagine MEMs on either side of this global alignment. Throw a MEM on the right and that lonely T-match is now part of a string of matches, so the discontinuous patchy appearance goes away. It's still true that the T could be part of either string of matches, but either of these two alignment will "look okay".

ekg commented 7 years ago

I totally get the behavior and don't think it's a bug in the implementation. It works perfectly when we are embedded in between two matches. I'm already using it there.

In my hands, the local alignment of read ends is demonstrably worse than using the global aligner. So I'd really like to enable the global aligner for all DP in the mapper. This is usually working fine except using the global aligner for the tails of the read exposes the mapper to the problem described here.

Conceptually, the problem is that we are trying to align the reads against the entire genome, but the global aligner is not aware of this and so provides a global solution against only the fragment we've given it. As such, it isn't aware of the fact that a new gap must open beyond the end of the graph it's aligning to.

If there were a way to communicate that the graph extends beyond the piece we've given and that the piece we're aligning is the end of a read, then it seems logical that the global aligner could find alignment solutions that wouldn't fragment the read in this way.

I'll look at ways to resolve this through an extension to the global banded aligner that won't affect its usual behavior. I'm curious if @richarddurbin has any suggestions.

richarddurbin commented 7 years ago

You shouldn't be using a global alignment algorithm that requires going to the end of the graph. You should be using a variant of dynamic programming that requires going to the end of the read, but not the graph. This is called global-local in some parlance, but in in fact described I think in the original Needleman-Wunsch paper that people refer to for global alignment. The HMMER package is very clear about this - they look for complete matches to HMMs, that are local in the target sequence, e.g. full protein domains within protein sequences. See "Overlap matches" section on page 26 of our book. Full global in both graph and read is appropriate between MEMs. At the ends you want to allow the read to finish anywhere in the graph.

jeizenga commented 7 years ago

Oh, I didn't know we were talking about read ends. That's actually what I implemented the "pinned alignment" functions for. However, what I did is not quite what Richard is describing. The pinned alignment considers one end of the read and one node position to be fixed (in a global alignment sort of sense) and then allows soft clips on the other end. The idea was that you would be anchoring it at a MEM somewhere.

ekg commented 7 years ago

@richarddurbin quoting the section on overlap matches on page 27 from Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids:

Another type of search is appropriate when we expect that one sequence contains
the other, or that they overlap. This often occurs when comparing fragments
of genomic DNA sequence to each other, or to larger chromosomal sequences.
Several different types of configuration can occur, as shown here:

image

What we want is really a type of global alignment, but one that does not penalise
overhanging ends. This gives a clue to what sort of algorithm to use: we want
a match to start on the top or left border of the matrix, and finish on the right
or bottom border. The initialisation equations are therefore that F(i,0) = 0 for
i = 1,... ,n and F(0, j) = 0 for j = 1,... ,m, and the recurrence relations within
the matrix are simply those for a global alignment (2.8). We set Fmax to be the
maximum value on the right border (i,m),i = 1,...,n, and the bottom border
(n,j),j = 1,... ,m. The traceback starts from the maximum point and continues
until the top or left edge is reached.

@jeizenga it does seem like implementing these changes in the traceback and adding the 0-score row and column an the beginning will be tricky given the banding. What do you think?

jeizenga commented 7 years ago

It will definitely take some work. I have extra logic for the first column/row already to handle the case of a lead gap, so at least some of the edge cases you'd have to handle are already broken out at least. My intuition is that we'd have some work to do avoiding weird edge cases related to how you extract the subgraph too.

That said, I will point out that switching to global-local alignment is actually a pretty big conceptual shift. Basically we're giving up on soft/hard clips. I see the logic in doing that--ideally the entirety of the read should align somewhere in the graph. However, there are reasons to stick to the local alignment strategy we have been using. Next-gen reads are often shitty at the ends, so it can help mapping to give up at aligning them (although the quality adjusted alignments go some way towards remedying this problem). The situation is worse for chimeric reads and novel structural rearrangements, which will yield really horrible alignments if you force the entire read to align with a colinear algorithm. Finally, on the practical side, the pinned alignment that mimics local alignment for the MEM threader is already implemented and tested. Altering the banded aligner will take a lot more work.

ekg commented 7 years ago

Is the pinned alignment for the global banded algorithm working? In my hands it makes no difference or makes things worse. I'll investigate.

jeizenga commented 7 years ago

Hmm. Yeah, it's a little hard for me to comment without knowing more specifically how it's making things worse. We may have some debugging to do, but I do think that the pinned alignment is the natural way to mimic local alignment for the MEM threader.

ekg commented 7 years ago

I've basically been stuck on this. It's not reasonable to resolve this using post-processing. Using local alignment causes a significant decrease in performance. I'm going to just accept the artifact.

@jeizenga we should talk soon about how I can implement an update to the banded global aligner that will resolve it.

jeizenga commented 7 years ago

Okay. Are you planning on being at the vg meeting tomorrow? Also, do you have some examples of the decreased performance using the pinned alignment / instructions for how I could reproduce it? I'm really confused about how it could be hurting us--unless there's a bug in there somewhere.

ekg commented 7 years ago

@jeizenga it introduces a deletion into a huge fraction of the alignments which would have soft clips. This means they get much lower scores than they should (often they go negative), yielding errors in alignment selection. We will have to be careful to post-process the reads to handle this artifact, which is probably harder to do right than fixing it up front.

Meanwhile, our soft clipping is pretty aggressive in gssw, and so we miss good alignments (because they score very low due to e.g. a single mismatch).

The example I posted here is 100% representative of the issue. It should be easy to generate other examples... I have a few hundred handy.

I'm trying to merge even with master and drop this there so it'll be easier to test. Right now you can't easily work from the map-good branch. At least, I'm having a really hard time getting builds to work locally due to a lot of updates in the makefile.

jeizenga commented 7 years ago

@ekg The example here is using the banded aligner though, correct? I'm talking about the pinned alignment. The pinned aligner is run through gssw, so it should never return negative scores. If it is, there's definitely a bug.

ekg commented 7 years ago

@jeizenga the alignment here is using the un-banded and banded alignment without the pinning. It requires the version of vg align that's in the mem-threader branch to work. GSSW is not returning negative scores, only the banded aligner. Pinning with GSSW does seem to help slightly, but it still performs worse than pinning + banded global.