vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 194 forks source link

Snarl cutting is insufficient in the presence of snarls with variable-length traversals #2042

Open adamnovak opened 5 years ago

adamnovak commented 5 years ago

I have a read in my TOPmed hard reads set, seed_118_fragment_111607_1, that has a single T->C error and, in the primary reference, has a mappable 100-mer overlapping every base, and maps correctly against the primary graph. You should be able to reproduce this with something like this using my AWS snapshot of the TOPmed data. I did it with my branch https://github.com/vgteam/vg/pull/2025 but I think the underlying problem is present in master already:

vg filter -n seed_118_fragment_111607_ true-dataset.gam > test.gam

time vg mpmap -x ../final/chr17.xg -g ../final/chr17.gcsa -H ../final/chr17.gbwt -s ../final/chr17.snarls --single-path-mode --force-haplotype-count 108070 -I 565.107368 -D 164.628724 -B -i -G test.gam  2>log.bad.txt | vg annotate -p -x ../final/chr17.xg -a - > remapped.bad.gam

time vg mpmap -x NA12878/primary.xg -g NA12878/primary.gcsa --single-path-mode -I 565.107368 -D 164.628724 -B -i -G test.gam 2>log.good.txt | vg annotate -p -x NA12878/primary.xg -a - > remapped.good.gam

vg view -aj remapped.good.gam | jq -c '[.name, .score, .mapping_quality, .refpos]'

["seed_118_fragment_111607_1",156,22,[{"offset":"15680995","name":"NC_000017.11"}]]
["seed_118_fragment_111607_2",158,22,[{"offset":"15681101","is_reverse":true,"name":"NC_000017.11"}]]

vg view -aj remapped.bad.gam | jq -c '[.name, .score, .mapping_quality, .refpos]'

["seed_118_fragment_111607_1",146,21,[{"offset":"18701166","is_reverse":true,"name":"NC_000017.11"}]]
["seed_118_fragment_111607_2",153,21,[{"offset":"18701060","name":"NC_000017.11"}]]

It maps to the wrong place on the TOPmed graph, even with haplotype information, because the correct mapping ends up with the wrong local alignment, because even with snarl cutting on and the max cuttable snarl size turned all the way up, the MEMs take the wrong path through a snarl and the MultipathMapper is not able to fix it.

Basically, there's a 9 base deletion in the graph, and the T->C error in the read makes one of the read's two MEMs against the reference look like it takes the deletion edge before its last 3 bases. It then ends up abutting the other MEM in the reference, with 9 bases of non-MEM sequence between them in the read. The final alignment ends up saying that the read takes this deletion, and then later has an insertion of the same length, rather than just having the single-base error that is actually there.

Snarl cutting successfully cuts out parts of the anchor paths that are inside snarls, and can divide up the MultipathAlignmentGraph so you have different PathNodes before and after this deletion. But it doesn't do anything about the connectivity; there's still just one way to get through the MultipathAlignment: going through the PathNode sitting on the end node of the deletion snarl, with a perfect match to the erroneous base. I don't think there's any way to go around or skip over PathNodes, so the whole alignment gets pulled way off what the primary path/read diagonal would be by this bad anchor that snarl cutting doesn't remove.

I'm not quite sure how to fix this, but I hope that @jeizenga has some ideas. We might need to make snarl cutting prune out some PathNodes outside the snarl's contents, when there are noticeably different-length snarl traversals. Or we could just add a bunch more edges to the PathNode graph somehow and let the traceback on the MultipathAlignment work out that this is an expensive and terrible local alignment.

jeizenga commented 5 years ago

Seems to me this is a problem less with snarl cutting and more with using paths as anchors in a graph. I've seen similar alignment issues in the past, but I haven't come up with a particularly principled approach to this yet.

Perhaps we could trim the ends of anchors if they extend beyond a branch point by some small number of bases.

adamnovak commented 5 years ago

OK, I've investigated 10 randomly sampled reads that are wrong with nonzero MAPQ when mapping to the graph with haplotype-consistent traceback, and which are correct when mapping to the primary graph. In this sample, 8/10 of the wrong placements are caused by this problem. The mapper picks one way of aligning through an indel, and refuses to consider any others, even when this produces an obviously wrong local alignment of the read.

Read Reason
seed_127_fragment10161 bad local indel alignment
seed_97_fragment34745 missing good secondary hits
seed_136_fragment75707 bad local indel alignment
seed_119_fragment117043 bad local indel alignment
seed_129_fragment52542 bad local indel alignment
seed_144_fragment149457 bad local indel alignment (missing equally-scoring haplotype-consistent linearizations)
seed_120_fragment133784 bad local indel alignment (missing equally-scoring haplotype-consistent linearizations)
seed_99_fragment48366 bad local indel alignment
seed_142_fragment109360 better, wrong alignment to node not used in any haplotype
seed_91_fragment69364 bad local indel alignment

Sometimes the mapped decides to take a deletion or an insertion that is in the graph to avoid a SNP, and then forces the alignment to pay more than the SNP would cost for the compensating indel. In these cases the haplotype stuff is irrelevant; it produces a flat-out wrong local alignment just against the graph.

In other cases, it decides to take one branch of an indel instead of another to align the tail of a read, when the two options are indistinguishable from the read's point of view. If it picks the branch representing a very rare haplotype, this upsets haplotype consistency scoring, because the other, more haplotype consistent alignment is invisible to it.

I also had one case where a read was wrong because an incorrect hit had a SNP that matched an error in the read. But since that SNP in the graph had no haplotypes visiting it, we couldn't use haplotype scoring to reject the hit. We should probably remove all alt alleles that aren't used by haplotypes during graph construction.

The last read was wrong because we didn't get the same set of hits as we got against the primary graph. We got a bunch of very similar looking hits in either case; I think it could be mostly luck that we got it right against primary and didn't get a tied secondary against the graph.

@jeizenga I think we need to talk about some way to improve the anchor heuristic so we don't get bad local alignments like this. When are you back?

adamnovak commented 5 years ago

Some of the compensating indels are relatively distant from the indel they are fixing. I saw one on the other side of several SNP snarls.

adamnovak commented 5 years ago

OK, I've thought about this a bit, and I think that what we want to do is something like:

  1. Split anchors on node boundaries.

  2. Do a search out from the existing anchors, looking for all possible anchors that can be created without passing through more than X mismatches (or, allowing for gaps, without passing through edits with a cost of more than X). This will create anchors for all the tied placements of tails, as well as for places where the anchor took one path over another to avoid a SNP. If we allow for multiple mismatches, or small indels, it will account for cases where we took one path over another to avoid those as well. We shouldn't end up visiting any (ref base, read base) pairing more than once in any direction, if we prioritize the search queue by cost-to-reach.

  3. Reconstruct the reachability edges for the new anchors. I'm not sure how hard this will be, or if we can somehow preserve or carry over reachability information. Maybe we can just do all this before the reachability information is initially computed?

  4. Snarl cutting. This might be somewhat superfluous, since the search will also create some diversity of anchors within snarls. But I think we still need it.

  5. Continue with the rest of the existing algorithm.

Instead of being willing to take an in-graph indel and a compensating correcting indel of unlimited cost to avoid 0 cost in the alignment of the anchoring sequence (i.e. because the original anchor picked the wrong branch of a tie), we would be able to guarantee that we won't take an in-graph indel and compensating indel unless it's "worth it". In the gapless case, "worth it" would mean that it would have to let us avoid X+1 mismatches. We'd still be computing local alignments as if an indel of any length was worth X+1 mismatches, but that would be better than what we have now, where an indel of any length appears free when the anchors happen to force it.

We'd also want to allow tails to be aligned not just from the outermost anchors, but also from other nearby anchors, especially if the last anchor goes into a snarl. One of the reads I looked at had one of its placements look bad in light of the haplotypes because its final base was a mismatch, and it got placed on a rare insert that happened to match it. I don't think the aligner was able to consider having it be softclipped or treated as a mismatch against the next common base, when it was generating alternative tracebacks.