vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Fix misplaced indel edits #4320

Closed adamnovak closed 4 days ago

adamnovak commented 4 days ago

If you map read S1_73477 from the 1m simulated R10 reads with vg ef2a1384b, like this:

READ_NAME=S1_73477
vg filter --exact-name -n "${READ_NAME}" /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/reads/sim/r10/HG002/HG002-sim-r10-1m.gam >read.truth.gam
GRAPH_BASE=/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9
MINPARAMS=k31.w50.W
vg giraffe -t16 --parameter-preset r10 --track-provenance --track-correctness --progress --show-work -G read.truth.gam -Z ${GRAPH_BASE}.gbz -d ${GRAPH_BASE}.dist -m ${GRAPH_BASE}.${MINPARAMS}.withzip.min -z ${GRAPH_BASE}.${MINPARAMS}.zipcodes -x ${GRAPH_BASE}.xg >remapped.gam 2>log-${READ_NAME}.txt

Then the alignment doesn't validate:

vg validate -A -a remapped.gam /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz

Invalid Alignment:

Length of node 4465278 (1) exceeded by Mapping with offset 0 and from-length 4:
{"edit": [{"from_length": 1, "to_length": 1}, {"sequence": "AAGG", "to_length": 4}, {"from_length": 3}], "position": {"node_id": "4465278"}, "rank": "63"}
alignment: invalid

This is because a "from_length": 3 deletion edit, which might make sense in a mapping to node 4465277, is smushed into the mapping for the previous node, node 4465278. That node is only 1 base long, but the mapping's edits consume 1 + 0 + 3 = 4 graph bases from it. The deletion also happens to come immediately after an insertion edit; generally we don't want those to abut.
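For illustration, the length check that fails here can be reproduced by hand on the JSON mapping from the error message. This is a minimal Python sketch, not vg's actual validation code; the helper names `mapping_from_length` and `overflows_node` are hypothetical, and the node length of 1 is taken from the `vg validate` output above.

```python
import json

def mapping_from_length(mapping):
    """Sum the graph bases consumed by all edits in a mapping."""
    # Insertions carry no "from_length" field, so they count as 0.
    return sum(int(edit.get("from_length", 0)) for edit in mapping.get("edit", []))

def overflows_node(mapping, node_length):
    """Return True if the mapping consumes more bases than the node has left."""
    # A missing offset means offset 0; int() also accepts string-encoded values.
    offset = int(mapping.get("position", {}).get("offset", 0))
    return offset + mapping_from_length(mapping) > node_length

# The offending mapping, copied from the error message above.
mapping = json.loads(
    '{"edit": [{"from_length": 1, "to_length": 1}, '
    '{"sequence": "AAGG", "to_length": 4}, {"from_length": 3}], '
    '"position": {"node_id": "4465278"}, "rank": "63"}'
)

total = mapping_from_length(mapping)
# Node 4465278 has length 1 according to the validator.
overflow = overflows_node(mapping, node_length=1)
print(total, overflow)
```

The match edit alone already uses up the single base of the node, so any further edit with a nonzero from-length would have to belong to a later mapping.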

We have to figure out how this base-level alignment is being generated, and at least get the edit into the right mapping, if not prohibit the adjacent indels entirely.
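As a rough aid for finding other affected reads, adjacent indels within a mapping can be flagged from the JSON form of an alignment. This is only a sketch under assumptions: `edit_kind` and `abutting_indels` are hypothetical helper names, not vg functions, and the classification follows the usual reading of edits (no from-length means insertion, no to-length means deletion).

```python
def edit_kind(edit):
    """Classify an edit dict as match, substitution, insertion, deletion, or other."""
    from_len = int(edit.get("from_length", 0))
    to_len = int(edit.get("to_length", 0))
    if from_len == 0 and to_len > 0:
        return "insertion"
    if from_len > 0 and to_len == 0:
        return "deletion"
    if from_len == to_len:
        # Equal lengths with a replacement sequence is a substitution.
        return "substitution" if edit.get("sequence") else "match"
    return "other"

def abutting_indels(edits):
    """Return index pairs (i, i + 1) where an insertion and a deletion are adjacent."""
    kinds = [edit_kind(edit) for edit in edits]
    return [
        (i, i + 1)
        for i in range(len(kinds) - 1)
        if {kinds[i], kinds[i + 1]} == {"insertion", "deletion"}
    ]

# Edits from the invalid mapping reported by vg validate.
edits = [
    {"from_length": 1, "to_length": 1},
    {"sequence": "AAGG", "to_length": 4},
    {"from_length": 3},
]
pairs = abutting_indels(edits)
print([edit_kind(e) for e in edits], pairs)
```

This only looks within a single mapping; an insertion at the end of one mapping followed by a deletion at the start of the next would need the edits of consecutive mappings concatenated first.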

Until this is fixed, surject can't process the read.

adamnovak commented 4 days ago

This is fixed in 1b52afedc4de8835cefcc472e2faf05c3781398a..3277498e2768962fb799975c26f55696ca4ffcfd on the lr-giraffe branch.