vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg snarls detect node in backward direction from vg graph? #763

Open shilpagarg opened 7 years ago

shilpagarg commented 7 years ago

I constructed a vg graph from a freebayes VCF. Why does vg snarls report snarls whose boundary nodes are in the backward direction? I would expect everything to be forward; is that not the case?

Example 1:
{"type": 1, "start": {"node_id": 83}, "end": {"node_id": 90}}
{"type": 1, "start": {"node_id": 85, "backward": true}, "end": {"node_id": 88, "backward": true}, "parent": {"start": {"node_id": 83}, "end": {"node_id": 90}}}
{"type": 1, "start": {"node_id": 90}, "end": {"node_id": 97}}

Example 2:
{"type": 1, "start": {"node_id": 500}, "end": {"node_id": 21315}}
{"type": 1, "start": {"node_id": 506, "backward": true}, "end": {"node_id": 21317, "backward": true}}
{"type": 1, "start": {"node_id": 506}, "end": {"node_id": 509}}
{"type": 1, "start": {"node_id": 509}, "end": {"node_id": 21318}}
{"type": 1, "start": {"node_id": 21320}, "end": {"node_id": 21321}}
{"type": 1, "start": {"node_id": 518, "backward": true}, "end": {"node_id": 21323, "backward": true}}
{"type": 1, "start": {"node_id": 518}, "end": {"node_id": 520}}
{"type": 1, "start": {"node_id": 520}, "end": {"node_id": 523}}

Is this some sort of bug, or am I missing some logical detail?

I also looked at the sequence of node 21317:

ATATAAGATACGAAATAGGGGTTGATAATTGCATGACAGTAGCTTTAGATCAAAAAGGAAAGCATGGAGGGAAACAGTAAACAGTGAAAATTCTCTTGAGAACCAAAGTAAACCTTCAT

It is present in the reference genome, so why is it backward? Moreover, why are biallelic SnarlTraversals backward?

Example:
{"visits": [{"node_id": 4240, "backward": true}], "snarl": {"start": {"node_id": 4241, "backward": true}, "end": {"node_id": 21874, "backward": true}}}
{"visits": [{"node_id": 4239, "backward": true}], "snarl": {"start": {"node_id": 4241, "backward": true}, "end": {"node_id": 21874, "backward": true}}} 

where the nodes are {"sequence": "G", "id": 4239} and {"sequence": "A", "id": 4240}.

Here is the graph and its corresponding snarls: https://transfer.sh/13Kcwr/yeast.illumina.SK1_Y12.covall.chrI.freebayes.X.vg https://transfer.sh/OqVcB/yeast.illumina.SK1_Y12.covall.chrI.freebayes.X.xg https://transfer.sh/5k0nm/yeast.illumina.SK1_Y12.covall.chrI.freebayes.X.snarls

jeizenga commented 7 years ago

Snarls have an equivalent representation with both nodes reversed and the start swapped with the end. It's probably not a bug, but you might want to make a visualization of the graph to be sure.

shilpagarg commented 7 years ago

I would expect everything to be in the forward direction because I constructed the vg graph from a VCF, which is ordered left to right.

Attached is a picture of example 1: ex85.pdf

If you are interested in more, you can just run vg find -n -c 10 -x to get a subgraph, which is super simple. Please correct me if I am wrong. Thanks.

jeizenga commented 7 years ago

As far as I can tell, this is a case of the representational equivalence I referred to, not a bug. I'll be more specific. Both of these are equivalent Snarls:

{"type": 1, "start": {"node_id": 85, "backward": true}, "end": {"node_id": 88, "backward": true}}
{"type": 1, "start": {"node_id": 88}, "end": {"node_id": 85}}

The invariant is that the "start" points into the Snarl and the "end" points out of the Snarl. The strandedness of the Snarl is arbitrary.
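To make the equivalence concrete, here is a minimal Python sketch that converts one representation into the other by operating on the JSON records shown above. The helper names `flip_visit` and `flip_snarl` are hypothetical and not part of vg. The sketch assumes that a missing "backward" key means forward (as in the JSON output above), and it leaves any "parent" field untouched.

```python
import json

def flip_visit(visit):
    """Return the same boundary read from the opposite strand."""
    # Copy every field except "backward", then toggle the orientation.
    flipped = {k: v for k, v in visit.items() if k != "backward"}
    if not visit.get("backward", False):
        flipped["backward"] = True  # forward is represented by omitting the key
    return flipped

def flip_snarl(snarl):
    """Swap start and end and flip both; other fields are copied as-is."""
    flipped = dict(snarl)
    flipped["start"] = flip_visit(snarl["end"])
    flipped["end"] = flip_visit(snarl["start"])
    return flipped

backward_form = json.loads(
    '{"type": 1, "start": {"node_id": 85, "backward": true}, '
    '"end": {"node_id": 88, "backward": true}}'
)
print(json.dumps(flip_snarl(backward_form)))
```

Applying `flip_snarl` twice returns the original record, which is another way of saying that the two forms describe the same Snarl.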

shilpagarg commented 7 years ago

Most probably, we need the right orientations in the assembly graph instead of arbitrary ones. For now, I can handle it for a vg graph constructed from a freebayes VCF because I know it runs left to right. But we need to get the orientations right for assembly graphs. Do you agree?

edawson commented 7 years ago

You can just reverse the snarl - create a new snarl, set its end to the original's start and its start to the original's end, then add the contents to the new snarl in reverse order and set is_reverse to false. These two snarls are considered identical: direction in/out isn't a defining characteristic of a snarl. You can loop over all snarls in the graph to set them in the forward direction if you'd like. Snarls_main does this when outputting them in "sorted" order.
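As a rough illustration of this reversal applied to the SnarlTraversal JSON from the original question, the Python sketch below rewrites a traversal into the forward form. It is not vg code: `forward_traversal` and its helpers are hypothetical names, and the rule "flip only when both snarl boundaries are backward" is an assumption that suits a graph built left to right from a VCF, not a general definition of forward.

```python
def _flip(visit):
    """Toggle the orientation of a visit, keeping all other fields."""
    out = {k: v for k, v in visit.items() if k != "backward"}
    if not visit.get("backward", False):
        out["backward"] = True  # forward is represented by omitting the key
    return out

def _is_backward(visit):
    return visit.get("backward", False)

def forward_traversal(trav):
    """Return an equivalent traversal whose snarl boundaries are forward.

    Assumption: a record is rewritten only when both boundaries are backward;
    anything else is returned unchanged.
    """
    snarl = trav["snarl"]
    if not (_is_backward(snarl["start"]) and _is_backward(snarl["end"])):
        return trav
    new_snarl = dict(snarl)
    new_snarl["start"] = _flip(snarl["end"])
    new_snarl["end"] = _flip(snarl["start"])
    return {
        # Contents go in reverse order, each read from the other strand.
        "visits": [_flip(v) for v in reversed(trav["visits"])],
        "snarl": new_snarl,
    }

example = {
    "visits": [{"node_id": 4240, "backward": True}],
    "snarl": {
        "start": {"node_id": 4241, "backward": True},
        "end": {"node_id": 21874, "backward": True},
    },
}
print(forward_traversal(example))
```

For the example record, the result enters the snarl at node 21874, visits node 4240 in the forward direction, and leaves at node 4241, so the allele reads as "A" on the forward strand.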

The orientations will matter for paths / SnarlTraversals when calling variants (as that will be coming from your reads, and they'd represent different things in forward/reverse).