vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

represent subpaths and path to path mappings #1559

Open ekg opened 6 years ago

ekg commented 6 years ago

vg has a problem representing subpaths. If we extract a subgraph and the subpaths that overlap it, a path may come back in several fragments, and we have no way to tell that those fragments belong to the same path.

This is also an issue for GFA, because the P namespace model requires us to fully reify the path, and it doesn't look like we can resolve it there. However, for vg it should be possible to develop a data model that makes it clear that a given path derives from another path.

The solution seems to be to add a property to the Path object that explains 1) which path we are a subpath of and 2) how this path maps to the path it is sub-pathing. In the true subpath pattern we would have only a start position and a length, but we might also want to consider path-to-path mappings with differences. These can be represented as vg::Path objects, using the name property of the Positions of the path Mappings to indicate the base path rather than the node_ids in the graph.
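A minimal sketch of the "true subpath" case described above, in Python rather than vg's own code. `SubpathRef` and `order_fragments` are hypothetical names used only for illustration; nothing here is part of the vg API. It assumes each extracted fragment carries the name of its base path, a 0-based start offset in bases, and a length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubpathRef:
    """Hypothetical annotation: where a path fragment sits on its base path."""
    base_path: str  # name of the path this fragment derives from
    start: int      # 0-based offset into the base path, in bases (assumption)
    length: int     # number of bases the fragment covers


def order_fragments(fragments):
    """Group fragments by base path and sort each group by start offset."""
    grouped = {}
    for frag in fragments:
        grouped.setdefault(frag.base_path, []).append(frag)
    return {
        name: sorted(frags, key=lambda f: f.start)
        for name, frags in grouped.items()
    }


# Two fragments of the same base path, received out of order.
frags = [SubpathRef("full_path", 7, 4), SubpathRef("full_path", 0, 4)]
ordered = order_fragments(frags)
```

With this annotation, fragments that come out of a subgraph extraction can be regrouped and ordered by base path instead of being treated as unrelated paths. Path-to-path mappings with differences would need more than a start and a length and are not covered by this sketch.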

I am worried we are mixing several things here but also sensitive to providing a minimal solution to this problem that meets both needs.

JervenBolleman commented 6 years ago

As this is relevant for the RDF discussion as well, I am adding some more comments.

Original graph with full path.

AAAA-CTG-TATA

Subgraph/path

CTG

In RDF, the full path would be:

step:1 :path path:full_path ; :node node:AAAA .
step:2 :path path:full_path ; :node node:CTG .
step:3 :path path:full_path ; :node node:TATA .

Subpath would normally just be

step:1 :path path:sub_path ; :node node:CTG .

But in this case there is no link to say that the subpath is a "subpath" of the full path. So here we would need to add new predicates. For example, the following might work:

step:1 :path path:sub_path ; :node node:CTG .

path:sub_path a :Subpath ; :subPathOf path:full_path ; :offset 4 .

(Note: string indexes in RDF/SPARQL are 1-based, so we go along with this for ease of use.)

In this snippet we say that "sub_path" is a :subPathOf "full_path" and that it starts after 4 nucleotides of the full path have been seen.

This is enough information for us to correctly insert the subpath back into the graph, and would be a necessary step towards efficiently describing both SNP and structural variations in VG RDF.
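As a rough illustration of how the offset locates the subpath, here is a small Python sketch using the example above. The function names are hypothetical, and it assumes the offset counts the nucleotides of the full path seen before the subpath starts, and that the subpath begins on a node boundary.

```python
full_path = ["AAAA", "CTG", "TATA"]  # node sequences along path:full_path
sub_path = ["CTG"]                   # node sequences along path:sub_path
offset = 4                           # nucleotides seen before sub_path starts


def step_index_at(path_nodes, offset):
    """Return the index of the step that starts exactly `offset` bases in."""
    seen = 0
    for i, seq in enumerate(path_nodes):
        if seen == offset:
            return i
        seen += len(seq)
    raise ValueError(f"no step starts at offset {offset}")


def is_subpath_at(full_nodes, sub_nodes, offset):
    """Check that sub_nodes matches full_nodes starting at the given offset."""
    i = step_index_at(full_nodes, offset)
    return full_nodes[i:i + len(sub_nodes)] == sub_nodes


found = step_index_at(full_path, offset)
matches = is_subpath_at(full_path, sub_path, offset)
```

Here `found` is the position of the CTG step in the full path, which is the information needed to put the subpath back in place. An offset that falls inside a node raises an error in this sketch; handling that case would need a per-base offset within the node as well.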