Open ekg opened 6 years ago
As this is relevant to the RDF discussion as well, I am adding some more comments.
Original graph with full path.
AAAA-CTG-TATA
Subgraph/path
CTG
In RDF the full path would be:
step:1 :path path:full_path ; :node node:AAAA .
step:2 :path path:full_path ; :node node:CTG .
step:3 :path path:full_path ; :node node:TATA .
The subpath would normally just be:
step:1 :path path:sub_path ; :node node:CTG .
But in this case there is no link to say that the subpath is a "subpath" of the full path. So here we would need to add new predicates; for example, the following might work:
step:1 :path path:sub_path ; :node node:CTG .
path:sub_path a :Subpath ; :subPathOf path:full_path ; :offset 4 .
(Note: strings in RDF/SPARQL are 1-indexed, so we go along with this for ease of use.)
In this statement we say that the "sub_path" is a :subPathOf the "full_path" and that it starts after having seen 4 nucleotides in the full path.
This is enough information for us to correctly insert the subpath back into the graph, and would be a necessary step towards efficiently describing both SNPs and structural variations in VG RDF.
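To make the offset idea concrete, here is a minimal sketch of how the subpath could be located in the full path again. It is not vg or RDF tooling: the list of (rank, sequence) tuples is a hypothetical in-memory stand-in for the step triples above, and `locate_subpath` is a name made up for illustration. It assumes the offset counts the nucleotides seen before the subpath starts, as in the `:offset 4` example.

```python
# Hypothetical stand-in for the full_path steps: (step rank, node sequence).
full_path = [(1, "AAAA"), (2, "CTG"), (3, "TATA")]


def locate_subpath(full_steps, offset, sub_sequence):
    """Return (first_rank, last_rank) of the full-path steps the subpath covers.

    `offset` is the number of nucleotides seen in the full path before the
    subpath starts (an assumption taken from the `:offset 4` example).
    """
    if not sub_sequence:
        raise ValueError("subpath sequence must not be empty")
    ordered = sorted(full_steps)
    full_sequence = "".join(seq for _, seq in ordered)
    end = offset + len(sub_sequence)
    # Check that the subpath really spells the same bases at this offset.
    if full_sequence[offset:end] != sub_sequence:
        raise ValueError("subpath does not match the full path at this offset")
    covered = []
    position = 0
    for rank, seq in ordered:
        node_start, node_end = position, position + len(seq)
        # Keep every step whose interval overlaps [offset, end).
        if node_start < end and node_end > offset:
            covered.append(rank)
        position = node_end
    return covered[0], covered[-1]


first, last = locate_subpath(full_path, 4, "CTG")
print(first, last)
```

For the example graph the subpath lands entirely on the second step, which is the node CTG.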
vg has a problem representing subpaths. If we extract a subgraph and the subpaths that overlap it, we have no way of knowing when a single path has been returned in several fragments.
This is also an issue for GFA thanks to the P namespace model that requires us to fully reify the path, and it doesn't look like we can resolve it there. However, for vg it should be possible to develop a data model that makes it clear that a given path derives from another path.
The solution seems to be to add a property to the Path object that explains 1) which path we are a subpath of and 2) how this path maps to the path it is sub-pathing. In the true subpath pattern we would have only a start position and a length, but we might also want to consider path-to-path mappings with differences. These can be represented as vg::Path objects, using the name property of the Positions of the path Mappings to indicate the base path rather than the node_ids in the graph.
I am worried we are mixing several things here, but I am also sensitive to providing a minimal solution to this problem that meets both needs.
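As a rough illustration of the "true subpath" case only (a start position and a length), something like the following could work. This is a sketch in plain Python, not the vg protobuf schema: `SubpathOf`, `PathRecord`, and `group_fragments` are hypothetical names, and the start is assumed to be a 0-based nucleotide offset into the base path.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SubpathOf:
    """Hypothetical property: which path we derive from and where."""
    base_path: str  # name of the path this one is a subpath of
    start: int      # assumed 0-based nucleotide offset into the base path
    length: int     # number of nucleotides covered


@dataclass
class PathRecord:
    """Hypothetical stand-in for a path; not the real vg::Path message."""
    name: str
    subpath_of: Optional[SubpathOf] = None


def group_fragments(paths: List[PathRecord]) -> Dict[str, List[PathRecord]]:
    """Group fragments by base path, ordered by where they start on it."""
    groups: Dict[str, List[PathRecord]] = {}
    for path in paths:
        if path.subpath_of is not None:
            groups.setdefault(path.subpath_of.base_path, []).append(path)
    for fragments in groups.values():
        fragments.sort(key=lambda p: p.subpath_of.start)
    return groups


extracted = [
    PathRecord("frag_b", SubpathOf("full_path", start=7, length=4)),
    PathRecord("frag_a", SubpathOf("full_path", start=0, length=4)),
    PathRecord("unrelated"),
]
groups = group_fragments(extracted)
print([p.name for p in groups["full_path"]])
```

With a property like this on each extracted path, fragments that come back separately can at least be recognised as pieces of the same base path and put in order. The path-to-path mapping with differences would need more than a start and a length, so it is left out here.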