Open JervenBolleman opened 6 years ago
@JervenBolleman Thanks for your comment in #1557, I have understood that RDF should have fully embedded paths and nodes. But, I think the usefulness of describing variants in an easy way remains. I am not sure how to connect VG RDF with the existing mutation dataset which is represented as the "difference" between the variant and a reference: has it already been discussed? Moreover, I am interested in whether each variant on the embedded path can be annotated if the path includes many variants. A diff-style representation of variants like V tag or transition is one solution but the inter-chromosomal variants are hard to describe. In any case, I think we should discuss with some examples.
Possible solution using the translation/diff concept, continues the example from above. Here it is a small change matching a whole node.
path:patient_1 :translation [
:source path:reference ;
:change [ :remove step:reference-2 ;
:insert step:patient-2 ;
:before step:reference-3 ;
:after step:reference-1 ;
]
]
I think this is the minimal example, where there is one change that removes a step and reinserts another one. It does not need to be step which is inserted here but it can be another :Path
. A single :change
can remove more than one :step
. As an aside the :translation
and :change
are blank nodes and I am not sure if these should ever be identifiable outside of the document context.
Regarding the really big variants, I think life is probably easiest when fully inserting those into the VG and have them as whole "paths".
My thought is that small variants such as presentable as VCF benefit from an RDF notation. Working with large structural variation benefits too much from VG style embedding and gBWT that I don't think it makes much sense to have them as translations but should just be full paths when exposed as RDF. Tradeoff is simplicity of queries to demands on storage. Storage demands can be hacked, but difficult queries will hurt for ever.
I generally agree with the idea that variations should be embedded in paths finally though there are some problems such as how to determine "full length of the path of SV" for real data. So, we should clarify the role of :translation
and :path
to prevent from confusing which to use.
This is just an idea and I am not sure it is good or bad but I wonder if the :translation
can be represented with a kind of SNP ontology and connect with the existing SNP databases using it.
The minimal example looks nice. Small insertion, deletion, and polymorphism matching a whole node can be represented with :translation.
Next, we should take more difficult pattern into consideration. When it comes to the difference with partial nodes, we want to have edits
for nodes. If we want to describe inversion, the is_reverse
flag is needed. Both are represented as the mapping of the path in VG.
Since diff
is the relationship between two paths, it may be a natural way to have the two corresponding partial paths source and patient as the translation.
cf. VG's translation is defined as a pair of Path
s.
// Translations map from one graph to another.
// A collection of these provides a covering mapping between a from and to graph.
// If each "from" path through the base graph corresponds to a "to" path in an updated graph,
// then we can use these translations to project positions, mappings, and paths in the new
// graph into the old one using the Translator interface.
message Translation {
Path from = 1;
Path to = 2;
}
The power of expression of translation in VG seems enough general to represent diff-styled variant. If so, how about to represent the :translation
as a pair of :path
s? otherwise the implicit representation of paths using :translation
is restricted.
The big problem is that simplicity of syntax or query may be lost.
For vg edits of nodes, they should be represented as removing one node and inserting a path with multiple new nodes. This of course leaves the old node and the existing path unchanged, if someone wants to actually mutate the graph in RDF then SPARQL INSERT/DELETE is the technical solution. I think the VG as RDF should be a representation layer of the variation graph for data integration and visualization not for actually modifying the variation graph data.
@6br do you have a small vg example with an inversion that I can translate by hand?
My understanding is that :translation
describes the new path (with small variants) without modifying any existing graph structure. Is it right?
I am not sure if vg map
or relevant commands handle the inversion with reversed nodes, but ideally, I think the inversion should be described as below:
cf. inversion, the hand-made example of translation:
{
"from": {
"mapping": [
{
"position": {
"node_id": 126
},
"edit": [
{
"from_length": 26,
"to_length": 26,
}
]
}
]
},
"to": {
"mapping": [
{
"position": {
"node_id": 126,
"is_reverse": true
},
"edit": [
{
"from_length": 26,
"to_length": 26
}
]
}
]
}
}
In this case, the node id 126
, whose length is 26, is inverted in the patient genome.
This is also a small hand-made example visualized by tubemap.
Several nodes are flagged is_reverse
in the patient path.
"path": [{"name": "chrX", "mapping": [{"position": {"node_id": 2889127}, "rank": 6033}, {"position": {"node_id": 40154}, "rank": 6034}, {"position": {"node_id": 40155}, "rank": 6035}, {"position": {"node_id": 40156}, "rank": 6036}, {"position": {"node_id": 40157}, "rank": 6037}, {"position": {"node_id": 40158}, "rank": 6038}, {"position": {"node_id": 2889128}, "rank": 6039}, {"position": {"node_id": 2889129}, "rank": 6040}]}, {"name": "patient_chrX", "mapping": [{"position": {"node_id": 2889127}, "rank": 101}, {"position": {"node_id": 40158, "is_reverse": true}, "rank": 102}, {"position": {"node_id": 40157, "is_reverse": true}, "rank": 103}, {"position": {"node_id": 40156, "is_reverse": true}, "rank": 104}, {"position": {"node_id": 40155, "is_reverse": true}, "rank": 105}, {"position": {"node_id": 40154, "is_reverse": true}, "rank": 106}, {"position": {"node_id": 2889128}, "rank": 107}, {"position": {"node_id": 2889129}, "rank": 108}]}]}
The key RDF relation is vg:reverseOfNode
, only translating the patient path to show what it looks like. First as an independent path and then as an translation. Again prefixes and nodes are not as short as possible but selected for readability. What is called an the step in the rdf is called an position in the json. In VG RDF the "is_reverse": true
relation is encoded in the predicate as that makes some query varieties easier and cheaper.
"path": [{"name": "patient_chrX", "mapping": [
{"position": {"node_id": 2889127}, "rank": 101},
{"position": {"node_id": 40158, "is_reverse": true}, "rank": 102},
{"position": {"node_id": 40157, "is_reverse": true}, "rank": 103},
{"position": {"node_id": 40156, "is_reverse": true}, "rank": 104},
{"position": {"node_id": 40155, "is_reverse": true}, "rank": 105},
{"position": {"node_id": 40154, "is_reverse": true}, "rank": 106},
{"position": {"node_id": 2889128}, "rank": 107},
{"position": {"node_id": 2889129}, "rank": 108}]}]
step:patient_chrX_101 :rank 101 ; :node node:2889127 .
step:patient_chrX_102 :rank 102 ; :reverseOfNode node:40158 .
step:patient_chrX_103 :rank 103 ; :reverseOfNode node:40157 .
step:patient_chrX_104 :rank 104 ; :reverseOfNode node:40156 .
step:patient_chrX_105 :rank 105 ; :reverseOfNode node:40155 .
step:patient_chrX_106 :rank 106 ; :reverseOfNode node:40154 .
step:patient_chrX_107 :rank 107 ; :node node:2889128 .
step:patient_chrX_108 :rank 108 ; :node node:2889129 .
Describing the same patient map as a diff works as well.
path:patient_chrX :translation [
:source path:chrX ;
:change [ :remove step:chrX-102 ;
:insert step:patient-chrX_102 ;
:after step:chrX_101 ;
] , [ :remove step:chrX-103;
:insert step:patient-chrX_103 ;
:after step:patient_chrX_102 ;
] , [ :remove step:chrX-104;
:insert step:patient-chrX_104 ;
:after step:patient_chrX_103 ;
] , [ :remove step:chrX-104;
:insert step:patient-chrX_104 ;
:after step:patient_chrX_103 ;
] , [ :remove step:chrX-105;
:insert step:patient-chrX_105 ;
:after step:patient_chrX_105 ;
] , [ :remove step:chrX-106;
:insert step:patient-chrX_106 ;
:after step:patient_chrX_107 ;
:before step:chrX_107 ;
]
] .
step:patient_chrX_102 :rank 102 ; :reverseOfNode node:40158 .
step:patient_chrX_103 :rank 103 ; :reverseOfNode node:40157 .
step:patient_chrX_104 :rank 104 ; :reverseOfNode node:40156 .
step:patient_chrX_105 :rank 105 ; :reverseOfNode node:40155 .
step:patient_chrX_106 :rank 106 ; :reverseOfNode node:40154 .
This can be shortened if desired and expressed more compactly. Looking at it it might make more sense to have deletions and insertions directly instead of change objects.
In which case it looks like this.
path:patient_chrX :translation [
:source path:chrX ;
:remove step:chrX-102 , step:chrX-103 , step:chrX-104, step:chrX-104 , step:chrX-105 ,step:chrX-106 , step:chrX-107;
:insert step:patient_chrX_102 :rank 102 ,
step:patient_chrX_103 ,
step:patient_chrX_104 ,
step:patient_chrX_105 ,
step:patient_chrX_106 ,
] .
step:patient_chrX_102 :rank 102 ; :reverseOfNode node:40158 .
step:patient_chrX_103 :rank 103 ; :reverseOfNode node:40157 .
step:patient_chrX_104 :rank 104 ; :reverseOfNode node:40156 .
step:patient_chrX_105 :rank 105 ; :reverseOfNode node:40155 .
step:patient_chrX_106 :rank 106 ; :reverseOfNode node:40154 .
This of course assumes ranks align. While in VG they are integers, in VG RDF they can be floats so as long as order is preserved it is fine. e.g. if inserting 2 nodes and removing 1 at rank 101 the new steps at rank of 101.3
and 101.6
can be inserted
Sure, it looks enough to represent inversion.
Next, I would like to confirm how to represent mutations matching a partial node.
$ vg construct -r test/tiny/tiny.fa > test/tiny/tiny_normal.vg
$ vg index -x s.xg -k 16 -s s.gcsa test/tiny/tiny_normal.vg
$ vg map -d s -s CAAATAAGGCTTGGTAATTTTCTGGAGTTCTATTATCTTCCAACTCTCTG > s.gam
$ vg mod -i s.gam -Z s.trans test/tiny/tiny_normal.vg > test/tiny/tiny_modified.vg
$ vg view -Z -j s.trans
The original tiny_normal.vg
is below. This is composed of one node.
{"node": [{"sequence": "CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG", "id": 1}],
"path": [{"name": "x",
"mapping": [{"position": {"node_id": 1}, "edit": [{"from_length": 50, "to_length": 50}],
"rank": 1}]}]}
"CAAATAAGGCTTGGTAATTTTCTGGAGTTCTATTATCTTCCAACTCTCTG" contains two mutation (A->T and A->C). tiny_modified.vg
, which is added mutations by patient genome, is below.
{"node": [{"sequence": "CAAATAAGGCTTGG", "id": 2}, {"sequence": "A", "id": 4}, {"sequence": "T", "id": 10}, {"sequence": "AATTTTCTGGAGTTCTATTAT", "id": 6}, {"sequence": "A", "id": 8}, {"sequence": "C", "id": 11}, {"sequence": "TTCCAACTCTCTG", "id": 9}], "edge": [{"from": 2, "to": 4}, {"from": 2, "to": 10}, {"from": 4, "to": 6}, {"from": 10, "to": 6}, {"from": 6, "to": 8}, {"from": 6, "to": 11}, {"from": 8, "to": 9}, {"from": 11, "to": 9}], "path": [{"mapping": [{"position": {"node_id": 2}, "edit": [{"from_length": 14, "to_length": 14}], "rank": 1}, {"position": {"node_id": 10}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 6}, "edit": [{"from_length": 21, "to_length": 21}], "rank": 3}, {"position": {"node_id": 11}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 9}, "edit": [{"from_length": 13, "to_length": 13}], "rank": 5}]}, {"name": "x", "mapping": [{"position": {"node_id": 2}, "edit": [{"from_length": 14, "to_length": 14}], "rank": 1}, {"position": {"node_id": 4}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 6}, "edit": [{"from_length": 21, "to_length": 21}], "rank": 3}, {"position": {"node_id": 8}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 9}, "edit": [{"from_length": 13, "to_length": 13}], "rank": 5}]}]}
In this case, how should we write the :translation yielded by "CAAATAAGGCTTGGTAATTTTCTGGAGTTCTATTATCTTCCAACTCTCTG"?
Possibly, one solution is following:
path:patient_chrX :translation [
:source path:chrX ;
:remove step:chrX_1;
:insert step:patient_chrX_1, step:patient_chrX_2 , step:patient_chrX_3, step:patient_chrX_4, step:patient_chrX_5
] .
step:chrX_1 :rank 1 ; node:1 .
step:patient_chrX_1 :rank 1 ; node:2 .
step:patient_chrX_2 :rank 2 ; node:10 .
step:patient_chrX_3 :rank 3 ; node:6 .
step:patient_chrX_4 :rank 4 ; node:11 .
step:patient_chrX_5 :rank 5 ; node:9 .
But, it seems to lost the information that node 2, 6, and 9 are a substring of original node 1 and both node 10 and 11 are SNPs. Is it acceptable?
The VG's translation(s.trans) is below:
{"from": {"mapping": [{"position": {"node_id": 1},
"edit": [{"from_length": 14, "to_length": 14}]}]},
"to": {"mapping": [{"position": {"node_id": 2},
"edit": [{"from_length": 14, "to_length": 14}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 14},
"edit": [{"from_length": 1, "to_length": 1}]}]},
"to": {"mapping": [{"position": {"node_id": 4},
"edit": [{"from_length": 1, "to_length": 1}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 14},
"edit": [{"from_length": 1, "to_length": 1, "sequence": "T"}]}, {"position": {"node_id": 1, "offset": 15}}]},
"to": {"mapping": [{"position": {"node_id": 10},
"edit": [{"from_length": 1, "to_length": 1}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 15},
"edit": [{"from_length": 21, "to_length": 21}]}]},
"to": {"mapping": [{"position": {"node_id": 6},
"edit": [{"from_length": 21, "to_length": 21}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 36},
"edit": [{"from_length": 1, "to_length": 1}]}]},
"to": {"mapping": [{"position": {"node_id": 8},
"edit": [{"from_length": 1, "to_length": 1}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 36},
"edit": [{"from_length": 1, "to_length": 1, "sequence": "C"}]}, {"position": {"node_id": 1, "offset": 37}}]},
"to": {"mapping": [{"position": {"node_id": 11},
"edit": [{"from_length": 1, "to_length": 1}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 37},
"edit": [{"from_length": 13, "to_length": 13}]}]},
"to": {"mapping": [{"position": {"node_id": 9},
"edit": [{"from_length": 13, "to_length": 13}]}]}}
An offset and edits of the position(:step) are informative for describing subpath or substring of a node but introducing them is complicated.
@6br you raised a very good question and I personally don't have a good answer, I hope you can help think of one!
VG overtime does not store node relations, so while edits like this work on one graph they might not work on the graph if it is edited. Due to node normalization etc... re-embedding. I made a proposal about storing edit history in node ids but there are number of good reasons why that is a not a great idea. On the other hand the lack of deterministic or stable node-ids makes working with subgraphs outside of the local vg graph context very risky.
Which gets back to the issue that VG graphs are not stable enough for patching to work as is. A translation object is only guaranteed to work inside the local VG graph space. Any change in the local VG graph space destroys the information that is required to guarantee that such a translation applies. (e.g. node 1 might be renamed to node 5 and the translation no longer applies)
VG RDF strongly prefers fully embedded VG graphs, as substring handling in queries is an annoyance. Yet there is no way to say that a node is related to another node in edit space. And more importantly wile such relations can be made in VG RDF, VG itself won't preserve such relations.
I personally think (but would not be surprised to be very wrong here) that VG itself should drop the edit concept and just have nodes with relations.
In the end this feels like a conflict in goals between a global graph in VG RDF and a local graph in VG protobuf and I don't know how this should be resolved.
I write down my thinking and understanding.
The problem can be divided into two: to track (1) a parent‐child relationship of nodes and (2) a brother relationship of nodes. Mutations matching a partial node require to handle both problem.
ex.
node_id: 1 --- split ---> 101, 102, 103 -- a reference path
sequence: ACGTCT + ACGACT ACG - T - CT
\ /
A (104) -- a patient path
(1): 101-103 is the children of 1. But it is hard to represent the heredity relations of nodes between modified graphs since node id is not consistent and VG does not store node relations. To handle this problem: the First idea is to modify VG protobuf to manage relations of nodes. The second one is to use path or subpath as a consistent identifier. But both ideas have pros and cons.
(2): C is SNP of G in ATTGGG. If we want to describe the difference between the reference path and the patient path without splitting nodes, we have to introduce edits
of paths. It seems not preferred to represent edits or CIGARs in VG RDF because substring handling in queries is an annoyance. So, edits
should be fully embedded like 102 and 104. :translation is one solution but it may not work on edited graphs since nodes are not consistent.
Given the need to also represent structural variation RDF as a patch to the whole graph in e.g. JSON-LD. For pragmatic reasons I assume everyone knows GFA.
Bad ASCII art
We have 3 paths. and 5 nodes.
However, this is when the variation is already contained in the VG. GFA is most likely used as a "patch" onto a VG, much like a GAM.
As GFA V only introduces new nodes, and not paths (linear genomes) assume the variants come from two patients in RDF we then add these paths. i.e. we go from one reference path to three with the addition of the two patient paths.
But again we are in the VG, it is not a patch description adding new information on an existing VG.