vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 195 forks source link

Discussion: How to represent variation and haplotypes in VG RDF #1555

Open JervenBolleman opened 6 years ago

JervenBolleman commented 6 years ago

Given the need to also represent structural variation RDF as a patch to the whole graph in e.g. JSON-LD. For pragmatic reasons I assume everyone knows GFA.

S    1    ACGTCT 
V    1    2    3    T      -> ACTT
V    1    2    3    AAA    -> ACAAAT

Bad ASCII art

  AAA
 /   \
A-CG--TCT
 \_T_/

We have 3 paths. and 5 nodes.

#Reference genome path (9 triples)
step:reference-1 path:reference ; rank 1 ; :node node:A .
step:reference-2 path:reference ; rank 2 ; :node node:CG .
step:reference-3 path:reference ; rank 3 ; :node node:TCT .

node:A rdf:value "A" ;
    :linksForwardToForward node:CG , node:T, node:AAA .
node:T rdf:value "T" ;
    :linksForwardToForward node:TCT .
node:AAA rdf:value "AAA" ;
    :linksForwardToForward node:TCT .
node:TCT rdf:value "TCT"  ;
    :linksForwardToForward node:TCT .

However, this is when the variation is already contained in the VG. GFA is most likely used as a "patch" onto a VG, much like a GAM.

As GFA V only introduces new nodes, and not paths (linear genomes) assume the variants come from two patients in RDF we then add these paths. i.e. we go from one reference path to three with the addition of the two patient paths.

#Patient 1 path (9 triples)
step:patient1-1 path:patient_1 ; rank 1 ; :node node:A .
step:patient1-2 path:patient_1 ; rank 2 ; :node node:T .
step:patient1-3 path:patient_1 ; rank 3 ; :node node:TCT .

#Patient 2 path (9 triples)
step:patient2-1 path:patient_2 ; rank 1 ; :node node:A .
step:patient2-2 path:patient_2 ; rank 2 ; :node node:AAA .
step:patient2-3 path:patient_2 ; rank 3 ; :node node:TCT .

But again we are in the VG, it is not a patch description adding new information on an existing VG.

6br commented 6 years ago

@JervenBolleman Thanks for your comment in #1557, I have understood that RDF should have fully embedded paths and nodes. But, I think the usefulness of describing variants in an easy way remains. I am not sure how to connect VG RDF with the existing mutation dataset which is represented as the "difference" between the variant and a reference: has it already been discussed? Moreover, I am interested in whether each variant on the embedded path can be annotated if the path includes many variants. A diff-style representation of variants like V tag or transition is one solution but the inter-chromosomal variants are hard to describe. In any case, I think we should discuss with some examples.

JervenBolleman commented 6 years ago

Possible solution using the translation/diff concept, continues the example from above. Here it is a small change matching a whole node.

path:patient_1 :translation [
  :source path:reference ;
  :change [ :remove step:reference-2 ; 
                  :insert step:patient-2 ;
                  :before step:reference-3 ;
                  :after step:reference-1 ;
  ]
]

I think this is the minimal example, where there is one change that removes a step and reinserts another one. It does not need to be step which is inserted here but it can be another :Path. A single :change can remove more than one :step. As an aside the :translation and :change are blank nodes and I am not sure if these should ever be identifiable outside of the document context.

Regarding the really big variants, I think life is probably easiest when fully inserting those into the VG and have them as whole "paths".

My thought is that small variants such as presentable as VCF benefit from an RDF notation. Working with large structural variation benefits too much from VG style embedding and gBWT that I don't think it makes much sense to have them as translations but should just be full paths when exposed as RDF. Tradeoff is simplicity of queries to demands on storage. Storage demands can be hacked, but difficult queries will hurt for ever.

6br commented 6 years ago

I generally agree with the idea that variations should be embedded in paths finally though there are some problems such as how to determine "full length of the path of SV" for real data. So, we should clarify the role of :translation and :path to prevent from confusing which to use. This is just an idea and I am not sure it is good or bad but I wonder if the :translation can be represented with a kind of SNP ontology and connect with the existing SNP databases using it.

6br commented 6 years ago

The minimal example looks nice. Small insertion, deletion, and polymorphism matching a whole node can be represented with :translation.

Next, we should take more difficult pattern into consideration. When it comes to the difference with partial nodes, we want to have edits for nodes. If we want to describe inversion, the is_reverse flag is needed. Both are represented as the mapping of the path in VG.

Since diff is the relationship between two paths, it may be a natural way to have the two corresponding partial paths source and patient as the translation.

cf. VG's translation is defined as a pair of Path s.

// Translations map from one graph to another.
// A collection of these provides a covering mapping between a from and to graph.
// If each "from" path through the base graph corresponds to a "to" path in an updated graph,
// then we can use these translations to project positions, mappings, and paths in the new
// graph into the old one using the Translator interface.
message Translation {
    Path from = 1;
    Path to = 2;
}

The power of expression of translation in VG seems enough general to represent diff-styled variant. If so, how about to represent the :translation as a pair of :paths? otherwise the implicit representation of paths using :translation is restricted.

The big problem is that simplicity of syntax or query may be lost.

JervenBolleman commented 6 years ago

For vg edits of nodes, they should be represented as removing one node and inserting a path with multiple new nodes. This of course leaves the old node and the existing path unchanged, if someone wants to actually mutate the graph in RDF then SPARQL INSERT/DELETE is the technical solution. I think the VG as RDF should be a representation layer of the variation graph for data integration and visualization not for actually modifying the variation graph data.

@6br do you have a small vg example with an inversion that I can translate by hand?

6br commented 6 years ago

My understanding is that :translation describes the new path (with small variants) without modifying any existing graph structure. Is it right?

I am not sure if vg map or relevant commands handle the inversion with reversed nodes, but ideally, I think the inversion should be described as below:

cf. inversion, the hand-made example of translation:

{
  "from": {
    "mapping": [
      {
        "position": {
          "node_id": 126
        },
        "edit": [
          {
            "from_length": 26,
            "to_length": 26,
          }
        ]
      }
    ]
  },
  "to": {
    "mapping": [
      {
        "position": {
          "node_id": 126,
          "is_reverse": true
        },
        "edit": [
          {
            "from_length": 26,
            "to_length": 26
          }
        ]
      }
    ]
  }
}

In this case, the node id 126, whose length is 26, is inverted in the patient genome.

6br commented 6 years ago

This is also a small hand-made example visualized by tubemap. image

Several nodes are flagged is_reverse in the patient path.

"path": [{"name": "chrX", "mapping": [{"position": {"node_id": 2889127}, "rank": 6033}, {"position": {"node_id": 40154}, "rank": 6034}, {"position": {"node_id": 40155}, "rank": 6035}, {"position": {"node_id": 40156}, "rank": 6036}, {"position": {"node_id": 40157}, "rank": 6037}, {"position": {"node_id": 40158}, "rank": 6038}, {"position": {"node_id": 2889128}, "rank": 6039}, {"position": {"node_id": 2889129}, "rank": 6040}]}, {"name": "patient_chrX", "mapping": [{"position": {"node_id": 2889127}, "rank": 101}, {"position": {"node_id": 40158, "is_reverse": true}, "rank": 102}, {"position": {"node_id": 40157, "is_reverse": true}, "rank": 103}, {"position": {"node_id": 40156, "is_reverse": true}, "rank": 104}, {"position": {"node_id": 40155, "is_reverse": true}, "rank": 105}, {"position": {"node_id": 40154, "is_reverse": true}, "rank": 106}, {"position": {"node_id": 2889128}, "rank": 107}, {"position": {"node_id": 2889129}, "rank": 108}]}]}
JervenBolleman commented 6 years ago

The key RDF relation is vg:reverseOfNode, only translating the patient path to show what it looks like. First as an independent path and then as an translation. Again prefixes and nodes are not as short as possible but selected for readability. What is called an the step in the rdf is called an position in the json. In VG RDF the "is_reverse": true relation is encoded in the predicate as that makes some query varieties easier and cheaper.

"path": [{"name": "patient_chrX", "mapping": [
{"position": {"node_id": 2889127}, "rank": 101}, 
{"position": {"node_id": 40158, "is_reverse": true}, "rank": 102}, 
{"position": {"node_id": 40157, "is_reverse": true}, "rank": 103}, 
{"position": {"node_id": 40156, "is_reverse": true}, "rank": 104}, 
{"position": {"node_id": 40155, "is_reverse": true}, "rank": 105}, 
{"position": {"node_id": 40154, "is_reverse": true}, "rank": 106}, 
{"position": {"node_id": 2889128}, "rank": 107}, 
{"position": {"node_id": 2889129}, "rank": 108}]}]
step:patient_chrX_101 :rank 101 ; :node node:2889127 .
step:patient_chrX_102 :rank 102 ; :reverseOfNode node:40158 .
step:patient_chrX_103 :rank 103 ; :reverseOfNode node:40157 .
step:patient_chrX_104 :rank 104 ; :reverseOfNode node:40156 .
step:patient_chrX_105 :rank 105 ; :reverseOfNode node:40155 .
step:patient_chrX_106 :rank 106 ; :reverseOfNode node:40154 .
step:patient_chrX_107 :rank 107 ; :node node:2889128 .
step:patient_chrX_108 :rank 108 ; :node node:2889129 .

Describing the same patient map as a diff works as well.

path:patient_chrX :translation [
  :source path:chrX ;
  :change [ :remove step:chrX-102 ; 
                  :insert step:patient-chrX_102 ;
                  :after step:chrX_101 ;
  ] ,     [ :remove step:chrX-103; 
                  :insert step:patient-chrX_103 ;
                  :after step:patient_chrX_102 ;
  ] ,     [ :remove step:chrX-104; 
                  :insert step:patient-chrX_104 ;
                  :after step:patient_chrX_103 ;
  ] ,     [ :remove step:chrX-104; 
                  :insert step:patient-chrX_104 ;
                  :after step:patient_chrX_103 ;
  ] ,     [ :remove step:chrX-105; 
                  :insert step:patient-chrX_105 ;
                  :after step:patient_chrX_105 ;
  ] ,     [ :remove step:chrX-106; 
                  :insert step:patient-chrX_106 ;
                  :after step:patient_chrX_107 ;
                  :before step:chrX_107 ;
  ] 
] .

step:patient_chrX_102 :rank 102 ; :reverseOfNode node:40158 .
step:patient_chrX_103 :rank 103 ; :reverseOfNode node:40157 .
step:patient_chrX_104 :rank 104 ; :reverseOfNode node:40156 .
step:patient_chrX_105 :rank 105 ; :reverseOfNode node:40155 .
step:patient_chrX_106 :rank 106 ; :reverseOfNode node:40154 .

This can be shortened if desired and expressed more compactly. Looking at it it might make more sense to have deletions and insertions directly instead of change objects.

In which case it looks like this.

path:patient_chrX :translation [
  :source path:chrX ;
  :remove step:chrX-102 , step:chrX-103 , step:chrX-104, step:chrX-104 , step:chrX-105 ,step:chrX-106 ,  step:chrX-107; 
  :insert step:patient_chrX_102 :rank 102 ,
step:patient_chrX_103 ,
step:patient_chrX_104 ,
step:patient_chrX_105 ,
step:patient_chrX_106 ,
] .
step:patient_chrX_102 :rank 102 ; :reverseOfNode node:40158 .
step:patient_chrX_103 :rank 103 ; :reverseOfNode node:40157 .
step:patient_chrX_104 :rank 104 ; :reverseOfNode node:40156 .
step:patient_chrX_105 :rank 105 ; :reverseOfNode node:40155 .
step:patient_chrX_106 :rank 106 ; :reverseOfNode node:40154 .

This of course assumes ranks align. While in VG they are integers, in VG RDF they can be floats so as long as order is preserved it is fine. e.g. if inserting 2 nodes and removing 1 at rank 101 the new steps at rank of 101.3 and 101.6 can be inserted

6br commented 6 years ago

Sure, it looks enough to represent inversion.

6br commented 6 years ago

Next, I would like to confirm how to represent mutations matching a partial node.

$ vg construct -r test/tiny/tiny.fa > test/tiny/tiny_normal.vg
$ vg index -x s.xg -k 16 -s s.gcsa test/tiny/tiny_normal.vg
$ vg map -d s -s CAAATAAGGCTTGGTAATTTTCTGGAGTTCTATTATCTTCCAACTCTCTG > s.gam
$ vg mod -i s.gam -Z s.trans test/tiny/tiny_normal.vg > test/tiny/tiny_modified.vg
$ vg view -Z -j s.trans

The original tiny_normal.vg is below. This is composed of one node.

{"node": [{"sequence": "CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG", "id": 1}], 
"path": [{"name": "x", 
"mapping": [{"position": {"node_id": 1}, "edit": [{"from_length": 50, "to_length": 50}], 
"rank": 1}]}]}

"CAAATAAGGCTTGGTAATTTTCTGGAGTTCTATTATCTTCCAACTCTCTG" contains two mutation (A->T and A->C). tiny_modified.vg, which is added mutations by patient genome, is below.

{"node": [{"sequence": "CAAATAAGGCTTGG", "id": 2}, {"sequence": "A", "id": 4}, {"sequence": "T", "id": 10}, {"sequence": "AATTTTCTGGAGTTCTATTAT", "id": 6}, {"sequence": "A", "id": 8}, {"sequence": "C", "id": 11}, {"sequence": "TTCCAACTCTCTG", "id": 9}], "edge": [{"from": 2, "to": 4}, {"from": 2, "to": 10}, {"from": 4, "to": 6}, {"from": 10, "to": 6}, {"from": 6, "to": 8}, {"from": 6, "to": 11}, {"from": 8, "to": 9}, {"from": 11, "to": 9}], "path": [{"mapping": [{"position": {"node_id": 2}, "edit": [{"from_length": 14, "to_length": 14}], "rank": 1}, {"position": {"node_id": 10}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 6}, "edit": [{"from_length": 21, "to_length": 21}], "rank": 3}, {"position": {"node_id": 11}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 9}, "edit": [{"from_length": 13, "to_length": 13}], "rank": 5}]}, {"name": "x", "mapping": [{"position": {"node_id": 2}, "edit": [{"from_length": 14, "to_length": 14}], "rank": 1}, {"position": {"node_id": 4}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 6}, "edit": [{"from_length": 21, "to_length": 21}], "rank": 3}, {"position": {"node_id": 8}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 9}, "edit": [{"from_length": 13, "to_length": 13}], "rank": 5}]}]}

In this case, how should we write the :translation yielded by "CAAATAAGGCTTGGTAATTTTCTGGAGTTCTATTATCTTCCAACTCTCTG"?

Possibly, one solution is following:

path:patient_chrX :translation [
  :source path:chrX ;
  :remove step:chrX_1; 
  :insert step:patient_chrX_1, step:patient_chrX_2 , step:patient_chrX_3, step:patient_chrX_4, step:patient_chrX_5
] .
step:chrX_1 :rank 1 ; node:1 .
step:patient_chrX_1 :rank 1 ; node:2 .
step:patient_chrX_2 :rank 2 ; node:10 .
step:patient_chrX_3 :rank 3 ; node:6 .
step:patient_chrX_4 :rank 4 ; node:11 .
step:patient_chrX_5 :rank 5 ; node:9 .

But, it seems to lost the information that node 2, 6, and 9 are a substring of original node 1 and both node 10 and 11 are SNPs. Is it acceptable?

The VG's translation(s.trans) is below:

{"from": {"mapping": [{"position": {"node_id": 1}, 
    "edit": [{"from_length": 14, "to_length": 14}]}]}, 
  "to": {"mapping": [{"position": {"node_id": 2}, 
    "edit": [{"from_length": 14, "to_length": 14}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 14}, 
    "edit": [{"from_length": 1, "to_length": 1}]}]}, 
  "to": {"mapping": [{"position": {"node_id": 4}, 
    "edit": [{"from_length": 1, "to_length": 1}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 14}, 
    "edit": [{"from_length": 1, "to_length": 1, "sequence": "T"}]}, {"position": {"node_id": 1, "offset": 15}}]}, 
  "to": {"mapping": [{"position": {"node_id": 10}, 
    "edit": [{"from_length": 1, "to_length": 1}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 15}, 
    "edit": [{"from_length": 21, "to_length": 21}]}]}, 
  "to": {"mapping": [{"position": {"node_id": 6}, 
    "edit": [{"from_length": 21, "to_length": 21}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 36}, 
    "edit": [{"from_length": 1, "to_length": 1}]}]}, 
  "to": {"mapping": [{"position": {"node_id": 8}, 
    "edit": [{"from_length": 1, "to_length": 1}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 36}, 
    "edit": [{"from_length": 1, "to_length": 1, "sequence": "C"}]}, {"position": {"node_id": 1, "offset": 37}}]}, 
  "to": {"mapping": [{"position": {"node_id": 11}, 
    "edit": [{"from_length": 1, "to_length": 1}]}]}}
{"from": {"mapping": [{"position": {"node_id": 1, "offset": 37}, 
    "edit": [{"from_length": 13, "to_length": 13}]}]}, 
  "to": {"mapping": [{"position": {"node_id": 9}, 
    "edit": [{"from_length": 13, "to_length": 13}]}]}}

An offset and edits of the position(:step) are informative for describing subpath or substring of a node but introducing them is complicated.

JervenBolleman commented 6 years ago

@6br you raised a very good question and I personally don't have a good answer, I hope you can help think of one!

VG overtime does not store node relations, so while edits like this work on one graph they might not work on the graph if it is edited. Due to node normalization etc... re-embedding. I made a proposal about storing edit history in node ids but there are number of good reasons why that is a not a great idea. On the other hand the lack of deterministic or stable node-ids makes working with subgraphs outside of the local vg graph context very risky.

Which gets back to the issue that VG graphs are not stable enough for patching to work as is. A translation object is only guaranteed to work inside the local VG graph space. Any change in the local VG graph space destroys the information that is required to guarantee that such a translation applies. (e.g. node 1 might be renamed to node 5 and the translation no longer applies)

VG RDF strongly prefers fully embedded VG graphs, as substring handling in queries is an annoyance. Yet there is no way to say that a node is related to another node in edit space. And more importantly wile such relations can be made in VG RDF, VG itself won't preserve such relations.

I personally think (but would not be surprised to be very wrong here) that VG itself should drop the edit concept and just have nodes with relations.

In the end this feels like a conflict in goals between a global graph in VG RDF and a local graph in VG protobuf and I don't know how this should be resolved.

6br commented 6 years ago

I write down my thinking and understanding.

The problem can be divided into two: to track (1) a parent‐child relationship of nodes and (2) a brother relationship of nodes. Mutations matching a partial node require to handle both problem.

ex.

node_id: 1       --- split --->     101, 102, 103   -- a reference path
sequence:  ACGTCT +  ACGACT         ACG - T - CT
                                        \   /  
                                          A (104)   -- a patient path

(1): 101-103 is the children of 1. But it is hard to represent the heredity relations of nodes between modified graphs since node id is not consistent and VG does not store node relations. To handle this problem: the First idea is to modify VG protobuf to manage relations of nodes. The second one is to use path or subpath as a consistent identifier. But both ideas have pros and cons.

(2): C is SNP of G in ATTGGG. If we want to describe the difference between the reference path and the patient path without splitting nodes, we have to introduce edits of paths. It seems not preferred to represent edits or CIGARs in VG RDF because substring handling in queries is an annoyance. So, edits should be fully embedded like 102 and 104. :translation is one solution but it may not work on edited graphs since nodes are not consistent.

6br commented 5 years ago

https://github.com/vgteam/vg/wiki/VG-RDF:-proposal-for-representation-of-variation-on-VG-RDF