vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 195 forks source link

Inserting existing annotations on a genome 'path' into a vg graph #1381

Closed JervenBolleman closed 6 years ago

JervenBolleman commented 6 years ago

I would like to add existing genome annotations to a vg graph. The conceptual straightforward way is to see each genome annotation as a new path in the graph.

assume linear genome:a=> aaccaaggcc we have a gene:cc=>cc we know from existing work that this gene matches the first 'cc' pair in the linear path.

if we have vg graph like

aa-c-c-aagg-cc
  \t-a/    \gg

so we have genome:a 1,2,3,4,5 and genome:b 1,7,8,4,9 then our gene is a path of node 2 and node 3

Method wise is this something that can be easily done using existing vg commands? Especially introducing new nodes as required when adding new "gene" paths?

adamnovak commented 6 years ago

I think if you express your annotations as GAM alignments to the graph, you can use the -i option of vg mod to embed them, breaking any nodes.

What you would need then would be a way to convert an annotation specified in path coordinates into a mapping in node coordinates. I don't think we have anything that does that right now. You can use vg find -p {path name}:{start}-{end} to extract the nodes relevant to a given path region, but then you would need to use vg find -n {node} -P {path name} to nget the actual coordinate of each returned node in the path, and use your own code to synthesize a perfect-match alignment to the relevant nodes.

You might be able to get away with extracting the region relevant to the gene, mapping the gene's reference sequence against that region with vg align, and then embedding that alignment in the original graph with vg mod -i, if you assume that the gene's sequence will align the right way to the extracted graph region.

JervenBolleman commented 6 years ago

Is there some documentation for the GAM file available. My google/bing attempts are not getting me to any helpful pages.

subwaystation commented 6 years ago

If I may step in, the ag (annotation graph) code from Computomics offers a way to store annotation. That is each node can be annotated by a list of IDs. Each of these IDs correspond to a feature. Of course the nodes as well as the annotation are still indexible.

For more please checkout https://gitlab.codenic.de/computomics/ag/blob/master/src/ag.hpp. And for the modified xg: https://gitlab.codenic.de/computomics/xg

I think it could be worth to take a look at it.

JervenBolleman commented 6 years ago

@subwaystation thank you for letting me know about ag, but it is a technical direction I am personally not interested in. My aim is to use VG for building variation graphs and for sequence orientated searches, but use SPARQL and RDF for annotation and integration with existing systems. Mostly my aim is to combine genomic variation graphs with protein variation graphs with all of the richness of Ensembl and UniProt annotation and I do not think it economically feasible to push this all into a VG/XG like system.

subwaystation commented 6 years ago

@JervenBolleman That makes total sense from my point of view. Thanks for pointing it out. I am curious how you are going to implement this :)

JervenBolleman commented 6 years ago

@subwaystation An old wiki page about the general concept is here https://github.com/vgteam/vg/wiki/Annotating-a-VG-graph-the-RDF-way and the specific hack attack I would like to pick up is here https://github.com/vgteam/vg/wiki/VG-RDF,-the-Ensembl-bacteria-E.-coli-genome-hack-attack

Which was stopped at the time exactly at the point of needing to add "gene" paths to the VG. With the node breaking in the existing genome paths logic being the most significant issue.

subwaystation commented 6 years ago

@JervenBolleman Thanks! One question: So we kind of annotated node 1-3,4-1,and 5-2 to be part of the reference genome. Why don't you take the actual (I think unique) node IDs of VG? Or is this just to exemplify what you have in mind?

JervenBolleman commented 6 years ago

@subwaystation its just an example to illustrate the thinking. Basically I see current existing gene annotations as sub paths of the reference genome path. So yes it would use the nodeIDs of VG (of course there is whole different set of problems with current node ID assignment in VG, that is the topic of different issues/pages).

In the JSON syntax we have the existing VG graph of the two genomes

{
  "node": [{"sequence": "AA", "id": 1},
        {"sequence": "C", "id": 2},
        {"sequence": "C", "id": 3},
        {"sequence": "AAGG", "id": 4},
        {"sequence": "T", "id": 5},
        {"sequence": "A", "id": 6},
        {"sequence": "AA", "id": 7},
        {"sequence": "AG", "id": 8}
    ],
  "path": [ { "name" : "genome 1", 
                     "mapping":[ 
                       {"position":{"nodeid":1}},
                       {"position":{"nodeid":2}},
                       {"position":{"nodeid":3}},
                       {"position":{"nodeid":4}},
                       {"position":{"nodeid":5}}]},
               { "name" : "genome 2", 
                    "mapping":[ 
                       {"position":{"nodeid":1}},
                       {"position":{"nodeid":7}},
                       {"position":{"nodeid":8}},
                       {"position":{"nodeid":4}},
                       {"position":{"nodeid":9}}]},
             { "name" : "genome 1 gene JJ", 
                    "mapping":[ 
                       {"position":{"nodeid":2}},
                       {"position":{"nodeid":3}}]},
              ] ,
  }
}

This is the level we can do today without any further changes required from VG. At least we can hack this from the RDF side.

The challenge is when we want to add a new gene-path that requires us to split an existing node. e.g. we add a gene "aaag" that uses node:6 and the first three nucleotides of node:4 requiring us to split node:4 into node:4a and node:4b and update the existing paths to have the new node identifiers etc...

Thus modifying the original graph into the following one.

{
  "node": [{"sequence": "AA", "id": 1},
        {"sequence": "C", "id": 2},
        {"sequence": "C", "id": 3},
        {"sequence": "AAG", "id": 4a},
        {"sequence": "G", "id": 4b},
        {"sequence": "T", "id": 5},
        {"sequence": "A", "id": 6},
        {"sequence": "AA", "id": 7},
        {"sequence": "AG", "id": 8}
    ],
  "path": [ { "name" : "genome 1", 
                     "mapping":[ 
                       {"position":{"nodeid":1}},
                       {"position":{"nodeid":2}},
                       {"position":{"nodeid":3}},
                       {"position":{"nodeid":4a}},
                       {"position":{"nodeid":4b}},
                       {"position":{"nodeid":5}}]},
               { "name" : "genome 2", 
                    "mapping":[ 
                       {"position":{"nodeid":1}},
                       {"position":{"nodeid":7}},
                       {"position":{"nodeid":8}},
                       {"position":{"nodeid":4a}},
                       {"position":{"nodeid":4b}},
                       {"position":{"nodeid":9}}]},
             { "name" : "genome 1 gene JJ", 
                    "mapping":[ 
                       {"position":{"nodeid":2}},
                       {"position":{"nodeid":3}}]},
             { "name" : "genome 1 gene KK", 
                    "mapping":[ 
                       {"position":{"nodeid":3}},
                       {"position":{"nodeid":4a}}]},
              ] ,
  }
}

PS. JSON syntax adapted from example https://github.com/vgteam/vg/blob/master/test/cyclic/reverse_self.json which is either out of date or something is odd with the json vg.

subwaystation commented 6 years ago

@JervenBolleman Thanks for the illustration. In the AG, this is actually solved as you mentioned above: The nodes are split according to the annotation. But I am not quite sure if I understood your problem correctly: There are paths in the graph of both genomes and genes. And when you add these genes as paths in the graph, nodes have to be split and therefore the node IDs change. The JSON should be updated automatically, right? But you miss the mapping of this on the RDF site? Because the IDs changed?

JervenBolleman commented 6 years ago

@subwaystation The problem is that there is a lack of an tool to add the genes as paths in the graph in VG today. Specifically using the existing tools in side VG itself. Which is why @adamnovak suggested building a GAM file and using that. Unfortunately I can't find a spec for a GAM file :(

subwaystation commented 6 years ago

Ah. I thought that you would insert them as follows: GFF -> FASTA. Then align all sequences. But they could be aligned anywhere of course. Actually I thought that what you want is possible, yet. Obviously it is not. But the rest of your concept makes looks good.

nerdstrike commented 6 years ago

I'd suggest spooling path entities directly into the protobuf, but I think it needs rather a lot of raw information about nodes in the path to be.

JervenBolleman commented 6 years ago

@nerdstrike that is as complicated as direct fixing the RDF or JSON. Mostly it's about efficiently splitting nodes and fixing up existing paths. I think someone will need to write a new vg subcommand for it. I was thinking of naming it inject but that might get confusing with surject.

adamnovak commented 6 years ago

The GAM "spec" right now is whatever is written by stream.hpp operating on Alignment objects from vg.proto.

Basically it ends up as a gzip-compressed series of chunks, with each chunk consisting of an int64 record count, followed by that many int32-length-prefixed Protobuf-serialized Alignments. The lengths and counts are encoded with Protobuf's variable-length int encoding.

However, rather than just dumping that out manually, you might be better off generating JSON to describe the alignment you want and running it through vg view -JGa to serialize it in binary GAM format.

glennhickey commented 6 years ago

Note that you can leave out everything from your Alignment records apart from a Name and a Path if you want.

So if you can express your annotations as a list of name/path pairs in JSON like

{"path": {"mapping": [{"position": {"node_id": 1, "offset": 813, "is_reverse": true}, "edit": [{"from_length": 9, "to_length": 9}], "rank": 1}]}, "name": "annotation1"} {"path": {"mapping": [{"position": {"node_id": 1}, "edit": [{"from_length": 9, "to_length": 9}], "rank": 1}]}, "name": "annotation2"}

etc.

You can convert the list into gam like Adam said with vg view -JGa. Then you can embed that gam as paths in the graph with vg mod -i. This will take care of breaking nodes and updating existing paths.

On Thu, Jan 18, 2018 at 1:01 PM, Adam Novak notifications@github.com wrote:

The GAM "spec" right now is whatever is written by stream.hpp https://github.com/vgteam/vg/blob/1f9951060e8c6be34fbd2a313d1f748e5df859c4/src/stream.hpp operating on Alignment objects from vg.proto https://github.com/vgteam/vg/blob/b6b2fd52880db49f69cef571977d8a8f6a162eb3/src/vg.proto#L104-L141 .

Basically it ends up as a gzip-compressed series of chunks, with each chunk consisting of an int64 record count, followed by that many int32-length-prefixed Protobuf-serialized Alignments. The lengths and counts are encoded with Protobuf's variable-length int encoding.

However, rather than just dumping that out manually, you might be better off generating JSON to describe the alignment you want and running it through vg view -JGa to serialize it in binary GAM format.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1381#issuecomment-358730249, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7hSUHmC6k37hLzLhbgcAqUYmsLRMks5tL4b3gaJpZM4Rh-nO .

JervenBolleman commented 6 years ago

@glennhickey @adamnovak thanks that is very helpful. Will try to play with this in the next days.

ekg commented 6 years ago

I'm sorry that I haven't joined this conversation yet. In the hackathon I built a function in vg annotate that converts BED records into GAM records. I apologize that it is not well documented. It needs to be properly tested.

vg annotate -b x.bed -x x.xg >x.gam

Then you can add the alignments to the graph and you will have embedded them. If GBWT would support lookup related to path names then we would have an annotation database on vg. (Does it yet?)

JervenBolleman commented 6 years ago

@ekg that is very helpful. Unfortunately the Ensembl bacteria species I have access to do not have BED files. I did look at the annotate command and was thinking about using the --novelty option but was not sure if I would have enough information to make it work.

ekg commented 6 years ago

@JervenBolleman I see. How are the genes annotated then? I'm a little confused. If you have an example data set for a small region we could include it in the distribution as a way to drive tests.

I do not believe that node ids should be a stable concept. What should be stable are the paths. Through them we can construct a query for a particular graph position that will be stable as long as the graphs we are querying have the same embedded path. If node ids need to be stable then we can simply begin with 1bp nodes, then all the problems with node splitting and merging disappear. These are only there for performance reasons. Keeping a node id per graph position is expensive in most use cases, and I would not recommend it.

@subwaystation I am somewhat confused by AG. Why do you need to add a new concept for annotation to the graph? Are you familiar with the GBWT and gPBWT models that allow us to encode large numbers of annotations (aka haplotypes / paths / alignments) into compact queryable indexes that relate these paths to the graph? To support the colored graph concept that you are developing, it would be enough to use single-node paths with a consistent naming pattern or prefix. Of course doing so will be no more efficient in GBWT than in simpler encodings, but the concept will be generic to longer and shorter paths. If you are interested in this technical development, I would suggest finding how these models are insufficient in their current implementations and then building on them. That said you are completely free to do as you wish with the codebase! This is just a suggestion that may improve the reach of your work and prevent us from duplicating effort.

nerdstrike commented 6 years ago

@JervenBolleman For proof of concept, you could consider extracting BED from rest.ensemblgenomes.org/overlap/$species/$region?feature=gene;content-type=text/x-bed

Sadly it's capped on region size to 500kb out of necessity.

@ekg Ensembl don't produce BED in bulk. You'd normally make them for genome browsers, but our browser runs direct off the database. Not-awful alternatives are sadly limited to GFF. Everything else would need code to unpack the correct coordinates.

ekg commented 6 years ago

@nerdstrike but can't we convert GFF into BED? Am I missing something obvious?

nerdstrike commented 6 years ago

If you like. One more hurdle. Us service developers are averse to these circuitous processes - they get fragile in production. It's more fun to bug developers to expose the right interface for us :)

JervenBolleman commented 6 years ago

Big discussion on side topics and I think we should split that out into independent topics. @ekg would you mind opening an issue on node identifiers and if they should be stable or deterministic? It distracts from this issue.

Also what I think off as annotations are things like go terms etc.. not so much gene locations as such. i.e. not what it is but what it does. Especially I am interested in derivative product annotation i.e. mRNA, proteins and even their 3D structures as how it relates to genomic variation.

@nerdstrike yes totally! ruby goldberg machines are cool on youtube not on saturday morning when your timesenstive production cycle blows up because one of the domino's fell over to early.

6br commented 6 years ago

I would like to describe the usage of vg annotate -b BED -x XG -p. A concept (but undocumented...) is here https://github.com/vgteam/vg/pull/1036.

As long as annotations are the subpath of existing paths represented like following BED format, they can be converted into GAM format.

This is the graph generated from test/tiny/tiny.vcf.gz and test/tiny/tiny.fa. x3 If you would like to get an alignment of GAAA from node 8 to node 9, which is a subpath of path x, BED file is described as following.

x      13        17      gene

It means there is an annotation on the path x from 13 to 17 named "gene". It can be converted into GAM format following via vg annotate -b example.bed -p -x tiny.xg.

{"path": {"mapping": [{"edit": [{"from_length": 1, "to_length": 1}], "position": {"node_id": 8}, "rank": 5}, {"edit": [{"from_length": 3, "to_length": 3}], "position": {"node_id": 9}, "rank": 6}]}, "name": "gene"}

Moreover, it can be integrated on VG using vg mod -P --include-aln.

result

In this case, node identifiers are preserved but the offset of annotations is not preserved. Otherwise, nodes can split according to alignments using vg mod --include-aln.

result2

I found bugs in this function, so I fixed it (https://github.com/vgteam/vg/pull/1386). And I think I can add an interface to accept GFF or GTF format in addition to BED format if you need.

JervenBolleman commented 6 years ago

@ekg I forgot there is an already existing issue #112 which is about stable node identifiers.

subwaystation commented 6 years ago

@ekg Honestly, I am not quite familiar with the annotation concept implemented in vg. What I got so far is that annotations are basically paths through the graph, right? Can you point out to me, where I could find deep information about GBWT and gPBWT? I did not code any part of AG, so sometimes I am confused about it, too ;) However, what I would have suspected is that the direct annotation of a single node is much faster in practice than the annotation of a single node with a path for each annotation. But that is pure theory here. Currently I am not developing anything, neither code for the graph nor code for the VIZ. All of what you can observe in AG and AGV is as it is. Also, because I am not working for CTX. But, I completely am on your side about the duplication effort. I would not have wanted that either. Thanks for the helpful tips!

JervenBolleman commented 6 years ago

I think this issue can be closed since Ensembl Bacteria now produces BED files and this can be used with vg annotate, index and mod to merge the gene annotation from Ensembl genomes into a VG pan genome graph. How this can be done is discussed in the wiki