Backend Graph Specification

graph-genome / graph_summarization

Browser for Graph Genomes built with VG based on Graph Summarization to provide semantic zoom. As a user zooms in on a graph genome, the topology becomes more complex. Provides visualization for variation within a species of plant or animal. Designed to scale up to thousands of specimens and provide useful visualizations.

Other

7 stars 1 forks source link

Backend Graph Specification #4

Closed 6br closed 5 years ago

6br commented 5 years ago

((Updated by Josiah 2019-07-10)) ((Updated by Josiah 2019-08-15: Removed PathIndex and Slices))

Version 1 - Basics

GraphGenome is the top level object which holds references to:
- The set of all nodes in the Graph by uuid
- Set of all Paths in the Graph by accession name
- One path per accession (contiguous). Each chromosome = ?
- Method for topological sort to present graph nodes in a consistent order every time
Paths contain
- One unique accession (a.k.a. specimen) name
- List of NodeTraversals in order (reverse lookup)
- Can visit the same node multiple times
- Each node traversal is tracked by NodeTraversal and annotations (epigenetic state, etc.) is tied to a NodeTraversal, not the Node itself.
- Ex. 5+ 9+ 10+ 50+ 23- 78+
Node contains:
- Sequence (optional)
- uuid to distinguish from other nodes of same sequence
- Set of Node Traversals intersecting the node (reverse lookup)
- You can ask each Traversal what is the next "downstream" or "upstream" node. Meaning you can build a set of neighboring nodes, including self for duplications.
- Ex. { (Path5, Order 67, +) (Path10, Order 160, -), (Path5, Order 2756, +) }
NodeTraversal
- Node - being traversed
- Path - the specimen doing the traversing
- strand: either + or - represents reverse complements and inversions
- Order - order in which the traversal occurs in this path
- e.g. 5 would be the 5th node visited by the Path and N+1 is the downstream node

6br commented 5 years ago

Version 2 - Summary Layers

GraphGenome has a list of SummaryLayers, at increasing levels of summarization.
Node Changes:
- Node has link to summary_parent: a node in the next SummaryLayer up.
- Backlinks from one node can lead to multiple, more specific nodes when zooming in.
Nodes: Tracking Alternative Haplotypes
- While summarizing some sequence information was lost. We need to quantify how much.
- Nodes from split_groups contain a link to their alternative_allele and a sequence size and percentage identity.
- alternative_allele ForeignKey(Node)
- alternative_overlap_bp number of base pairs the two alleles agree on
- When simple_merge Nodes with alternative_allele, alternative_overlap_bp can be added to ensure percentages don't drift based on nodes of differing sizes.
Path changes:
- Each specimen will have one Path per SummaryLayer
- Paths belong to one summary level. Paths can be built using the transitive property on Node summary_parents.

((Updated by Josiah 2019-08-15: Clarified Ribbon vs. Path. Added AggregateTraversal))

Version 3 - Ribbon Haplotypes

Key idea: instead of sending 1,000 paths for 1,000 specimens, there are ~20 Ribbons sent with specimens counts. A detail query returns a list of specimens in each ribbon. Ribbons are a container to summarize many paths that share commonality i.e. a haplotype.
Ribbons contain:
- Dendrogram of paths computed like a gene tree of similar sequences, show how diverged each pair of specimens are
- Similarity is defined by the sequence of nodes along their path
- Branch lengths for similarity
- Presented Ribbons are not hard boundaries, just a cutoff based on data and level of detail. There is a fundamental problem with coloring local versus global solutions will not match. Further discussion/investigation is required.
AggregateTraversal (NodeTraversal): Ribbons have higher level traversal objects
- coverage_count The number of specimens inside this ribbon that actually traverse this node. From this can be calculated the number of specimens bypassing a node.
- Possibly sum in copy number expansion as well (this is summarizing)

Version 4 - Annotation

Annotation Stack is a list of annotations
Annotation metadata is a tree/graph of annotations that represents relationship between annotations
- For example, a gene contains isoforms, and an isoform contain introns and exons.
Annotation has two types of
- Contiguous annotation (i.e. paths)
- Continuous value annotation (vg pack format, gGFF format) (WIP)

josiahseaman commented 5 years ago

[x] Did you end up choosing Node storing a Path index, or Nodes and Paths both pointing to an Intersection object? Both versions are currently in the spec. Once you've decided, let's remove the redundant spec.
[x] In version 2, checkpoints should probably be their own object. Paths would contain a list of segments with a checkpoint at the beginning of every segment. I'm still not convinced checkpoints are ideal, but it's an acceptable rough draft.

6br commented 5 years ago

For the first issue, I prefer to store a Path index into Node. Because it is simpler than using Intersection object.

6br commented 5 years ago

For the second issue, checkpoints seem ad-hoc ideas to represent nucleotide coordinates on paths, but they are useful when users wish to understand what part of the whole genome do they see. Suppose if paths contain a list of segments with a checkpoint, and if you want to merge two segments across the checkpoint, then it becomes unclear how to handle the checkpoint. However, if you specify checkpoints on the farthest view (aka. bird's eyes), then the checkpoints can propagate to all zoom layers including nucleotide zoom level because we can assume that we do not merge nodes on zooming up. I wonder if there is a good way to represent checkpoints in this spec.

josiahseaman commented 5 years ago

Check points are ad hoc, but they may scale very nicely all the same. To be clear, checkpoints are just a speed optimization feature. It's possible to start at the beginning of a path and count every nucleotide in every node until you reach the index you are looking for. Checkpoints are just a precomputed sum stored haphazardly throughout the path. So you can delete a checkpoint and nothing bad happens, it just takes fractionally longer to calculate. As the graph is iteratively summarized, checkpoints would become increasingly dense and outnumber nodes, which is bad. So every time a merge operation spans a checkpoint, just delete the checkpoint and append the two lists together. That way there's roughly the same ratio of checkpoint to node. The nice thing about preserving checkpoints at large scale that were precomputed at nucleotide scale is that there's no rounding error propagation. They'll stay as accurate as the graph allows.

The competitor solution is to have a "coordinate mapping" service that's a completely separate data structure. This may be cleaner in the long run. Either way, we don't have to decide yes/no on checkpoints immediately. We need to decide #5 first.

@ekg Does VG already have a better solution for the problem of mapping old TAIR10 coordinates onto a graph in random access?

ekg commented 5 years ago

@josiahseaman what's a TAIR10 coordinate?

In vg we have the xg index, which handles coordinate mappings of the graph. I've submitted a proposal for the upcoming Biohackathon that extends this model with a server interface that could be used by graph genome browsers.

josiahseaman commented 5 years ago

Great! That sounds like exactly what we're looking for. So we'd use the service you create at BioHackathon in combination with the links from node to summary node (Version 2 - Summary Stacks) to be able to find for example, the beginning and end node at zoom level 3 corresponding to nucleotides 23,758,110 and 45,687,084. That'd be two calls to Erik's service and three lookups to iterate up zoom levels. I think we can put that question to rest now.

P.S. TAIR10 is just an example of a particular genome assembly, I could also have said Hg38 coordinate just to point out the coordinate concept is dependent on a particular assembly.

ekg commented 5 years ago

xg provides graph to genome path position mappings. That's pretty much the basis for these assembly coordinate systems.

6br commented 5 years ago

My goal: I have a python data interface to interact with xg server on version 1.

josiahseaman commented 5 years ago

Toshiyuki: "We have several candidates to worth trying as a backend data store. First I tried xg. Erik will implement a server version of xg, but it is not yet available. So, I decided to use a binary version of xg so far. Next, we need to interact with xg using GFA format. I searched the library to represent gfa format, and I found gfapy. But, due to dialects of GFA format, the output of gfapy and xg is different in several cases. Also, there is no interface to compare gfa formats. I implemented an initial version of interface between gfa and xg (gfa.py). I am not satisfied with the function of gfapy itself. Another approach is to have our own data format in python class (graph.py). If we select this approach, I need to convert them into gfa format to store in xg format or other file formats. We first need to choose two approach (1) "

josiahseaman commented 5 years ago

This week:

[x] I'll split out my Graph object code to a python file and push it to GitHub.
[x] Toshiyuki will add import from GFA or VG into our python object format.
[x] Finish Graph summarization with sequence for Split_vertical, propagate_split, merge_horizontal

josiahseaman commented 5 years ago

As I was working on commit 098b4cbcf8ceb0375a0138ddb084566d6934fe79 I was taking notes for @6br . There were a couple of decisions I wasn't sure about. The big change was the lack of slices in Graph class. Instead, everything is just Nodes and Paths. I ended up keeping dictionaries of both which should be replaced with a database lookup in the future.

I managed to replace GFA.to_graph with only this:

    def to_graph(self):
        # Extract all paths into graph
        path_names = [p.name for p in self.gfa.paths]
        graph = Graph(path_names)  # Paths can be empty at start
        for path in self.gfa.paths:
            for node in path.segment_names:
                graph.append_node_to_path(node.name, node.orient, path.name)
        for segment in self.gfa.segments:
            graph.nodes[segment.name].seq = segment.sequence
        return graph

This build the full sequence for all accessions and contains the full graph structure. But it doesn't do slices or topological sort. The GFA lines: L 1 + 2 + 0M have not yet been addressed.

Why is to_gfa a @property? There's nothing incorrect, it just surprised me.

The future of graph tests

My mockup declaration format has reached the end of its life. You really can't declare inversions or any profoundly non-linear structure with this format: [['CAAATAAG', {x, y, z}], ['A', {y, z}, 'G', {x}], ['C', {x, y, z}], ['TTG', {x, y, z}], ['A', {z}, 'G', {x, y}], ['AAATTTTCTGGAGTTCTAT', {x, y, z}], ['T', {x, y, z}], ['ATAT', {x, y, z}], ['T', {x, y, z}], ['CCAACTCTCTG', {x, y, z}]]. It's useful but adding strands to the path mentions is not equivalent to having strands in paths when traversing nodes. We may just want to have a bunch of pickle files from here on out to support our tests. When we first make a test, we use a debugger to look at the structure and verify it by hand, then pickle it to a file and compare against the pickle in our test. The test test_load_gfa_to_graph() isn't passing but that is because it has nothing to compare to. As far as I can tell, the gfa.to_graph function is working correctly.

6br commented 5 years ago

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+ 
pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

josiahseaman commented 5 years ago

Node should definitely have an ID, it's currently the key in my node dictionary. I'm uncomfortable with this implementation because it should be a database lookup, but we haven't started django or flare yet. We need to get your branch and mine merged ASAP. If you've decided on flare, we need it for node id tracking.

In going to subclass Graph to SlicedGraph, so we can separate sort and slicing to it's own optional step.

On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+

pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .

On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+

pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .

josiahseaman commented 5 years ago

Minor note: node name should be have + that's path specific, not specific to the node.

On Wed, Jul 10, 2019, 11:21 Josiah Seaman josiah.seaman@gmail.com wrote:

Node should definitely have an ID, it's currently the key in my node dictionary. I'm uncomfortable with this implementation because it should be a database lookup, but we haven't started django or flare yet. We need to get your branch and mine merged ASAP. If you've decided on flare, we need it for node id tracking.

In going to subclass Graph to SlicedGraph, so we can separate sort and slicing to it's own optional step.

On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+

pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .

On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+

pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .

6br commented 5 years ago

I confirmed modifications on NodeTraversal and PathIndex.