graph-genome / graph_summarization

Browser for Graph Genomes built with VG based on Graph Summarization to provide semantic zoom. As a user zooms in on a graph genome, the topology becomes more complex. Provides visualization for variation within a species of plant or animal. Designed to scale up to thousands of specimens and provide useful visualizations.
Other
7 stars 1 forks source link

Backend Graph Specification #4

Closed 6br closed 5 years ago

6br commented 5 years ago

((Updated by Josiah 2019-07-10)) ((Updated by Josiah 2019-08-15: Removed PathIndex and Slices))

Version 1 - Basics

6br commented 5 years ago

Version 2 - Summary Layers

((Updated by Josiah 2019-08-15: Clarified Ribbon vs. Path. Added AggregateTraversal))

Version 3 - Ribbon Haplotypes

Version 4 - Annotation

josiahseaman commented 5 years ago
6br commented 5 years ago

For the first issue, I prefer to store a Path index into Node. Because it is simpler than using Intersection object.

6br commented 5 years ago

For the second issue, checkpoints seem ad-hoc ideas to represent nucleotide coordinates on paths, but they are useful when users wish to understand what part of the whole genome do they see. Suppose if paths contain a list of segments with a checkpoint, and if you want to merge two segments across the checkpoint, then it becomes unclear how to handle the checkpoint. However, if you specify checkpoints on the farthest view (aka. bird's eyes), then the checkpoints can propagate to all zoom layers including nucleotide zoom level because we can assume that we do not merge nodes on zooming up. I wonder if there is a good way to represent checkpoints in this spec.

josiahseaman commented 5 years ago

Check points are ad hoc, but they may scale very nicely all the same. To be clear, checkpoints are just a speed optimization feature. It's possible to start at the beginning of a path and count every nucleotide in every node until you reach the index you are looking for. Checkpoints are just a precomputed sum stored haphazardly throughout the path. So you can delete a checkpoint and nothing bad happens, it just takes fractionally longer to calculate. As the graph is iteratively summarized, checkpoints would become increasingly dense and outnumber nodes, which is bad. So every time a merge operation spans a checkpoint, just delete the checkpoint and append the two lists together. That way there's roughly the same ratio of checkpoint to node. The nice thing about preserving checkpoints at large scale that were precomputed at nucleotide scale is that there's no rounding error propagation. They'll stay as accurate as the graph allows.

The competitor solution is to have a "coordinate mapping" service that's a completely separate data structure. This may be cleaner in the long run. Either way, we don't have to decide yes/no on checkpoints immediately. We need to decide #5 first.

@ekg Does VG already have a better solution for the problem of mapping old TAIR10 coordinates onto a graph in random access?

ekg commented 5 years ago

@josiahseaman what's a TAIR10 coordinate?

In vg we have the xg index, which handles coordinate mappings of the graph. I've submitted a proposal for the upcoming Biohackathon that extends this model with a server interface that could be used by graph genome browsers.

josiahseaman commented 5 years ago

Great! That sounds like exactly what we're looking for. So we'd use the service you create at BioHackathon in combination with the links from node to summary node (Version 2 - Summary Stacks) to be able to find for example, the beginning and end node at zoom level 3 corresponding to nucleotides 23,758,110 and 45,687,084. That'd be two calls to Erik's service and three lookups to iterate up zoom levels. I think we can put that question to rest now.

P.S. TAIR10 is just an example of a particular genome assembly, I could also have said Hg38 coordinate just to point out the coordinate concept is dependent on a particular assembly.

ekg commented 5 years ago

xg provides graph to genome path position mappings. That's pretty much the basis for these assembly coordinate systems.

6br commented 5 years ago

My goal: I have a python data interface to interact with xg server on version 1.

josiahseaman commented 5 years ago

Toshiyuki: "We have several candidates to worth trying as a backend data store. First I tried xg. Erik will implement a server version of xg, but it is not yet available. So, I decided to use a binary version of xg so far. Next, we need to interact with xg using GFA format. I searched the library to represent gfa format, and I found gfapy. But, due to dialects of GFA format, the output of gfapy and xg is different in several cases. Also, there is no interface to compare gfa formats. I implemented an initial version of interface between gfa and xg (gfa.py). I am not satisfied with the function of gfapy itself. Another approach is to have our own data format in python class (graph.py). If we select this approach, I need to convert them into gfa format to store in xg format or other file formats. We first need to choose two approach (1) "

josiahseaman commented 5 years ago

This week:

josiahseaman commented 5 years ago

As I was working on commit 098b4cbcf8ceb0375a0138ddb084566d6934fe79 I was taking notes for @6br . There were a couple of decisions I wasn't sure about. The big change was the lack of slices in Graph class. Instead, everything is just Nodes and Paths. I ended up keeping dictionaries of both which should be replaced with a database lookup in the future.

I managed to replace GFA.to_graph with only this:

    def to_graph(self):
        # Extract all paths into graph
        path_names = [p.name for p in self.gfa.paths]
        graph = Graph(path_names)  # Paths can be empty at start
        for path in self.gfa.paths:
            for node in path.segment_names:
                graph.append_node_to_path(node.name, node.orient, path.name)
        for segment in self.gfa.segments:
            graph.nodes[segment.name].seq = segment.sequence
        return graph

This build the full sequence for all accessions and contains the full graph structure. But it doesn't do slices or topological sort. The GFA lines: L 1 + 2 + 0M have not yet been addressed.

Why is to_gfa a @property? There's nothing incorrect, it just surprised me.

The future of graph tests

My mockup declaration format has reached the end of its life. You really can't declare inversions or any profoundly non-linear structure with this format: [['CAAATAAG', {x, y, z}], ['A', {y, z}, 'G', {x}], ['C', {x, y, z}], ['TTG', {x, y, z}], ['A', {z}, 'G', {x, y}], ['AAATTTTCTGGAGTTCTAT', {x, y, z}], ['T', {x, y, z}], ['ATAT', {x, y, z}], ['T', {x, y, z}], ['CCAACTCTCTG', {x, y, z}]]. It's useful but adding strands to the path mentions is not equivalent to having strands in paths when traversing nodes. We may just want to have a bunch of pickle files from here on out to support our tests. When we first make a test, we use a debugger to look at the structure and verify it by hand, then pickle it to a file and compare against the pickle in our test. The test test_load_gfa_to_graph() isn't passing but that is because it has nothing to compare to. As far as I can tell, the gfa.to_graph function is working correctly.

6br commented 5 years ago

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+ 
pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

josiahseaman commented 5 years ago

Node should definitely have an ID, it's currently the key in my node dictionary. I'm uncomfortable with this implementation because it should be a database lookup, but we haven't started django or flare yet. We need to get your branch and mine merged ASAP. If you've decided on flare, we need it for node id tracking.

In going to subclass Graph to SlicedGraph, so we can separate sort and slicing to it's own optional step.

On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+

pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .

On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+

pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .

josiahseaman commented 5 years ago

Minor note: node name should be have + that's path specific, not specific to the node.

On Wed, Jul 10, 2019, 11:21 Josiah Seaman josiah.seaman@gmail.com wrote:

Node should definitely have an ID, it's currently the key in my node dictionary. I'm uncomfortable with this implementation because it should be a database lookup, but we haven't started django or flare yet. We need to get your branch and mine merged ASAP. If you've decided on flare, we need it for node id tracking.

In going to subclass Graph to SlicedGraph, so we can separate sort and slicing to it's own optional step.

On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+

pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .

On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:

Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+

pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .

6br commented 5 years ago

I confirmed modifications on NodeTraversal and PathIndex.