Closed 6br closed 5 years ago
summary_parent
: a node in the next SummaryLayer up. alternative_allele
ForeignKey(Node)alternative_overlap_bp
number of base pairs the two alleles agree onalternative_overlap_bp
can be added to ensure percentages don't drift based on nodes of differing sizes.((Updated by Josiah 2019-08-15: Clarified Ribbon vs. Path. Added AggregateTraversal))
coverage_count
The number of specimens inside this ribbon that actually traverse this node. From this can be calculated the number of specimens bypassing a node.For the first issue, I prefer to store a Path index into Node. Because it is simpler than using Intersection object.
For the second issue, checkpoints seem ad-hoc ideas to represent nucleotide coordinates on paths, but they are useful when users wish to understand what part of the whole genome do they see. Suppose if paths contain a list of segments with a checkpoint, and if you want to merge two segments across the checkpoint, then it becomes unclear how to handle the checkpoint. However, if you specify checkpoints on the farthest view (aka. bird's eyes), then the checkpoints can propagate to all zoom layers including nucleotide zoom level because we can assume that we do not merge nodes on zooming up. I wonder if there is a good way to represent checkpoints in this spec.
Check points are ad hoc, but they may scale very nicely all the same. To be clear, checkpoints are just a speed optimization feature. It's possible to start at the beginning of a path and count every nucleotide in every node until you reach the index you are looking for. Checkpoints are just a precomputed sum stored haphazardly throughout the path. So you can delete a checkpoint and nothing bad happens, it just takes fractionally longer to calculate. As the graph is iteratively summarized, checkpoints would become increasingly dense and outnumber nodes, which is bad. So every time a merge operation spans a checkpoint, just delete the checkpoint and append the two lists together. That way there's roughly the same ratio of checkpoint to node. The nice thing about preserving checkpoints at large scale that were precomputed at nucleotide scale is that there's no rounding error propagation. They'll stay as accurate as the graph allows.
The competitor solution is to have a "coordinate mapping" service that's a completely separate data structure. This may be cleaner in the long run. Either way, we don't have to decide yes/no on checkpoints immediately. We need to decide #5 first.
@ekg Does VG already have a better solution for the problem of mapping old TAIR10 coordinates onto a graph in random access?
@josiahseaman what's a TAIR10 coordinate?
In vg we have the xg index, which handles coordinate mappings of the graph. I've submitted a proposal for the upcoming Biohackathon that extends this model with a server interface that could be used by graph genome browsers.
Great! That sounds like exactly what we're looking for. So we'd use the service you create at BioHackathon in combination with the links from node to summary node (Version 2 - Summary Stacks) to be able to find for example, the beginning and end node at zoom level 3 corresponding to nucleotides 23,758,110 and 45,687,084. That'd be two calls to Erik's service and three lookups to iterate up zoom levels. I think we can put that question to rest now.
P.S. TAIR10 is just an example of a particular genome assembly, I could also have said Hg38 coordinate just to point out the coordinate concept is dependent on a particular assembly.
xg provides graph to genome path position mappings. That's pretty much the basis for these assembly coordinate systems.
My goal: I have a python data interface to interact with xg server on version 1.
Toshiyuki: "We have several candidates to worth trying as a backend data store. First I tried xg. Erik will implement a server version of xg, but it is not yet available. So, I decided to use a binary version of xg so far. Next, we need to interact with xg using GFA format. I searched the library to represent gfa format, and I found gfapy. But, due to dialects of GFA format, the output of gfapy and xg is different in several cases. Also, there is no interface to compare gfa formats. I implemented an initial version of interface between gfa and xg (gfa.py). I am not satisfied with the function of gfapy itself. Another approach is to have our own data format in python class (graph.py). If we select this approach, I need to convert them into gfa format to store in xg format or other file formats. We first need to choose two approach (1) "
This week:
As I was working on commit 098b4cbcf8ceb0375a0138ddb084566d6934fe79 I was taking notes for @6br . There were a couple of decisions I wasn't sure about. The big change was the lack of slices in Graph class. Instead, everything is just Nodes and Paths. I ended up keeping dictionaries of both which should be replaced with a database lookup in the future.
I managed to replace GFA.to_graph with only this:
def to_graph(self):
# Extract all paths into graph
path_names = [p.name for p in self.gfa.paths]
graph = Graph(path_names) # Paths can be empty at start
for path in self.gfa.paths:
for node in path.segment_names:
graph.append_node_to_path(node.name, node.orient, path.name)
for segment in self.gfa.segments:
graph.nodes[segment.name].seq = segment.sequence
return graph
This build the full sequence for all accessions and contains the full graph structure. But it doesn't do slices or topological sort. The GFA lines: L 1 + 2 + 0M
have not yet been addressed.
Why is to_gfa a @property? There's nothing incorrect, it just surprised me.
My mockup declaration format has reached the end of its life. You really can't declare inversions or any profoundly non-linear structure with this format: [['CAAATAAG', {x, y, z}], ['A', {y, z}, 'G', {x}], ['C', {x, y, z}], ['TTG', {x, y, z}], ['A', {z}, 'G', {x, y}], ['AAATTTTCTGGAGTTCTAT', {x, y, z}], ['T', {x, y, z}], ['ATAT', {x, y, z}], ['T', {x, y, z}], ['CCAACTCTCTG', {x, y, z}]]
. It's useful but adding strands to the path mentions is not equivalent to having strands in paths when traversing nodes. We may just want to have a bunch of pickle files from here on out to support our tests. When we first make a test, we use a debugger to look at the structure and verify it by hand, then pickle it to a file and compare against the pickle in our test. The test test_load_gfa_to_graph()
isn't passing but that is because it has nothing to compare to. As far as I can tell, the gfa.to_graph function is working correctly.
Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.
As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.
pathA: 1+, 2+, 3+, 4+
pathB: 1+, 3+, 2+, 4+
In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example,
Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))])
.
In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify
branch.
Node should definitely have an ID, it's currently the key in my node dictionary. I'm uncomfortable with this implementation because it should be a database lookup, but we haven't started django or flare yet. We need to get your branch and mine merged ASAP. If you've decided on flare, we need it for node id tracking.
In going to subclass Graph to SlicedGraph, so we can separate sort and slicing to it's own optional step.
On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:
Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.
As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.
pathA: 1+, 2+, 3+, 4+
pathB: 1+, 3+, 2+, 4+
In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .
On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:
Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.
As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.
pathA: 1+, 2+, 3+, 4+
pathB: 1+, 3+, 2+, 4+
In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .
Minor note: node name should be have + that's path specific, not specific to the node.
On Wed, Jul 10, 2019, 11:21 Josiah Seaman josiah.seaman@gmail.com wrote:
Node should definitely have an ID, it's currently the key in my node dictionary. I'm uncomfortable with this implementation because it should be a database lookup, but we haven't started django or flare yet. We need to get your branch and mine merged ASAP. If you've decided on flare, we need it for node id tracking.
In going to subclass Graph to SlicedGraph, so we can separate sort and slicing to it's own optional step.
On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:
Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.
As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.
pathA: 1+, 2+, 3+, 4+
pathB: 1+, 3+, 2+, 4+
In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .
On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama notifications@github.com wrote:
Thank you for your comments. Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.
As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.
pathA: 1+, 2+, 3+, 4+
pathB: 1+, 3+, 2+, 4+
In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]). In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/4?email_source=notifications&email_token=AARG2FDMYWETFHBFKP6WINTP6U3FPA5CNFSM4HW6S6I2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSAFAY#issuecomment-509870723, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ .
I confirmed modifications on NodeTraversal and PathIndex.
((Updated by Josiah 2019-07-10)) ((Updated by Josiah 2019-08-15: Removed PathIndex and Slices))
Version 1 - Basics