This is one of those things I skipped over while trying to get the new version out, but it'd be really good to add in when I get some time. With #202 rearing its head, now might be a good time to sort this junk out.
Briefly, the current way we store data in Python (both biological metadata, e.g. length/coverage/GC content, and internal stuff like layout positions) for nodes/edges is ... "ad hoc", to put it politely. I think node data is currently stored in the AssemblyGraph.digraph graph (in the data dictionary), and edge data is usually stored in the AssemblyGraph.decomposed_digraph graph (and then distributed throughout the subgraphs of each Pattern, as needed).
We should really try to set things up so the AssemblyGraph holds this data directly (or maybe have another class hold it, idk): this would simplify both setting and getting data, and would make pattern detection, layout, rotation/scaling, and data exporting much easier.
Necessary things for this system
Easy iteration: No need to manually do DFS on the graph's patterns or whatever in order to find all the edges. Should be easy to just go through all edges in the graph and get their data.
Easy to update internal data: Should be easy to say that a node / edge / etc. is now "in a pattern". Should be easy to get both the original source/target of an edge and the source/target of an edge in the decomposed digraph, as well as to find edge(s) based on these adjacencies.
Allows for multi-edges: not necessarily in the original graph, but definitely in the decomposed digraph. At the very least, this shouldn't fail silently when trying to add a node/edge that already exists -- it should fail extremely loudly, so that #202-esque bugs don't happen again.
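To make the three requirements above a bit more concrete, here's a rough sketch of what a flat data store could look like. Everything in it (GraphDataStore, EdgeRecord, the method names, the "chain0" pattern ID) is made up for illustration and isn't existing code. It assumes node IDs are strings and gives every edge a unique integer ID, so parallel edges can coexist while duplicate nodes and unknown endpoints raise immediately.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EdgeRecord:
    """All data for one edge, kept in a single place (hypothetical)."""
    orig_src: str  # source in the original graph (never changes)
    orig_tgt: str  # target in the original graph (never changes)
    dec_src: str   # source in the decomposed digraph (may become a pattern ID)
    dec_tgt: str   # target in the decomposed digraph (may become a pattern ID)
    data: dict = field(default_factory=dict)
    parent_pattern: Optional[str] = None


class GraphDataStore:
    """Hypothetical flat store for node and edge data."""

    def __init__(self):
        self.nodes = {}  # node ID -> {"data": ..., "parent_pattern": ...}
        self.edges = {}  # unique edge ID -> EdgeRecord
        self._next_edge_id = 0

    def add_node(self, node_id, **data):
        # Fail loudly instead of silently overwriting an existing node.
        if node_id in self.nodes:
            raise ValueError(f"Node {node_id!r} already exists")
        self.nodes[node_id] = {"data": data, "parent_pattern": None}

    def add_edge(self, src, tgt, **data):
        for endpoint in (src, tgt):
            if endpoint not in self.nodes:
                raise KeyError(f"Unknown node {endpoint!r}")
        # Each edge gets its own ID, so parallel edges never collide.
        edge_id = self._next_edge_id
        self._next_edge_id += 1
        self.edges[edge_id] = EdgeRecord(src, tgt, src, tgt, data)
        return edge_id

    def set_parent_pattern(self, pattern_id, node_ids=(), edge_ids=()):
        # Mark nodes / edges as being "in a pattern".
        for node_id in node_ids:
            self.nodes[node_id]["parent_pattern"] = pattern_id
        for edge_id in edge_ids:
            self.edges[edge_id].parent_pattern = pattern_id

    def reroute_edge(self, edge_id, dec_src=None, dec_tgt=None):
        # Update only the decomposed endpoints; the original ones are kept.
        edge = self.edges[edge_id]  # raises KeyError for an unknown edge ID
        if dec_src is not None:
            edge.dec_src = dec_src
        if dec_tgt is not None:
            edge.dec_tgt = dec_tgt

    def edges_between(self, src, tgt, decomposed=False):
        # Find edge IDs by adjacency in either the original or decomposed graph.
        if decomposed:
            return [i for i, e in self.edges.items()
                    if (e.dec_src, e.dec_tgt) == (src, tgt)]
        return [i for i, e in self.edges.items()
                if (e.orig_src, e.orig_tgt) == (src, tgt)]


store = GraphDataStore()
for name, length in [("A", 100), ("B", 250), ("C", 75)]:
    store.add_node(name, length=length)
e1 = store.add_edge("A", "B", coverage=12.5)
e2 = store.add_edge("A", "B", coverage=3.0)  # parallel edge, allowed
e3 = store.add_edge("B", "C")

# Pretend B and C were collapsed into a pattern called "chain0".
store.set_parent_pattern("chain0", node_ids=["B", "C"], edge_ids=[e3])
store.reroute_edge(e1, dec_tgt="chain0")
store.reroute_edge(e2, dec_tgt="chain0")
```

Iterating over all edges is then just a loop over store.edges.values(), with no traversal of the pattern hierarchy needed. The linear scan in edges_between is only there to keep the sketch short; an adjacency index would be the obvious next step if lookups turn out to be hot.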
Nice-to-haves
Separation of user-specified data and internal data: user-defined fields named x shouldn't cause a problem. (This will be harder than it seems, since we'll have to fix this kind of thing in both the Python and JS code... For now, being overly restrictive is the easiest solution.)
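As a sketch of the two options here (on the Python side only): the restrictive one rejects user fields whose names clash with internal ones, and the namespaced one keeps user and internal data in separate dicts so a clash can't happen. The names RESERVED_FIELDS, check_user_fields, and NodeRecord, as well as the specific reserved names, are assumptions for illustration and don't reflect what the code currently uses.

```python
# Hypothetical set of names used internally (layout positions, etc.).
RESERVED_FIELDS = frozenset({"x", "y", "width", "height", "parent_pattern"})


def check_user_fields(fields):
    """Restrictive option: refuse user fields that clash with internal names."""
    clashes = sorted(RESERVED_FIELDS.intersection(fields))
    if clashes:
        raise ValueError(f"Field name(s) reserved for internal use: {clashes}")
    return dict(fields)


class NodeRecord:
    """Namespaced option: user and internal data never share a dict."""

    def __init__(self, node_id, user_fields):
        self.node_id = node_id
        self.user = dict(user_fields)  # whatever the input file provided
        self.internal = {}             # layout positions, pattern membership, ...


# Restrictive option: an ordinary field passes through unchanged.
ok_fields = check_user_fields({"length": 100, "gc_content": 0.45})

# Namespaced option: a user field named "x" and an internal "x" can coexist.
node = NodeRecord("A", {"x": "user value", "length": 100})
node.internal["x"] = 42.0
```

The restrictive check is a few lines and can go in right away; the namespaced version only pays off once whatever gets exported keeps the two dicts apart as well.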