You might think about subclassing the XG object somehow, or adding additional indexes on top of / independently of it.
I am considering how to implement the web backend. There are three choices, described above. Graph summarization adds several layers on top of the original graph genome. The lowest layer is the original "full" graph, and the highest layer is a kind of bird's-eye view; each layer is a smaller graph than the one below it. Each node of a smaller graph holds pointers to several nodes in the layer below, so, taken as a whole, the pointers form a tree. Every layer includes an index for retrieving subgraphs. @ekg, what do you think about adding these layers into the xg indices, or have there been other attempts to add such higher layers to vg? Currently, I am not sure whether this data structure is beneficial for anything other than visualization.
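To make the layer idea concrete, here is a minimal Python sketch, with all names hypothetical, of a hierarchy where each summary node points down to the nodes it aggregates in the layer below, so following the pointers from the top fans out like a tree:

```python
from dataclasses import dataclass, field

@dataclass
class SummaryNode:
    node_id: int
    children: list = field(default_factory=list)  # nodes in the layer below

@dataclass
class Layer:
    level: int                                 # 0 = original "full" graph
    nodes: dict = field(default_factory=dict)  # node_id -> SummaryNode

    def subgraph(self, node_ids):
        """Index-style retrieval of a subgraph within this layer."""
        return [self.nodes[i] for i in node_ids if i in self.nodes]

def expand(node):
    """Resolve one summary node down to the full-graph nodes beneath it."""
    if not node.children:
        return [node]
    out = []
    for child in node.children:
        out.extend(expand(child))
    return out
```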
An alternative idea is to keep our own database for the summary layers. To summarize graphs in the database, we would need to prefetch all graphs into it. The upside is that we can easily update our data model for visualization; the downside is that it might be redundant and might not stay compatible with new versions.
@6br I think we should build from, not into, the xg index. We can extend it with these kinds of indexes, perhaps by precomputing summarized views and storing them in separate indexes, then translating between them as needed with a separate system that links them together.
I think the idea of keeping your own database is fine too, and basically the same as this; it all depends on how much you optimize it for this particular application.
I believe these summary graphs will be useful for more than just visualization. Fundamentally, what I'm designing it to do is to group together haplotype blocks and iteratively find informative boundaries. That would be a very useful precompute step for any other analysis. Working with fewer, less noisy nodes would be a good starting point for other researchers.
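A hedged sketch of the grouping step I have in mind: nodes visited by the same set of haplotypes belong to one block, and a boundary is informative where that set changes. The input format here is an assumption, just (node id, set of haplotype names) pairs in path order:

```python
def haplotype_blocks(path_sets):
    blocks, current = [], []
    for node_id, haplos in path_sets:
        if current and haplos != current[-1][1]:
            blocks.append(current)  # boundary: the haplotype set changed
            current = []
        current.append((node_id, haplos))
    if current:
        blocks.append(current)
    return blocks

# Nodes 1-2 are shared by haplotypes {A, B}; node 3 only by A -> two blocks.
print(haplotype_blocks([(1, frozenset("AB")),
                        (2, frozenset("AB")),
                        (3, frozenset("A"))]))
```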
The difference between options 1 and 2 is most obvious when it comes to updates. When the XG standard format is updated, does the summary graph code also get updated before a release? Do we lock the version numbers together?
Further discussion over the following month has led pretty conclusively to an approach where we support import from both XG and GFA. Our database can handle the new feature concepts of links between summary layers and aggregation of paths into haplotypes and ribbons. Then we can export to the XG file format and remain compatible with other tools. This seems like the best of both worlds and doesn't put too much development burden on adding features. Each summarization layer can be exported as its own graph, but since other tools don't have the concept of linking graphs in a hierarchy, the layers won't be linked. This means that even if you don't care about our visualizations, it will still be a useful tool for graph scrubbing or summarization.
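For the GFA side of the import, a minimal reader sketch covering only GFA1 S (segment), L (link), and P (path) records; a production importer would also need to handle headers, tags, and the other record types:

```python
def load_gfa(path):
    nodes, edges, paths = {}, [], {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "S":    # S <id> <sequence>
                nodes[fields[1]] = fields[2]
            elif fields[0] == "L":  # L <from> <orient> <to> <orient> <cigar>
                edges.append((fields[1], fields[2], fields[3], fields[4]))
            elif fields[0] == "P":  # P <name> <id1+,id2-,...> <overlaps>
                paths[fields[1]] = fields[2].split(",")
    return nodes, edges, paths
```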
We can have a follow-up discussion about the feature differences between GFA and XG and whether either of those technical details conflicts with anything in our database concepts. I personally think that discussion will be clearer once we have a functioning tool, so I'll skip the speculation for now, implement what is possible, and see if there are any substantial snags along the way.
To store the graphs in RAM efficiently, you might be looking at something very similar to the various HandleGraph implementations, like xg. GFA is only an interchange format; the HandleGraphs are self-indexes that allow random access by any feature of the graph.
XG presents a HandleGraph interface, but it is unique in allowing random queries of path (sequence or reference) positions. We can find which paths are at a given node in O(1) time, and which node is at a given path position in something like O(log N) time, where N is the size of the path. The graph topology is packed into a single vector that supports efficient random lookup and cache-efficient O(1) relativistic traversal.
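To illustrate those two queries in a language-neutral way, here is a toy Python stand-in: a prefix sum of node lengths gives node-at-path-position by binary search in O(log N), and an inverted map would give paths-at-node in O(1) expected time. xg achieves this with succinct rank/select structures rather than Python lists and dicts:

```python
import bisect

class PathIndex:
    def __init__(self, path_nodes, node_len):
        self.path_nodes = path_nodes  # ordered node ids along the path
        self.offsets = [0]            # cumulative start position of each node
        for n in path_nodes:
            self.offsets.append(self.offsets[-1] + node_len[n])

    def node_at(self, pos):
        """Node covering 0-based path position `pos`."""
        i = bisect.bisect_right(self.offsets, pos) - 1
        return self.path_nodes[i]

idx = PathIndex(["n1", "n2", "n3"], {"n1": 5, "n2": 3, "n3": 7})
assert idx.node_at(0) == "n1" and idx.node_at(6) == "n2"
```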
At the BioHackathon I will be implementing a server API on top of xg. It will expose a HandleGraph API. Let me know if y'all have any ideas for queries needed by the visualization service. I can add them to the API if they are not already implemented there.
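As a strawman for that discussion, a hypothetical sketch of endpoints a visualization might want from such a server; the route names and the backend object are assumptions, not the actual API:

```python
from flask import Flask, jsonify

class StubGraph:
    """Placeholder standing in for an xg-backed HandleGraph."""
    def get_sequence(self, node_id):
        return "ACGT"
    def node_at_path_position(self, name, pos):
        return 1

app = Flask(__name__)
graph = StubGraph()

@app.route("/node/<int:node_id>/sequence")
def node_sequence(node_id):
    return jsonify(sequence=graph.get_sequence(node_id))

@app.route("/path/<name>/position/<int:pos>")
def node_at_position(name, pos):
    return jsonify(node=graph.node_at_path_position(name, pos))
```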
Thank you for your comment, Erik. HandleGraph is probably relevant to us. Does this require storing the whole graph in memory? I don't quite have the mental bandwidth to think through it all right now, so please allow me to think out loud. I'm pretty sure a database would have the same O(1) node retrieval or O(log N) binary search on paths. The difference may be that a DB handles paging to memory automatically, or there may be no difference at all, in which case, yeah, I don't want to reinvent the wheel.
The factor driving the database decision is that we'll need a database to contain links and concepts not handled by XG. Fundamentally, we need a place to add new features. In order to have links between nodes in different summarization layers, I need to have two nodes in the DB and a link between them. In practical terms, that means every single Node needs to be present as a copy in the DB. Sure, I could skip storing sequence in them, even skip upstream and downstream connections. But at the point where you already have a data structure with every node, it seems you're 90% of the way to just handling the whole dataset internally, with an XG-mediated import option.
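A hedged Django-model sketch of that idea, with every node copied into the DB and the inter-layer link as a self-referential foreign key; all field and model names here are illustrative, and this would live in a Django app's models.py:

```python
from django.db import models

class Node(models.Model):
    graph_layer = models.IntegerField()     # 0 = full graph, higher = coarser
    name = models.CharField(max_length=63)  # node id within its layer
    seq = models.TextField(blank=True)      # could be omitted and fetched from xg
    # The inter-layer pointer: the summary node this node rolls up into.
    summarized_by = models.ForeignKey(
        "self", null=True, blank=True,
        related_name="children", on_delete=models.SET_NULL)
```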
Sebastian recently suggested we collaborate on a standard file format for summary levels. If we had a summary node link map (which is just a tree) plus XG retrieval, we could technically have all the data without a database, though it would be spread across many different files. The reason I'd decide against a pure file solution comes down to migrations. If I have a database schema, Django automatically generates migration scripts that update all data from any version. If we code file formats by hand, then every schema change is a breaking change, or a lot of development time goes into writing version migrations by hand and hoping you never make a mistake that corrupts your users' data. With import/export from an internal database, it seems I get the best of both worlds, with clearly defined boundaries of responsibility.
I agree with you that a database is sufficient and vastly more flexible. The subtext here is that the graphs can be very large if they are stored in uncompressed form. If we use even a handful of pointers and 64-bit integers per node or edge in the graph, we're going to run into storage costs in the terabyte range for just the 1000GP small-variant graph. The implementations we've made keep memory usage close to the 0-order entropy of the data while providing random access, but they have to be customized for this particular application. I would suggest a kind of hybrid approach, where the links and annotations are stored in an overlay. However, if everything is being dropped into disk-backed databases and performance isn't critical, then maybe there's no reason to go this route.
Hi Erik, I will take your recommendation seriously. I hadn't done the storage-size calculations yet, but I'd wager the DB is 2-4x your optimized size. In order to make an overlay for our summary connections, we're going to need those DB objects anyway. My plan is that if storage or performance becomes an issue, we'll replace the DB object properties that store sequence-level nodes with a method that invisibly fetches that data from HandleGraph instead. That leaves a slow DB implementation as a first development step, which later becomes an interface to HandleGraph. Does that sound like a reasonable approach for balancing feature development against performance?
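A sketch of that swap, assuming a hypothetical HandleGraph-backed client with a sequence_of call: the DB object keeps the same attribute name, but the property delegates to the graph instead of holding the sequence itself:

```python
class NodeProxy:
    def __init__(self, node_id, handle_graph_client):
        self.node_id = node_id
        self._hg = handle_graph_client

    @property
    def seq(self):
        # Looks like a plain stored attribute to callers, but invisibly
        # fetches the sequence from the HandleGraph service instead.
        return self._hg.sequence_of(self.node_id)
```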