vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

semantics of node and edge deletion in MutablePathDeletableHandleGraph #2106

Open ekg opened 5 years ago

ekg commented 5 years ago

In dsgvg, I'm working on a MutablePathDeletableHandleGraph. I'd like to clear up the semantics of what happen when we delete nodes and edges that are contained in paths in this kind of handle graph.

I'd like the path sequences to somehow continue to exist, but maybe refer to a kind of hidden node or the literal sequence of the fragment of the path overlapping the node that has been removed. Conceptually, this helps when implementing various graph-modifying algorithms, but maybe it doesn't need to be exposed through the API.

It's not clear to me that there is as much of a problem when removing edges. The paths and the edges both provide topology information.

/**                                                                                                                                                                                   
 * This is the interface for a graph which is deletable and which has paths which are also mutable.                                                                                   
 */
class MutablePathDeletableHandleGraph : virtual public MutablePathHandleGraph, virtual public DeletableHandleGraph {

    // No extra methods. Deleting a node or edge that is contained in a path is undefined behavior.                                                                                   
};
jeizenga commented 5 years ago

My unease with this approach is that it also implicitly changes the semantics of paths, not just deletions. That is, if we don't require the paths to take nodes/edges that are in the graph, then they aren't really paths in a graph theoretic sense. I smell problems coming from that.

Perhaps it would be useful to consider what our use cases are for deletion in the first place. That might help us decide what we need to support.

adamnovak commented 5 years ago

The Path Protobuf objects have always supported this sort of thing. The HandleGraph paths are really more like our threads; they're required to be fully embedded.

If you remove a node that's used by a path, maybe the path has to (or can) become an alignment? Do we want some way of storing a set of alignments in a handle-y way?

ekg commented 5 years ago

I think we just need to store the inserted sequence in the path to maintain these semantics. I'm doing this in dsgvg but in GFA there is no mechanism to express this. I guess in the cigar strings, if they are extended, it is possible to write the non graph sequences.

On Wed, Feb 13, 2019, 01:36 Adam Novak <notifications@github.com wrote:

The Path Protobuf objects have always supported this sort of thing. The HandleGraph paths are really more like our threads; they're required to be fully embedded.

If you remove a node that's used by a path, maybe the path has to (or can) become an alignment? Do we want some way of storing a set of alignments in a handle-y way?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2106#issuecomment-463003865, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4ER_TMdbZSwLwxwfr2UGGicWVFXRiks5vM14HgaJpZM4a2SWv .

adamnovak commented 5 years ago

With what we have now, the undefined behavior on node delete is covering up the fact that the only sensible things to do in the interface we have are to destroy the entire path or find a route around the deleted node to connect things up, and neither of those is a really good solution.

But I don't want to have sequences exist in the paths, as separate from nodes. It complicates the traversal process (because at any point you could hit a non-node), and it makes paths a substantially different kind of thing than they have been up until now.

I think instead we should solve this problem by implementing subpaths, so that a path in the graph can correspond to just a subrange of a named path (like "chr2:238443-239443"). This would let us have the primary path in a graph that doesn't actually have all the nodes it visits (like a sample subgraph), and I think it has clearer semantics than letting raw sequence be part of a path. Node removal is easy: you cut the subpath in two (or more, depending on how many occurrences of the node there were), and the new subpaths represent the original subpath's relationship to the parts of the graph that are still present. There's no need to deal with sequences that don't belong to a node, or to serialize them. You also don't need to carry around all the intervening sequence if you are working with a subgraph that touches a long path at two distant points. The extension to the PathHandleGraph API is adding another level of containment: you iterate over paths, and then subpaths (ranges?), and then occurrences, and each subpath has a starting offset associated with it.

GFA1 doesn't really have a way to express this, beyond appending the coordinate ranges to the name of the subsetted path. But I think that's prettier than extending CIGAR, and the result is going to be legal GFA.