Closed hyanwong closed 11 months ago
I would strongly support the addition of such a mapping of single edges with multiple intervals. This would enable a bunch of neat subsequent features, including highlighting all intervals when hovering over the edges and the visualization of edge importance or weights.
Jerome is keen on representing the ARG as a DiGraph, not a MultiDiGraph, as it makes the graph theoretical aspects somewhat easier. I.e. he agrees with you here @fbaumdicker.
In retrospect, it seems perhaps unfortunate to call each separate interval an "edge" in tskit.
Yep, it's probably best to think of tskit edges as being denormalised for efficiency purposes. So the model is that there's a graph with a set of intervals associated with each edge. Tskit chooses a particular way of encoding that.
Yep, it's probably best to think of tskit edges as being denormalised for efficiency purposes. So the model is that there's a graph with a set of intervals associated with each edge. Tskit chooses a particular way of encoding that.
Yes, I like this take.
I created a new branch linked to this issue. Previously, each edge in the visualizer had a left
and right
attribute. That has now been changed to a bounds
attribute. bounds
is a list of lists (ex. [[left,right],[left,right],...]
". Now, edges with the same parent/child combos are stored as a single edge. Tree highlighting then checks whether any of the bounds match.
One sticking point with completely converting to a DiGraph instead of the MultiDiGraph is how to handle "diamonds". At least when edge_type="ortho"
, I think it's clearest if these as are kept as separate edges. When edge_type="line"
, the two edges will always be on top of one another so I could see arguments for combining them or providing a better way to show that there are two edges present. Currently, the visualizer is using the parent/child combos determined by tskit IDs (2-RE-node ARG), so edges on either side of a diamond are kept separate.
Hmm, I see your point here. Usually, I think it would be natural to combine edges that are actually the same ancestral lineage and keep edges apart that are different individuals/lineages. But not sure what the others think about this.
I think this is a (minor) defect in the tskit spec (see https://github.com/tskit-dev/what-is-an-arg-paper/issues/70#issuecomment-1032552476). Ideally we would have a MultiDiGraph in which each edge connecting a parent to a child can contain several (potentially non-contiguous) intervals. Each "edge" of the graph (which could correspond to multiple tskit-edges) would then represent a single transmission path. In the case of a diamond, there would be 2 edges linking the same parent and child, hence it would make use of the MultiDiGraph rather than the simple Digraph structure. I'm not sure what you would do in the absence of information, but I suppose you might stick each interval on a separate edge.
Regardless, I agree that the answer for visualisation purposes isn't obvious. However, in the (special) case of an msprime-format recombination node, with 2 RE parents, I think we can infer that the left and right paths are different, and so make a specific exclusion for that purpose, showing the LHS image you have above.
In this case, my ideal solution for the edge_type=line
case would be to have 2 curved edges to represent a diamond (I guess that might be tricky, though)
Edit - it would also be useful to document this. Here's some example text to use:
In most cases, if there are multiple transmitted regions (tskit "edges") with the same parent and child, these are shown as a single graph edge composed of several intervals. Note that a recombination node pair (i.e. a pair of nodes marked with the IS_RE_NODE flag) is displayed as a single node in the visualiser, but counts as two nodes for this purpose. In other words, two graph edges are displayed immediately rootward of a the paired node.
For an ARG in the msprime-format, the edges will be kept separate if they have different child IDs. So in the LHS image, edges 21-17 and 21-18 would be stored separately, but edges 17-16 and 18-16 will be combined into a single edge as they are referring to the same path of inheritance. This is currently implemented on the branch alongside stroke-width stylization changes for #43.
The visualizer now always combines edges with the same parent/child combos. This means that if you pass in an ARG in 1-RE node format, diamonds will be lost. To prevent this, we will need slightly different functions between the two ARG formats. One step in making the visualizer more robust outside of the 2-RE node format.
Edit: Missed @hyanwong's comment about curved edges for edge_type="line"
and ended up essentially restating the same thing. I think that's a good solution. I'll make a new issue specifically for adding that feature.
Added arced edges for diamonds. Still need to add documentation for how the edge remapping is being handled.
This has now been merged to main (don't know why it didn't automatically close)!
There is an argument to be had about how to represent multiple tskit "edges" that have the same parent and child node IDs. These could either be overlaid as separate graph edges (i.e. a tskit edge == a graph edge in a "MultiDiGraph" in networkx terminology), or they could be mapped to a single edge with multiple intervals (i.e. multiple tskit edges could map to the same graph edge, and the graph is a simple DiGraph whose edges can contain a list of intervals).
The second one seems like it might work better for viz, even though you lose the exact 1:1 correspondence between tskit edges and graph edges. It might be worth getting input from other tskit developers here...