kitchensjn / tskit_arg_visualizer

Interactive visualization method for ancestral recombination graphs
MIT License
11 stars 3 forks source link

Mapping multiple tskit edges with the same parent/child to a single graph edge #42

Closed hyanwong closed 11 months ago

hyanwong commented 12 months ago

There is an argument to be had about how to represent multiple tskit "edges" that have the same parent and child node IDs. These could either be overlaid as separate graph edges (i.e. a tskit edge == a graph edge in a "MultiDiGraph" in networkx terminology), or they could be mapped to a single edge with multiple intervals (i.e. multiple tskit edges could map to the same graph edge, and the graph is a simple DiGraph whose edges can contain a list of intervals).

The second one seems like it might work better for viz, even though you lose the exact 1:1 correspondence between tskit edges and graph edges. It might be worth getting input from other tskit developers here...

fbaumdicker commented 12 months ago

I would strongly support the addition of such a mapping of single edges with multiple intervals. This would enable a bunch of neat subsequent features, including highlighting all intervals when hovering over the edges and the visualization of edge importance or weights.

hyanwong commented 12 months ago

Jerome is keen on representing the ARG as a DiGraph, not a MultiDiGraph, as it makes the graph theoretical aspects somewhat easier. I.e. he agrees with you here @fbaumdicker.

In retrospect, it seems perhaps unfortunate to call each separate interval an "edge" in tskit.

jeromekelleher commented 12 months ago

Yep, it's probably best to think of tskit edges as being denormalised for efficiency purposes. So the model is that there's a graph with a set of intervals associated with each edge. Tskit chooses a particular way of encoding that.

hyanwong commented 12 months ago

Yep, it's probably best to think of tskit edges as being denormalised for efficiency purposes. So the model is that there's a graph with a set of intervals associated with each edge. Tskit chooses a particular way of encoding that.

Yes, I like this take.

kitchensjn commented 12 months ago

I created a new branch linked to this issue. Previously, each edge in the visualizer had a left and right attribute. That has now been changed to a bounds attribute. bounds is a list of lists (ex. [[left,right],[left,right],...]". Now, edges with the same parent/child combos are stored as a single edge. Tree highlighting then checks whether any of the bounds match.

Screenshot 2023-09-21 at 11 55 19 AM Screenshot 2023-09-21 at 11 57 29 AM

One sticking point with completely converting to a DiGraph instead of the MultiDiGraph is how to handle "diamonds". At least when edge_type="ortho", I think it's clearest if these as are kept as separate edges. When edge_type="line", the two edges will always be on top of one another so I could see arguments for combining them or providing a better way to show that there are two edges present. Currently, the visualizer is using the parent/child combos determined by tskit IDs (2-RE-node ARG), so edges on either side of a diamond are kept separate.

fbaumdicker commented 12 months ago

Hmm, I see your point here. Usually, I think it would be natural to combine edges that are actually the same ancestral lineage and keep edges apart that are different individuals/lineages. But not sure what the others think about this.

hyanwong commented 12 months ago

I think this is a (minor) defect in the tskit spec (see https://github.com/tskit-dev/what-is-an-arg-paper/issues/70#issuecomment-1032552476). Ideally we would have a MultiDiGraph in which each edge connecting a parent to a child can contain several (potentially non-contiguous) intervals. Each "edge" of the graph (which could correspond to multiple tskit-edges) would then represent a single transmission path. In the case of a diamond, there would be 2 edges linking the same parent and child, hence it would make use of the MultiDiGraph rather than the simple Digraph structure. I'm not sure what you would do in the absence of information, but I suppose you might stick each interval on a separate edge.

Regardless, I agree that the answer for visualisation purposes isn't obvious. However, in the (special) case of an msprime-format recombination node, with 2 RE parents, I think we can infer that the left and right paths are different, and so make a specific exclusion for that purpose, showing the LHS image you have above.

In this case, my ideal solution for the edge_type=line case would be to have 2 curved edges to represent a diamond (I guess that might be tricky, though)

Screenshot 2023-09-22 at 17 09 41

Edit - it would also be useful to document this. Here's some example text to use:

In most cases, if there are multiple transmitted regions (tskit "edges") with the same parent and child, these are shown as a single graph edge composed of several intervals. Note that a recombination node pair (i.e. a pair of nodes marked with the IS_RE_NODE flag) is displayed as a single node in the visualiser, but counts as two nodes for this purpose. In other words, two graph edges are displayed immediately rootward of a the paired node.

kitchensjn commented 12 months ago

For an ARG in the msprime-format, the edges will be kept separate if they have different child IDs. So in the LHS image, edges 21-17 and 21-18 would be stored separately, but edges 17-16 and 18-16 will be combined into a single edge as they are referring to the same path of inheritance. This is currently implemented on the branch alongside stroke-width stylization changes for #43.

The visualizer now always combines edges with the same parent/child combos. This means that if you pass in an ARG in 1-RE node format, diamonds will be lost. To prevent this, we will need slightly different functions between the two ARG formats. One step in making the visualizer more robust outside of the 2-RE node format.

Edit: Missed @hyanwong's comment about curved edges for edge_type="line" and ended up essentially restating the same thing. I think that's a good solution. I'll make a new issue specifically for adding that feature.

kitchensjn commented 11 months ago
Screenshot 2023-09-26 at 10 42 08 AM

Added arced edges for diamonds. Still need to add documentation for how the edge remapping is being handled.

kitchensjn commented 11 months ago

This has now been merged to main (don't know why it didn't automatically close)!