kitchensjn / tskit_arg_visualizer

Interactive visualization method for ancestral recombination graphs
MIT License

Showing mutations on edges #14

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

There are various possibilities for showing mutations on edges (apart from not displaying them at all). Here are some visual suggestions, I'm sure there are more:

https://github.com/hyanwong/dynamic-ARG-viz/issues/2

kitchensjn commented 1 month ago

Mutation implementation ideas

Some initial thoughts for this issue and two main questions that I've been thinking about.

For example, if we have a simple three sample tree (Top Tree):

1) Middle Tree - A single mutation tick can be added to the midpoint of the edge, indicating that there is at least one mutation along this branch. Then when you hover over the tick, it tells you about all of the mutations along that edge. This solution is the simplest visually, but does not clearly convey the number or timing of mutations.

2) Bottom Tree - Mutations are each given their own tick mark and the y-axis is expanded to include the timings of the mutations. This version is less reliant on popups, though those could still be useful for including more details.

For both of these methods, I don't think that the mutations should be draggable. Their positions should update as you drag the nodes around, but the mutations are not themselves nodes on the graph. They are more like interactive labels that have been added on top of the graph.
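A minimal sketch of the Middle Tree idea, not the visualizer's actual code. It groups mutations by edge and places one summary tick at the edge midpoint. The dict keys (`id`, `parent`, `child`) and the `node_xy` lookup are hypothetical names chosen for illustration. Because the tick position is derived from the node coordinates, recomputing it after a drag keeps the tick attached to the edge without the tick being draggable itself.

```python
from collections import defaultdict


def summary_ticks(mutations, node_xy):
    """Return one tick per edge that carries at least one mutation.

    mutations: list of dicts with hypothetical keys "id", "parent", "child".
    node_xy:   dict mapping node id -> (x, y) in screen coordinates.
    """
    by_edge = defaultdict(list)
    for m in mutations:
        by_edge[(m["parent"], m["child"])].append(m["id"])

    ticks = []
    for (parent, child), ids in by_edge.items():
        (px, py), (cx, cy) = node_xy[parent], node_xy[child]
        ticks.append({
            "edge": (parent, child),
            "x": (px + cx) / 2,    # midpoint of the edge
            "y": (py + cy) / 2,
            "mutation_ids": ids,   # what a hover popup would list
        })
    return ticks


muts = [
    {"id": 0, "parent": 4, "child": 0},
    {"id": 1, "parent": 4, "child": 0},
    {"id": 2, "parent": 3, "child": 2},
]
xy = {0: (0, 100), 2: (80, 100), 3: (60, 40), 4: (20, 0)}
ticks = summary_ticks(muts, xy)
```

The Bottom Tree variant would instead emit one tick per mutation, which needs a rule for where along the edge each tick goes.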

Prototype of visualizer with mutations

Here's a quick prototype of the Middle Tree version without popups as those haven't yet been added.

hyanwong commented 1 month ago

Thanks for this @kitchensjn. The prototype looks great. I definitely agree that mutations shouldn't be draggable.

Personally, I think the default should be to have separate ticks for each mutation. It is fairly important to see visually how many mutations there are along a branch, without hovering over a yellow box to see how many mutations it represents. However, if there are lots of mutations along an edge, I can see the advantage of collapsing them into one: if you feel this is important, I guess you could have a plotting parameter to switch between behaviours?

I think it would also be useful to (again optionally) specify mutation labels. But it could be that the default is not to show any.

For position along the line, note that sometimes mutations in a tree sequence have their own "time" value. Perhaps it would be useful to (optionally?) use this to set the Y positions if the "y_axis_scale" is set to "time" or "log_time"? FYI that often doesn't look great in the tree-by-tree view, though. Otherwise, evenly spaced along the line or clustered in the middle of the line both seem reasonable options.

(edit: just re-read your comments: for the "rank" version in draw_svg() we ignore the times of mutations when calculating the y axis ticks, and simply place the mutation in the middle of each branch. I guess this could mean that mutations on different branches could be in the wrong rank order on the y axis, but 🤷 )
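One way to read the suggestion above, as a rough sketch only: use the mutation's own time when the scale is "time" or "log_time", and fall back to the middle of the branch for "rank" or when no time is available. The function name `branch_fraction` is hypothetical, and using `log1p` for the "log_time" scale is an assumption about how that scale is computed.

```python
import math


def branch_fraction(mut_time, parent_time, child_time, y_axis_scale="rank"):
    """Fraction of the way from child (0.0) to parent (1.0) at which to draw a mutation."""
    # No usable time, or rank scaling: place the mutation mid-branch.
    if mut_time is None or y_axis_scale == "rank":
        return 0.5
    if y_axis_scale == "time":
        transform = float
    elif y_axis_scale == "log_time":
        transform = math.log1p  # assumption: log(1 + t), so that time 0 is defined
    else:
        raise ValueError(f"unknown y_axis_scale: {y_axis_scale!r}")
    low, high = transform(child_time), transform(parent_time)
    return (transform(mut_time) - low) / (high - low)
```

For example, `branch_fraction(2.5, 10, 0, "time")` gives 0.25, while the same mutation under "rank" sits at 0.5.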

hyanwong commented 1 month ago

Re tick marks on the axis: in my experience, it's hard to connect the mutation symbol on the edge with a tick on the axis, so some sort of dynamic highlighting is probably necessary. Either you could (a) show ticks on the axis for all mutations, but highlight one when you mouse over the mutation on the edge or (b) you could simply reveal a yellow bar on the axis when you mouse over a mutation on an edge.

(b) is less cluttered, scales better to lots of mutations, and is probably more in-keeping with the rest of the visualiser. But (a) has the distinct advantage that you can mouse over a tick on the axis to highlight which mutation it corresponds to on the tree. So I'm not sure which is better.

kitchensjn commented 1 month ago

tskit_arg_visualizer-37

For this, I've added a mutations table that includes edge and timing information. This version with "rank" scaling takes the mutation timings into account when determining the vertical positioning. It then determines horizontal positioning and rotation based on the positions of the edge's parent and child. This seems like the easiest way to show mutation ordering along an edge and to avoid clutter when there are many mutations on an edge that would normally be short, which leads to overlapping. I agree that the actual mutation timing is hard to pin down with the y axis being so far to the left, so I can test out your highlighting concept. This all works for "time" and "log_time", though mileage may vary as to whether it looks good. It should be straightforward from here to add labels to each mutation, so that's my next goal for prototyping. Then I will push a version for you to try out.
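A sketch of the two steps described above, under stated assumptions: ranks are taken over the distinct node and mutation times together, and each mark is then placed on the straight line between parent and child. The helper names `rank_times` and `place_on_edge` are hypothetical and are not the visualizer's API.

```python
import math


def rank_times(node_times, mutation_times):
    """Map each distinct time to an integer rank (0 = youngest)."""
    distinct = sorted(set(node_times) | set(mutation_times))
    return {t: rank for rank, t in enumerate(distinct)}


def place_on_edge(parent_xy, child_xy, y):
    """Return (x, angle_in_degrees) for a mark at height y on the parent-child line.

    Assumes the parent and child have different y values, which holds when a
    parent is always strictly older than its child.
    """
    (px, py), (cx, cy) = parent_xy, child_xy
    frac = (y - cy) / (py - cy)             # 0 at the child, 1 at the parent
    x = cx + frac * (px - cx)               # linear interpolation along the edge
    angle = math.degrees(math.atan2(py - cy, px - cx))  # rotate the mark with the edge
    return x, angle


ranks = rank_times([0, 0, 0, 5, 9], [2, 7])
x, angle = place_on_edge(parent_xy=(100, 0), child_xy=(0, 100), y=50)
```

Here the two mutation times (2 and 7) each claim their own rank between the node times, which is what spreads the marks out along an otherwise short edge.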

hyanwong commented 1 month ago

This looks great. I can completely see the rationale behind adding mutation times to the ranks. But what happens if the mutation times are tskit.UNKNOWN_TIME?

Re highlighting, I was actually meaning on the X axis. You might well want to see the position of a mutation (site) on the X axis. I'm not sure how you would show this. Maybe a yellow vertical tick on the genome bar, of half the height of the bar itself?

kitchensjn commented 1 month ago

tskit_arg_visualizer-38

So my idea for when times are unknown was to provide a "plotting time" to any mutations with a missing time. The plotting time is approximately halfway in time between the parent and child node times for the edge, with a little randomness to separate them when there are multiple mutations. Then, to distinguish the mutation times that we don't actually know, I removed their y-axis tick. This lets you plot mutations with times alongside those without, while avoiding overlapping. But I'm not completely sold, as the half position in time is not necessarily the half position on the edge in the event that there are other nodes within that time range for rank scale. (Note: this strategy will need to be modified for time and log_time scales, as currently all of the mutations with unknown time end up stacked on top of one another.)
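The "plotting time" idea could look roughly like this. It is a sketch, not the code in the branch: the function name, the `jitter` parameter, and the returned flag are all hypothetical. tskit encodes unknown mutation times as a special NaN value (`tskit.UNKNOWN_TIME`), so plain `math.isnan` stands in here; real code would more likely call `tskit.is_unknown_time()`.

```python
import math
import random


def plotting_time(mut_time, parent_time, child_time, rng, jitter=0.1):
    """Return (time_to_plot, show_axis_tick) for one mutation.

    Known times are used as-is and keep their y-axis tick. Unknown (NaN) times
    get a value near the middle of the edge, nudged by a random offset of up to
    `jitter` times the edge's time span, and lose their y-axis tick.
    """
    if not math.isnan(mut_time):
        return mut_time, True
    midpoint = (parent_time + child_time) / 2
    span = parent_time - child_time
    offset = rng.uniform(-jitter, jitter) * span
    return midpoint + offset, False


rng = random.Random(0)  # seeded so the layout is reproducible between redraws
known = plotting_time(3.0, 10.0, 0.0, rng)
unknown_t, unknown_tick = plotting_time(math.nan, 10.0, 0.0, rng)
```

Passing in a seeded generator is a design choice for this sketch: it keeps the jittered marks from jumping around each time the figure is regenerated.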

kitchensjn commented 1 month ago

tskit_arg_visualizer-40

So if all of the mutations have unknown times (or if you want to ignore times), you can position all of the mutations evenly on the edge. The spacing should be able to be modified if you prefer the mutations to be clumped in the center of the edge, as you have shown in https://github.com/hyanwong/dynamic-ARG-viz/issues/2. Here, I colored the mutations orange instead of yellow just to indicate that the timing on the branch is unimportant. You'll need to set an appropriate figure height to avoid overlapping the mutation marks. I think this is the cleanest version yet when you don't care about timing.

The issue is mixing mutations with known times alongside those without times. In those cases, I still think the previous version is better, especially when there's a concern about overlapping. Maybe having both styles is the way to go forward...
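For the timing-free layout, even spacing only needs the number of mutations on the edge. The following is a small sketch with hypothetical function names; the `spread` parameter is an assumption showing one way the "clumped in the center" variant could be expressed.

```python
def even_fractions(n_mutations):
    """Fractions along an edge (0 = child end, 1 = parent end) for evenly spaced marks."""
    return [(i + 1) / (n_mutations + 1) for i in range(n_mutations)]


def clustered_fractions(n_mutations, spread=0.2):
    """Same ordering, but squeezed into a band of width `spread` around the midpoint."""
    return [0.5 + spread * (f - 0.5) for f in even_fractions(n_mutations)]
```

With three mutations, `even_fractions(3)` places the marks a quarter, a half, and three quarters of the way along the edge, so no mark lands on either node.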

kitchensjn commented 1 month ago

tskit_arg_visualizer-43

Added labels within the mutation symbols and corresponding lines to the genome bar. To reduce the amount of text written to the screen, I gave each mutation site an index rather than using its full genome position, so the mutation labels are ::. This should match the site index from the tskit table if those are sorted (I think that is forced?). Font sizes have to be made quite small unfortunately to fit everything in without overlaps. For the genome bar ticks, I alternated the site labels top versus bottom just to give more spacing between neighboring labels. Colors will need to be played with to figure out what is most legible with the smaller fonts.

hyanwong commented 1 month ago

Nice! Yes, the position index is ordered, so that should be OK. Site positions are guaranteed to be in order. Note that mutation IDs might not be the same as site position IDs, however, so it might be worth checking an example that has multiple mutations at the same site. Personally I find the yellow axis tick marks a little too emphasised, but my previous suggestion of dropping the yellow lines to half-height (which would have the advantage of not completely obscuring the tree boundaries) means it's harder to alternate the site numbers above and below.
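To illustrate why mutation IDs and site IDs can diverge, here is a toy example using plain Python data rather than a real tree sequence. The `(mutation_id, site_id)` pairs are made up; in tskit, each row of the mutations table records the ID of the site it belongs to.

```python
from collections import Counter

# Hypothetical (mutation_id, site_id) pairs: site 1 carries two mutations,
# so every later mutation ID runs ahead of its site ID.
mutation_site = [(0, 0), (1, 1), (2, 1), (3, 2)]

per_site = Counter(site for _, site in mutation_site)
multi_hit_sites = sorted(site for site, n in per_site.items() if n > 1)
mismatched = [(m, s) for m, s in mutation_site if m != s]
```

A label built from the mutation ID would read 3 for the last mutation, while a label built from the site ID would read 2, which is the kind of case worth checking.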

I wonder if, given the interactive nature of the thing, you can get away with omitting the site_position_index entirely, and just revealing it on hover? If you can make it interactive so that hovering over the mutation in the graph highlights the tick on the x axis, and vice-versa, that might be enough to make the link between the mutation and its site position? Then you could have the yellow lines as half-height. But these are all just personal aesthetic opinions. I think the most important thing is to be able to change the mutation text (including removing it entirely) and the yellow text on the x axis ticks.

Maybe worth seeing if other people from your group have suggestions? Do you want me to post it on one of the tskit forums to get feedback?

kitchensjn commented 1 month ago

I'll make the position index a direct reference to the site ID to keep things consistent.

Any feedback others had would be great! Especially with how interactive versus static the figure should be. I'm definitely interested in leaning more towards interactive with popups and things like that to keep it less cluttered, though it would make it less applicable for generating publication figures. Really depends how it would be most helpful.

hyanwong commented 1 month ago

For publication, the only times I've wanted to visualise a graph are the covid case, which requires restricting to a subgraph (indeed, I think most use-cases of your visualiser will need either to be small examples for teaching, or subgraphs).

In this case, there were multiple mutations along each edge, so it was useful to either summarise them into a single number, or to just have a list of positions / state changes stacked up along each edge. I'm not sure how I would get something similar in tskit_arg_visualizer? Admittedly the viz we used could do with improvements though.

image

p.s. is your current approach in a PR somewhere? If so I can see what it looks like on something similar.

kitchensjn commented 1 month ago

Added "mutations" branch for you to check out!

hyanwong commented 1 month ago

Thanks! Just made a comment about slow conversion in #92 . I think maybe it's a useful addition to allow progress bars, even if you don't officially document it yet?

hyanwong commented 1 month ago

There are some slightly weird effects going on here. I suspect it's because the mutations are placed at exactly the same time as the nodes below them: Screenshot 2024-09-19 at 14 07 16

kitchensjn commented 1 month ago

I forgot that I used mutation count to determine the positioning of the mutation marks along the edge when ignore_mutation_times=True. In the Pull Request discussion, I told you to comment out the lines that calculated the mutation count but that has caused the scaling to be off (see that some mutation marks look disassociated from any of the edges). In 31c4be1, I've updated it to calculate the mutation count in draw() and draw_node(). With draw_node(), this should be much faster than before because you are plotting only a subset of the edges rather than the entire ARG. Hopefully that fixes it, but let me know if it turns out to be something else!

hyanwong commented 1 month ago

Ah yes, much better, thanks!

Some way to make the mutation rectangles along the edges a bit smaller and more discreet (e.g. without labels) might be handy for larger instances, like the one below. Perhaps I can simply remove the text in the accompanying data frame?

Screenshot 2024-09-20 at 07 27 31

kitchensjn commented 3 weeks ago

tskit_arg_visualizer-50

d3arg.draw(
    ...,
    show_mutations=True,
    ignore_mutation_times=True,
    include_mutation_labels=False
)

Messy ARG example, but I've added a few more parameters to draw() and draw_node(); I've included them all here, though the last two are the default values. Hovering over a mutation shows its tick on the genome bar, rather than showing all of the ticks at once. Clicking the mutation locks it as active so that you can look at more than one at a time or take a picture.

We could potentially give the user the ability to change the mutation label (as we did for node labels), but an important note is that the size of the mark would need to be dynamic as well, to make sure it fits around whatever label they choose. Alternatively, we could just have the labels outside of the tick marks to avoid this problem. It kind of depends how often you would be modifying specific labels versus removing all of them, as include_mutation_labels=False does.
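A rough sketch of what a dynamically sized mark could involve: estimate the rendered width from the label length and font size, then add padding. The function name and the `char_aspect` value (average glyph width as a fraction of font size) are assumptions; a browser would normally measure the rendered text directly rather than estimate it.

```python
def mark_size(label, font_size=8, char_aspect=0.6, padding=2):
    """Estimate (width, height) in pixels of a rectangle that fits around a label."""
    width = len(label) * font_size * char_aspect + 2 * padding
    height = font_size + 2 * padding
    return width, height


short = mark_size("A5T")
longer = mark_size("A12345T")
empty = mark_size("")  # no label: the mark shrinks to just its padding
```

An empty label collapsing to a small padded mark is roughly what removing all labels would give for larger graphs.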

hyanwong commented 3 weeks ago

That's fabulous. Thanks a lot. I'll have a play.

kitchensjn commented 3 weeks ago

With some more testing, I found it sometimes hard to remember which mutation corresponded to the tick on the genome bar since they are all red without a corresponding label. Latest commit picks a random color every time you click the mutation so that it's easier to differentiate.

kitchensjn commented 1 week ago

Alright, yesterday I had some time to work on popups (#50) for the mutations. Relevant to mutations, I've removed the label on the genome bar mark and instead show that label in the tooltip. I've also removed the ability to click and lock mutations because it plays strangely with the tooltip. So far, I think that this gives us the most flexibility in what is displayed without being cluttered, and I am partial to removing the include_mutation_labels parameter (I at least don't see a scenario where that would be more useful than the tooltip). The tooltip has been added to the mutations branch if you are interested in trying it out. It should work both standalone and in notebooks.

hyanwong commented 1 week ago

Good work. Re mutation labels, it is definitely useful to be able to statically display the details for all shown mutations on branches, but I think you are just talking here about the hovering over the genome bar, right?

Here's an example I posted recently on the covid discussion slack: you can see how it is useful to see the mutations on the branches, but much less so on the genome bar. In tskit's draw_svg() method, one of the main uses for displaying on the genome bar is purely to highlight which sites have multiple mutations (which is hard to tell from the branch labels).

image

hyanwong commented 1 week ago

Seems like it is getting to the point where you could merge the mutations branch into main?