tskit-dev / tskit

Population-scale genomics
MIT License

Polytomy collapsing #3011

Closed hyanwong closed 1 week ago

hyanwong commented 1 week ago

Here's a quick demo of the visual scheme I have come up with for condensing trees with polytomies, so we show only the lineages relating to a set of tracked samples (tips in cyan). Such samples might represent (say) a geographical region, or a covid Pango lineage. Here's an example, followed by the suggested scheme:

Screenshot 2024-10-07 at 19 07 59

Condensed:

Screenshot 2024-10-07 at 23 29 23

Two things are going on here:

  1. nodes at the top of a clade of entirely untracked samples (here node 36) are collapsed into a triangle showing the number of samples underneath as "+n" (as in #3010).
  2. where there are 2 or more lineages containing entirely untracked nodes that are part of a polytomy, that polytomy is collapsed into a dotted line (followed by "+n/m" where n is the number of samples and m is the number of additional branches in the polytomy)
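To make the two rules concrete, here is a minimal sketch in plain Python. It is not the tskit implementation: the tree is a hypothetical dict mapping each internal node to its children, `summarise_untracked` is a name made up for illustration, and I'm assuming that a single untracked tip is left as an ordinary tip rather than drawn as a triangle. With a real `tskit.Tree`, the per-node counts would come from `tree.num_samples(u)` and `tree.num_tracked_samples(u)` instead.

```python
def summarise_untracked(children, tracked, root):
    """Return ({node: "+n"} triangles, {node: "+n/m"} polytomy summaries).

    children: dict of internal node -> list of child nodes (hypothetical layout)
    tracked:  set of tracked sample (leaf) IDs
    """
    n_samples, n_tracked = {}, {}
    # Iterative postorder pass to count samples and tracked samples under each node
    stack = [(root, False)]
    while stack:
        u, children_done = stack.pop()
        kids = children.get(u, [])
        if not kids:
            n_samples[u] = 1
            n_tracked[u] = int(u in tracked)
        elif children_done:
            n_samples[u] = sum(n_samples[c] for c in kids)
            n_tracked[u] = sum(n_tracked[c] for c in kids)
        else:
            stack.append((u, True))
            stack.extend((c, False) for c in kids)

    triangles, polytomies = {}, {}
    for u, kids in children.items():
        if n_tracked.get(u, 0) == 0:
            continue  # entirely untracked (or unreachable): handled at its parent
        untracked = [c for c in kids if n_tracked[c] == 0]
        if len(untracked) >= 2:
            # Rule 2: merge all untracked lineages of this polytomy into one dotted line
            n = sum(n_samples[c] for c in untracked)
            polytomies[u] = f"+{n}/{len(untracked)}"
        elif len(untracked) == 1 and children.get(untracked[0]):
            # Rule 1: a lone untracked clade becomes a triangle labelled "+n"
            c = untracked[0]
            triangles[c] = f"+{n_samples[c]}"
    return triangles, polytomies


# Toy tree: tips 0-7, tips 0 and 1 are tracked
children = {8: [0, 1], 9: [2, 3], 10: [8, 9], 11: [4, 5], 12: [10, 11, 6, 7]}
triangles, polytomies = summarise_untracked(children, tracked={0, 1}, root=12)
print(triangles)   # {9: '+2'}
print(polytomies)  # {12: '+4/3'}
```

Note that a node with tracked descendants and two or more untracked child lineages necessarily has at least three children, so rule 2 only ever fires at a polytomy.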

Optionally (3rd plot), we can also collapse nodes that consist of entirely tracked samples (here node 39) into a triangle/trapezium:

Screenshot 2024-10-07 at 23 31 24

Does this look like a reasonable approach? I'm not sold on the "+n/m" notation but it was the most succinct/consistent that I could come up with.

hyanwong commented 1 week ago

Here's the viz run on a random covid pangolin lineage:

Screenshot 2024-10-08 at 00 16 00

Once we have defined a postorder_minlex_tracked_node_traversal function, the plot above is produced using e.g.

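```python
# `ts` (a tree sequence) and `ti` (an object exposing `pango_lineage_samples`)
# are assumed to be defined already. `postorder_minlex_tracked_node_traversal`
# and the `summarise_untracked_polytomies` argument are the additions proposed
# in this issue.
pango = "B.1.1.70"
tracked_nodes = ti.pango_lineage_samples[pango]
tree = ts.first(tracked_samples=tracked_nodes)
order = list(postorder_minlex_tracked_node_traversal(tree, collapse_tracked=False))
print(len(order), f"nodes in subtree. Nodes in magenta are {pango}")
tree.draw_svg(
    time_scale="rank",
    order=order,
    size=(1000, 800),
    # Label only the untracked nodes with their Pango assignment
    node_labels={
        u: ts.node(u).metadata.get("Viridian_pangolin", "")
        for u in order
        if u not in tracked_nodes
    },
    mutation_labels={},
    symbol_size=4,
    summarise_untracked_polytomies=True,
    style=(
        # Colour the tracked nodes (plus node 39) magenta
        "".join(f".n{u} > .sym {{fill: magenta}}" for u in tracked_nodes + [39]) +
        ".lab.summary {font-size: 9px}" +
        ".polytomy {font-size: 10px}"
    ),
)
```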
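The traversal function itself is not shown in this thread, so the following is only a guess at what it might do, written against a hypothetical dict-of-children tree rather than a `tskit.Tree`. The name `tracked_postorder` is made up, and "minlex" is interpreted here as visiting children in order of the smallest tip ID beneath them. The sketch yields nodes in postorder but does not descend below a clade with no tracked samples (or, with `collapse_tracked=True`, below a clade made up only of tracked samples), so the top of each such clade is the only node that reaches the drawing code.

```python
def tracked_postorder(children, tracked, root, collapse_tracked=False):
    """Yield node IDs in postorder, pruning below collapsible clades.

    children: dict of internal node -> list of child nodes (hypothetical layout)
    tracked:  set of tracked sample (leaf) IDs
    """
    stats = {}  # node -> (smallest tip ID beneath, n samples, n tracked samples)

    def fill(u):
        kids = children.get(u, [])
        if not kids:
            stats[u] = (u, 1, int(u in tracked))
            return
        for c in kids:
            fill(c)
        stats[u] = (
            min(stats[c][0] for c in kids),
            sum(stats[c][1] for c in kids),
            sum(stats[c][2] for c in kids),
        )

    def visit(u):
        _, n, k = stats[u]
        # Stop at the top of an all-untracked clade, and optionally an all-tracked one
        prune = k == 0 or (collapse_tracked and k == n)
        if not prune:
            for c in sorted(children.get(u, []), key=lambda c: stats[c][0]):
                yield from visit(c)
        yield u

    fill(root)  # recursive, so only suitable for small demo trees
    yield from visit(root)


# Toy tree: tips 0-7, tips 0 and 1 are tracked; child lists deliberately unsorted
children = {8: [1, 0], 9: [2, 3], 10: [9, 8], 11: [4, 5], 12: [7, 11, 10, 6]}
order = list(tracked_postorder(children, {0, 1}, root=12))
print(order)  # [0, 1, 8, 9, 10, 11, 6, 7, 12]
collapsed = list(tracked_postorder(children, {0, 1}, root=12, collapse_tracked=True))
print(collapsed)  # [8, 9, 10, 11, 6, 7, 12]
```

Whether the real function also drops the untracked lineages that end up merged into a dotted polytomy line, or leaves that to `draw_svg`, is not stated above; this sketch keeps them in the order.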

hyanwong commented 1 week ago

And here's a path to a pango lineage represented by a single sample:

Screenshot 2024-10-08 at 00 39 57

jeromekelleher commented 1 week ago

Looks great Yan!

hyanwong commented 1 week ago

> Looks great Yan!

Great, thanks. I'll work it into a PR.

hyanwong commented 1 week ago

The main issue, to which there is no easy solution, is when we have a huge polytomy of (say) 1000 lineages, 999 of which contain entirely (or mostly) focal (tracked) samples, and one of which does not. We can't visually collapse parts of such a polytomy in a meaningful way: either we collapse the whole thing, or we have to show all the focal lineages, as we don't know how they relate to each other. For example, here's the top of the B.1.1.7 (alpha) lineage from a covid tree:

Screenshot 2024-10-08 at 15 51 27

I think this is an insoluble issue, so I'm happy to punt it down the line.