nextstrain / auspice

Web app for visualizing pathogen evolution
https://docs.nextstrain.org/projects/auspice/
GNU Affero General Public License v3.0
292 stars 163 forks source link

Split and group subtrees by state. #1008

Open evogytis opened 4 years ago

evogytis commented 4 years ago

Context

Phylogenetic trees are inherently difficult to interpret. Whilst colouring the tree by inferred ancestral state (geographic location, host, resistance, etc) depicts the full history of state changes for every lineage across the tree it is nonetheless not easy to interpret, especially for non-experts. By showing migrations on a map the interpretation issue is ameliorated but only addresses the number of state changes and locations involved, leaving the user to interpret the fate of lineages in the new state from the tree by themselves.

By splitting and grouping lineages that have been introduced into a new location for as long as they stay in the same location one can get a sense of how many times lineages have been introduced and how well they did in the new location. Doing so gives a very visual interpretation of how suitable locations are for local transmission and intensity of migration over time. Locations that are poorly connected but suitable for local transmission would be indicated by few but large subtrees, locations that are well connected but unsuitable for local transmission as many small subtrees and so on.

Description

  1. Find all branches that have switched state of interest e.g. location of branch 2 is inferred to be in location B but parental branch 1 is inferred to be in location A.
  2. Initiate a subtree traversal starting from branch 2 with the condition that only branches inferred to also be in location B are to be traversed. After the traversal this leads to some nodes only having one child (because the other child migrated elsewhere) which might not be a problem for auspice. Sometimes both children of a node migrate elsewhere which can be problematic if branch positions are found by relying on tip order from a post-order traversal.
  3. Collect all conditionally traversed subtrees resulting from migrations to location B, display them as a group. Maybe give some indication near the root of the subtree where the introduction came from (like a tooltip or a coloured circle).
  4. Repeat for all other locations, distribute subtree groups along the y axis based on introduction location, so subtrees 1...n for introductions to location A at the top, subtrees 1...m for introductions to location B below those, etc.

Examples

Martha Nelson's work is the first example where I have found this type of visualisation: Swine flu


Here's a more extensive Ebola example though. This tree contains all of the information about the migration history of Ebola virus in West Africa during the epidemic in 2014-2016 but is not easy to interpret (even with a better aspect ratio).

However, breaking apart the tree using the method described above at the country level shows an easily interpretable picture: Ebola

Subtrees inferred to be in Guinea (green) are small but there's many of them, suggesting frequent introductions without much amplification of viral lineages via local transmission. Introductions originate (indicated by circle colour at the base of each subtree) from both Sierra Leone (blue) and Liberia (red). There are fewer subtrees in Sierra Leone (blue) but one of them is enormous, indicating that introductions were rarer (and primarily from Guinea) but that local transmission in Sierra Leone was rampant, leading to a local outbreak that is largely dominated by the one single lineage. Liberia (red) sits somewhere in the middle - its outbreak too is dominated by a single introduction but it is smaller and shorter with a few small introductions.

The splitting and grouping of subtrees visually explains the case data that we see perfectly - Guinea had a low level but protracted outbreak because Ebola virus was not able to spread as efficiently and instead cases were primarily generated via multiple introductions. Sierra Leone had a very intense peak with ~700 new cases reported per week, fuelled almost entirely by local transmission with Liberia being similar but slightly lower. I don't have a figure for number of cases but you can see everything together in this video.

Repeating the same process of splitting and grouping subtrees at a smaller level (administrative division level rather than country level) was even more illuminating though not looking quite as good. Even though we saw how the virus population was distinctly distributed between the three countries no such distinct clustering exists within each of the countries (with maybe the exception of Liberia). Importantly this answers a question many had at the time of the epidemic which is what role urban centres were playing in the region - were they persistently infected and seeding outbreaks elsewhere in the country or were the outbreaks short-lived everywhere. It was the latter.


Another example is MERS coronavirus. This is a viral infection that has been jumping from camels into humans since 2012 but the virus unlike SARS-CoV-2 doesn't transmit between humans well. Once can look at evolutionary history of MERS-CoV infections in humans (blue) and camels (orange, obviously) jointly: MERS full

or one can split and group lineages based on host change and see clearly that cross-species transmissions are only camel-to-human and very unsuccessful in humans:

MERS divided

Possible solution

One solution is to recompute y axis coordinates for every subtree such that they end up close to each other based on location and then hide their parental branches so there's no criss-crossing of vertical lines across subtrees.

Another has been discussed previously by Trevor and James before. Since auspice JSONs cannot contain multiple trees the most straightforward solution is to produce a JSON that contains all of the subtrees as descendants of the root node with the branch leading from the root to the base of the subtree invisible (I think this is possible). This might need to happen at augur level.

jameshadfield commented 4 years ago

Thanks Gytis ⭐️ An auspice implementation would be desirable as it would allow us to group subtrees interactively based on any non-continuous trait which is set on internal branches (i.e. augur traits has been run in the preparation of the dataset). Implementation wise, this seems simplest via (a) restricting it to rectangular layouts, (b) marking state-change branches as hidden, (c) writing a novel y-value setter for the tree (but of course the devil is in the details!). Importantly this wouldn't require modifying the underlying nodes structure in auspice.

frogsquire commented 4 years ago

If nobody else is working on this, and it's still a priority, I'd like to take a shot at the auspice implementation of this. I've been thinking about it for a few days and I think I understand what's required technically.

jameshadfield commented 4 years ago

@frogsquire That would be great! Might be worth posting periodic screenshots or questions here for @evogytis and I ensure it's along the lines we were envisaging. Thanks :)

frogsquire commented 4 years ago

@jameshadfield @evogytis I'd like to provide a quick update on this, in the hopes of making sure I'm on the right path, and resolving any issues early:

Over the last few days, I've been learning how we pass information from the UI through change() to redraw the tree, and how we calculate/store the y-values that I'll need to update in order to separate nodes by trait.

For the moment, I've implemented the UI as a checkbox - "split tree by colored-by trait". I felt this was a reasonable restriction since in the examples provided, color is largely what's used to differentiate the sections of the tree, and the color-by list already excludes traits we wouldn't want the user to do this by. (I haven't yet handled having this checkbox only appear/enable when in the rectangular layout.)

When the user checks this box, I trigger a redraw of the tree layout which will calculate a new y-value off of the precomputed y-value that is already present on the node. (The original y-value is itself computed when auspice is started up on page load, but since we can't know what trait the user will filter by then, or what will be visible on the graph, we can't compute the necessary offset then.) I'm using a temporary dictionary of unique values to store the offsets per-trait - this is the part I am still working on.

I'd be happy to hear feedback in terms of intended UI, general approach, and, if you find it worth the time to a look at the (admittedly not yet ready for use or testing) PR, the code.

jameshadfield commented 4 years ago

Thanks for the update! I've been looking at https://github.com/nextstrain/auspice/compare/master...frogsquire:1008-split-by-state (and checked it out locally to test). If you could submit a PR to auspice (it's no problem being in an unfinished state) that would make things a bit easier on our side.

For the moment, I've implemented the UI as a checkbox

This seems very sensible (In the future we can explore extend this to a trait other than the selected color-by).

I haven't yet handled having this checkbox only appear/enable when in the rectangular layout.

Not a problem. We'll also want to ensure that the color-by has values on internal nodes as this is necessary for this behavior.


The current "offsets" approach is producing some weird results, partly due to offsets only having a single key for each possible state, where in reality there can be many subtrees which start the same "key". For me, a better approach (at least one to consider) would be to completely recompute the node.yvalue across the tree on toggle (this should also allow a d3 transition to be run). (cc @evogytis for the best approach here) I'm imagining something along the lines of:

// pseudocode, based off the auspice function `setYValuesRecursively`
const subtreeQueue = [rootNode]
const setYValsForExplodedTree = (node, yCounter) => {
    if (node.children) {    
        for (let i=node.children.length-1; i>=0; i--) {
            if (getTraitValue(node.children[i]) !== currentTraitValue) {
                subtreeQueue.push(node.children[i]);
            } else {
                yCounter = setYValsForExplodedTree(node.children[i], yCounter);
            }
        }
    } else {
        node.n.yvalue = ++yCounter;
        node.yRange = [yCounter, yCounter];
        return yCounter;
    }
    /* if here, then all children have yvalues, but we dont. */
    node.n.yvalue = node.children.reduce((acc, d) => acc + d.n.yvalue, 0) / node.children.length;
    node.yRange = [node.n.children[0].yvalue, node.n.children[node.n.children.length - 1].yvalue];
    return yCounter;
}
let currentMaxY = 0;
while (subtreeStack.length) {
    const n = subtreeQueue.pop();
    currentMaxY = setYValsForExplodedTree = (n, currentMaxY);
}

This could then be triggered within the change() function using the logic you've implemented, i.e.

// pseudocode
if (this.params.splitTreeByTrait) {
   // call the y-value recomputation algorithm defined above 
} 

And then this should "Just Work" and be a lot faster (it probably won't just work!) and shouldn't require any modification to the actual d3 calls at all.

If this works, we can add optimisations such as caching the non-exploded yvalues to speed things up and play with d3 transitions as needed.

evogytis commented 4 years ago

Yes, node.yvalue should be recomputed because that will assign children under the same trait to the same y value as the parent (so it looks like a node with a single child), thus preserving y axis space and look much neater than continuously cascading branches with missing children.

The way I've done it before (cell 7 in https://github.com/evogytis/baltic/blob/master/docs/austechia.ipynb) is by computing subtree yvalues first (so they go from 0 to N leaves) and then iterating through subtrees sorted by landing state (and minimum xvalue in subtree) and then adding a cumulative y offset to every yvalue in the subtree. Substantially less elegant than your solution. With the pseudocode you posted the only missing ingredient I think is adding sorting of subtreeQueue by trait.

Something that should be accounted for in the code is what I call "hanging nodes" - they're subtrees without leaves in the same trait state. I ignore them because they are induced solely via the model. E.g. model says nodes can't go A>C so you end up going through an intermediate state B (A>B>C) which doesn't have leaves in state B (and so no yvalues). How you choose to handle these is subjective.

Also apologies if I've completely missed the mark, I'm very unfamiliar with auspice's internals.

frogsquire commented 4 years ago

For me, a better approach (at least one to consider) would be to completely recompute the node.yvalue across the tree on toggle (this should also allow a d3 transition to be run). (cc @evogytis for the best approach here) I'm imagining something along the lines of:

I did come across the original setYValuesRecursively function as I was looking into this. I'll try this approach. I see how this will produce a tree where the subtrees are separated, but it seems to me we will still need some kind of an offset between each subtree and the next, so that they look spaced appropriately... or will the d3 transition handle spacing of the nodes from a visual standpoint?

With the pseudocode you posted the only missing ingredient I think is adding sorting of subtreeQueue by trait.

How do we want to sort? Alphabetically? By order they're encountered in the tree? etc.

Something that should be accounted for in the code is what I call "hanging nodes" - they're subtrees without leaves in the same trait state. I ignore them because they are induced solely via the model. E.g. model says nodes can't go A>C so you end up going through an intermediate state B (A>B>C) which doesn't have leaves in state B (and so no yvalues). How you choose to handle these is subjective.

In that case, I think we could do one of a few things:

I suppose we could show B as a separate subtree, but that doesn't strike me as visually useful?

With the (pseudo)-code @jameshadfield and I have done so far, B (which would be a kind of "internal node" in auspice, I think?) will get shown as a separate subtree because the node will have an independent trait. Not sure how that'll look in the end.

jameshadfield commented 4 years ago

it seems to me we will still need some kind of an offset between each subtree and the next

Yes, I think so. This could be achieved within the while loop above via something like

while (subtreeQueue.length) {
    const n = subtreeQueue.pop();
    currentMaxY = setYValsForExplodedTree = (n, currentMaxY);
    currentMaxY += subtreeSeparationConstant;
}

How do we want to sort? Alphabetically?

cc @evogytis what do you recommend here?

hanging nodes

Without handling this case, I think this'll be a separate subtree. I think we follow Gytis here and don't show them, which would require marking the node as hidden in auspice. Can leave this until last.

evogytis commented 4 years ago

How do we want to sort? Alphabetically?

cc @evogytis what do you recommend here?

I think sorting by first introduction would be quite natural, though it's entirely arbitrary.

My €0.02 with offsets is that it's an easy thing to implement and it might not be necessary. Exploding the tree doesn't require extra y-axis space and the location colour of each subtree should convey boundaries between groups of subtrees. I'm not sure if changing the range of y-axis values will cause issues via altered aspect ratio in the tree panel but I'd say that if the current nextstrain tree panel aspect ratio is not the best for interpretation to begin with then this branch is not where it needs to be fixed.

frogsquire commented 4 years ago

I think sorting by first introduction would be quite natural, though it's entirely arbitrary.

By first introduction, you mean by whichever state has the node closest to the root of the tree? I believe that'd be off the inferred date trait.

My €0.02 with offsets is that it's an easy thing to implement and it might not be necessary

I'll see how it looks and whether we end up needing them.

I should have time to get some more done on this over the next few days.

evogytis commented 4 years ago

I think sorting by first introduction would be quite natural, though it's entirely arbitrary.

By first introduction, you mean by whichever state has the node closest to the root of the tree? I believe that'd be off the inferred date trait.

Yes, the trait here would be num_date. So the state of the the root is shown first, along with all of the subtrees that were reintroduced into the root's location from elsewhere, followed by the state that happened to be the location where the first export landed and subsequent introductions into it, etc.

frogsquire commented 4 years ago

@jameshadfield @evogytis Added a PR here: https://github.com/nextstrain/auspice/pull/1105

Not sure why it doesn't want to automatically link, but I can't seem to manually add it either - perhaps I don't have permissions?

Anyhow, an update on my progress and quite a few points in there. I'll continue working, so no pressure if you are busy, but would be happy to hear any feedback.