command proposal to stitch together multiple trees

jameshadfield commented 5 years ago

This is a proposal for a new augur command, augur stitch, which takes multiple trees and combines them. This is -- usually but not exclusively -- what you want when visualising exploded trees.

How are exploded trees defined in the JSON (for auspice v2)

If a hidden property, which can take the values "always", "timetree" or "num_date", is defined on a node in the tree, then that node won't be rendered by auspice in the relevant views. You can see a proof-of-principle of this at [1] -- toggle between TIME & DIVERGENCE to see.

If the hidden property is present in a node-data JSON then augur export will assign this accordingly. (It may not yet, but will before the v6 release.) This allows custom node attributes / metadata to define which nodes are hidden from a single tree.
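A minimal sketch of the node-data route described above, assuming the usual node-data layout with a top-level "nodes" key. The node names and the helper make_hidden_node_data are hypothetical, and the set of allowed values is simply the one quoted in this proposal.

```python
import json

# Values quoted in this proposal; treat the exact set as an assumption.
ALLOWED = {"always", "timetree", "num_date"}


def make_hidden_node_data(hidden_by_node):
    """Build a node-data dict of the form {"nodes": {name: {"hidden": value}}}."""
    for name, value in hidden_by_node.items():
        if value not in ALLOWED:
            raise ValueError(f"unsupported hidden value {value!r} for node {name}")
    return {"nodes": {name: {"hidden": value} for name, value in hidden_by_node.items()}}


# Hypothetical node names, for illustration only.
node_data = make_hidden_node_data({"NODE_0000001": "always", "NODE_0000007": "timetree"})
print(json.dumps(node_data, indent=2))
```

The resulting JSON could then be passed to augur export alongside the other node-data files, once export honours the hidden property.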

Why do we need a new command?

If one has multiple trees then they must be combined at a step prior to augur export. There are multiple scenarios where you'd want to do this:

you have divergent sets of sequences and you want to align each set independently
you want to use different clock rates (via different refine runs)
you are concatenating non-reassorting segmented genomes (what we're doing in the Seattle flu study right now).

We could combine this into augur export but I fear that'd over-complicate its usage. I propose augur stitch which behaves something like (improvements most welcome!): EDIT: see comment below for, I think, a better syntax

augur stitch
    --input <newick> [<json> <json> <json> ...]
    --input <newick> [<json> <json> <json> ...]
    [--ancestral-relationship <newick> [<json>]]
    --output-tree <newick> --output-node-data <json>

Where the --ancestral-relationship defines the relationship between the provided --input trees. If this is missing then there's a polytomy at the root which is hidden for auspice (i.e. exploded tree). Note that you can hide this polytomy for the temporal view only (or vice versa) -- see [1].
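To make the default case (no --ancestral-relationship) concrete, here is a rough sketch that joins Newick strings under a new root polytomy and marks that root as hidden. It is plain string handling, not an augur API; the function name stitch_at_root, the root name STITCH_ROOT and the example trees are all made up for illustration.

```python
def stitch_at_root(newicks, root_name="STITCH_ROOT"):
    """Join Newick strings under one new root, giving a polytomy."""
    subtrees = []
    for nwk in newicks:
        body = nwk.strip()
        if not body.endswith(";"):
            raise ValueError("each Newick string must end with ';'")
        subtrees.append(body[:-1])  # drop the trailing ';'
    stitched = "(" + ",".join(subtrees) + ")" + root_name + ";"
    # Mark the artificial root as hidden (assumed node-data layout).
    node_data = {"nodes": {root_name: {"hidden": "always"}}}
    return stitched, node_data


# Made-up example trees.
tree_a = "((A1:0.1,A2:0.2)NODE_0000001:0.05,A3:0.3)NODE_0000000;"
tree_b = "(B1:0.1,B2:0.1)NODE_0000000;"
stitched, node_data = stitch_at_root([tree_a, tree_b])
print(stitched)
```

Note that both subtrees still carry a NODE_0000000 after stitching, so internal node names would need to be made unique before the node data can be combined. Whether the branches leading from the new root also need hiding is left open here.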

Salient points:

The command must know which node-data JSONs belong to which tree, as all trees will (probably) have internal node names NODE0001 etcetera.
A use case not addressed here, but which should be, is if you want one node to be replaced with a different tree -- you could imagine a ZEBOV tree like [1] but where the west african node is actually a link to the entire nextstrain.org/ebola tree, and the DRC18/19 node is replaced with the data from https://nextstrain.org/community/inrb-drc/ebola-nord-kivu. This would involve some cool improvements to auspice one day 😃
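One possible way to handle the first point is to namespace auto-generated internal node names by tree while merging the node data. The sketch below assumes internal names match NODE followed by digits, and uses a hypothetical tree label as the prefix; the labels, strain name and attribute values are invented. The same renaming would have to be applied to the stitched tree itself.

```python
import re

# Assumption: auto-generated internal names look like NODE0001 or NODE_0000001.
INTERNAL = re.compile(r"^NODE_?\d+$")


def namespaced(name, label):
    """Prefix auto-generated internal node names with the tree label."""
    return f"{label}_{name}" if INTERNAL.match(name) else name


def merge_node_data(per_tree):
    """Merge node data; per_tree maps a tree label to a list of node-data dicts."""
    merged = {}
    for label, node_datas in per_tree.items():
        for node_data in node_datas:
            for name, attrs in node_data.get("nodes", {}).items():
                merged.setdefault(namespaced(name, label), {}).update(attrs)
    return {"nodes": merged}


# Invented labels and values, for illustration only.
merged = merge_node_data({
    "ha": [
        {"nodes": {"NODE0001": {"clade": "3c"}, "A/Seattle/1": {"clade": "3c"}}},
        {"nodes": {"NODE0001": {"num_date": 2018.2}}},
    ],
    "na": [{"nodes": {"NODE0001": {"clade": "x"}}}],
})
print(sorted(merged["nodes"]))
```

Tip names are left untouched on the assumption that they are already unique across the input trees.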

[1] https://blab.github.io/ZEBOV/

trvrb commented 5 years ago

This is fantastic, James. Thanks for starting the discussion. I definitely agree with separating this out from augur export. This is an advanced function and I would like the augur export help not to get any more cluttered.

I like the name augur stitch.

There might be an issue with node data output in the above command line signature. Unless the idea is that traits.json, nt_muts.json, etc... all get combined into a single node data JSON which would make sense.

jameshadfield commented 5 years ago

Unless the idea is that traits.json, nt_muts.json, etc... all get combined into a single node data JSON which would make sense.

That was my idea, but it needs some thinking. Not sure what to do with properties such as alignment or clock... I don't think they're actually used by augur export, but still.

jameshadfield commented 5 years ago

Thinking of ways of allowing trees to replace nodes (which is actually what the --ancestral-relationship is doing now that I think about it) how about

    --tree-root <newick> [<json> <json> <json> ...]

for trees to be inserted at the root (with a polytomy if there are multiple of these) and

    --tree-node <node_name> <newick> [<json> <json> <json> ...]

Where <node_name> defines a terminal node which has to have appeared in an earlier tree, and will be replaced with this tree. This would allow something like

augur stitch
    --tree-root ebov_species_tree.new [<json> <json> <json> ...]
    --tree-node "West_Africa/2013" west_africa_ebola.new [<json> <json> <json> ...]
    --tree-node "Kivu/2018" drc_ebola.new [<json> <json> <json> ...]

(how auspice displays this is another matter ;) )
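A rough sketch of what --tree-node would do to the tree, using nested dicts in place of parsed Newick. The graft helper, the dict layout and the node names inside the grafted subtree are hypothetical; what happens to the replaced tip's own name and branch length is an open question that this sketch ignores.

```python
def graft(tree, tip_name, subtree):
    """Return a copy of `tree` with the tip called `tip_name` replaced by `subtree`."""
    replaced = 0

    def walk(node):
        nonlocal replaced
        children = node.get("children", [])
        # Only terminal nodes may be replaced.
        if not children and node["name"] == tip_name:
            replaced += 1
            return subtree
        new_node = dict(node)
        if children:
            new_node["children"] = [walk(child) for child in children]
        return new_node

    result = walk(tree)
    if replaced != 1:
        raise ValueError(f"expected exactly one tip named {tip_name!r}, found {replaced}")
    return result


# Toy trees; the names inside west_africa are invented.
species = {"name": "root", "children": [{"name": "West_Africa/2013"}, {"name": "Kivu/2018"}]}
west_africa = {"name": "WA_root", "children": [{"name": "WA_1"}, {"name": "WA_2"}]}
stitched = graft(species, "West_Africa/2013", west_africa)
print([child["name"] for child in stitched["children"]])
```

Requiring exactly one matching tip means the command fails loudly if the name is missing or ambiguous, rather than silently grafting in the wrong place.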

emmahodcroft commented 5 years ago

So partially related, but, I see the 'hidden' attribute as being pretty useful for bacterial stuff where there's recombination (so, most of it). We've run a few datasets like this, and the data at the tips is probably fairly reliable (and usually of most interest - is there an outbreak? where? when?), but deeper in the tree it's probably nonsense.

However, you can tell people this 100 times and they still tend to start commenting on it... It would be neat to be able to hide these deeper connections to really prevent this.

This is a bit different as it would be one tree that you then explode rather than stitching together multiple trees. I don't know if it would make sense to include something that allows this in augur stitch (which doesn't go with the name, really), have this somewhere else, or for the time being leave this as a trick that a user would have to code themselves.

I think possibly this is a bit aside from the main point here and can be left, but wanted to include how I see that I'd use it for future thinking :)

trvrb commented 5 years ago

@jameshadfield: I like --tree-root <newick> [<json> <json> <json> ...] vs --tree-node. Very clever. However, the signature

--tree-node <node_name> <newick> [<json> <json> <json> ...]

Where <node_name> defines a terminal node which has to have appeared in an earlier tree, and will be replaced with this tree.

seems to have an issue. If multiple refined Newicks make it to this point, specifying say NODE_00004 is likely not to be unique across the set of Newicks. I think you'd need to specify both the node name and the origin Newick.

@emmahodcroft: I think this is a great use case. This is basically the same thing I'm trying to accomplish with the stitched together flu trees. An alternative approach would have been to make concatenated FASTAs across all 8 flu segments and make a tree as normal, but then have some function to identify ancestral branches with too many homoplasies (or a distribution of homoplasies consistent with reassortment) and hide them.

Generally, we may want to think about how an augur explode function might also work, where it takes in a single tree and hides nodes. This is how you'd accomplish Figure 2 from https://bedford.io/papers/dudas-ebola-epidemic-spread/, where augur explode would look at the country trait and use transitions across this trait to hide nodes. Or maybe augur explode creates multiple downstream Newicks that can get stitched back together (or not and analyzed separately)?
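As a sketch of the first step such an augur explode might take, the function below walks a tree and collects the nodes whose trait value differs from their parent's. The nested-dict layout, the function name transition_nodes and the toy tree are assumptions for illustration; which of these nodes should end up hidden, or become the roots of separate Newicks, is left open.

```python
def transition_nodes(node, trait, parent_value=None, found=None):
    """Collect names of nodes whose `trait` differs from their parent's (pre-order)."""
    if found is None:
        found = []
    value = node[trait]
    if parent_value is not None and value != parent_value:
        found.append(node["name"])
    for child in node.get("children", []):
        transition_nodes(child, trait, value, found)
    return found


# Toy tree with made-up node names and country assignments.
tree = {
    "name": "NODE_0", "country": "Guinea", "children": [
        {"name": "NODE_1", "country": "Sierra Leone", "children": [
            {"name": "SL_1", "country": "Sierra Leone"},
            {"name": "LBR_1", "country": "Liberia"},
        ]},
        {"name": "GIN_1", "country": "Guinea"},
    ],
}
print(transition_nodes(tree, "country"))  # ['NODE_1', 'LBR_1']
```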

jameshadfield commented 5 years ago

but then have some function to identify ancestral branches with too many homoplasies (or a distribution of homoplasies consistent with reassortment) and hide them.

Yeah, I think an augur explode is a good idea here. One can simply define a hidden trait on nodes (e.g. with a script that marks the country transitions as hidden) and augur export will then do the rest (well, not currently). But this'll look like the same tree, only with a few branches missing from within it... what we actually want is to break all of those subtrees out and then stitch them back together at a polytomy.

jameshadfield commented 5 years ago

If multiple refined Newicks make it to this point, specifying say NODE_00004 is likely not to be unique across the set of Newicks. I think you'd need to specify both the node name and the origin Newick.

Yes, that's not good. Do we ever want to allow this? I was thinking you could only stitch trees into terminal nodes. Maybe there are cool use cases I'm not thinking of.

jameshadfield commented 5 years ago

We've run a few datasets like this, and the data at the tips is probably fairly reliable (and usually of most interest - is there an outbreak? where? when?), but deeper in the tree it's probably nonsense.

This ☝️ -- there are a lot of trees where the temporal information is rubbish further back in the tree and it'd be great to (a) find where this line is and (b) hide nodes older than it in temporal view. This wouldn't require augur explode as you want the same tree structure just with hidden nodes.
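Part (b) could be sketched as follows, assuming the cutoff from part (a) is already known and that inferred dates are available as a plain mapping from node name to decimal year. The function name hide_older_than, the node names and the dates are invented; "timetree" is the value mentioned earlier in this thread for hiding a node in the temporal view.

```python
def hide_older_than(node_dates, cutoff):
    """Return node data hiding, in the temporal view, every node dated before `cutoff`."""
    return {
        "nodes": {
            name: {"hidden": "timetree"}
            for name, date in node_dates.items()
            if date < cutoff
        }
    }


# Invented dates, in decimal years.
node_dates = {"NODE_0000000": 1850.3, "NODE_0000001": 1990.1, "tip_A": 2015.6}
node_data = hide_older_than(node_dates, cutoff=1950.0)
print(node_data)
```

Because only a node attribute changes, the tree structure is left untouched and the divergence view still shows every node.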
