nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

ENH: Augur clade "reconstruction from metadata" mode #1149

Open corneliusroemer opened 1 year ago

corneliusroemer commented 1 year ago

Context

augur clades currently places clades by finding the biggest branches on a tree that satisfy each clade's defining mutations, as provided by the user.

This method has a few downsides: to add a new clade, one needs to manually identify a stable set of mutations. Sometimes, due to artefacts or tree-building randomness, clades then do not appear where we would want them.

Description

There's an alternative method we could use to annotate clades: reconstructing from the tips inwards, using clade annotations on some or all of the tips, supplied in the form of a metadata.tsv.

This is essentially how Pango lineages are annotated on the SARS-CoV-2 Nextclade reference tree. All it takes is for the tips to be annotated with the clades one would want them to have.

This is easily implemented using treetime's discrete ancestral state reconstruction, already used in augur ancestral and augur traits.
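To make "reconstructing from the tips inwards" concrete, here is a minimal, dependency-free sketch. It is not treetime's method: Fitch parsimony is swapped in purely to keep the example self-contained, and the toy tree, the tip labels, the function name `reconstruct_clades` and the alphabetical tie-break are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy tree (parent -> children) and tip labels as they might be
# read from metadata.tsv; None marks a tip without a clade annotation.
TREE: Dict[str, List[str]] = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
TIP_CLADES: Dict[str, Optional[str]] = {"A": "22F", "B": "22F", "C": "23A", "D": None}


def reconstruct_clades(
    tree: Dict[str, List[str]],
    tip_clades: Dict[str, Optional[str]],
    root: str = "root",
) -> Dict[str, Optional[str]]:
    """Assign a clade to every node using Fitch parsimony (illustrative only)."""
    candidates: Dict[str, Set[str]] = {}

    def up(node: str) -> Set[str]:
        # Tips-inward pass: collect the candidate clades for each node.
        children = tree.get(node, [])
        if not children:
            label = tip_clades.get(node)
            candidates[node] = {label} if label is not None else set()
        else:
            # Unlabelled subtrees contribute no information and are skipped.
            child_sets = [s for s in [up(c) for c in children] if s]
            shared = set.intersection(*child_sets) if child_sets else set()
            candidates[node] = shared or set().union(*child_sets)
        return candidates[node]

    assigned: Dict[str, Optional[str]] = {}

    def down(node: str, parent_state: Optional[str]) -> None:
        # Root-outward pass: keep the parent's clade whenever it is allowed,
        # or whenever the node has no information of its own.
        options = candidates[node]
        if parent_state in options or not options:
            assigned[node] = parent_state
        else:
            # Arbitrary but deterministic tie-break (an assumption).
            assigned[node] = sorted(options)[0]
        for child in tree.get(node, []):
            down(child, assigned[node])

    up(root)
    down(root, None)
    return assigned


annotations = reconstruct_clades(TREE, TIP_CLADES)
```

In this toy case the unlabelled tip D inherits the clade of its sibling C through their shared parent, which is the behaviour that makes partial tip annotations usable.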

Use cases include:

It would be possible to pass a clade hierarchy to ensure clades are less often reversed. Rather than reconstructing one discrete state with as many values as there are clades, we could reconstruct as many binary states as there are clades. For example, a tip that has XBB.1.5 (23A) set to true would also be true for XBB (22F) but not vice versa. The most specific clade that is true is chosen as annotation. Conflicts need to be resolved with a heuristic, but are rare. This is the way the Nextclade Pango annotation works.
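The binary-state idea can be sketched in a few lines. The example below only covers the bookkeeping around the reconstruction: expanding a tip label into one boolean per clade, and picking the most specific true clade afterwards. The hierarchy mapping, the function names and the "deepest clade wins, ties broken by name" rule are hypothetical; the actual conflict heuristic is not specified in this issue.

```python
from typing import Dict, Optional

# Hypothetical clade hierarchy: clade -> parent clade (None for a top-level clade).
# 23A is XBB.1.5 and 22F is XBB, as in the example above.
CLADE_PARENT: Dict[str, Optional[str]] = {"22F": None, "23A": "22F"}


def depth(clade: str, parent: Dict[str, Optional[str]]) -> int:
    """Number of ancestors a clade has in the hierarchy."""
    steps = 0
    while parent[clade] is not None:
        clade = parent[clade]
        steps += 1
    return steps


def expand_tip_label(clade: str, parent: Dict[str, Optional[str]]) -> Dict[str, bool]:
    """Turn one tip label into binary states: true for the clade and all its ancestors."""
    states = {name: False for name in parent}
    current: Optional[str] = clade
    while current is not None:
        states[current] = True
        current = parent[current]
    return states


def most_specific_clade(
    states: Dict[str, bool], parent: Dict[str, Optional[str]]
) -> Optional[str]:
    """Pick the deepest clade whose binary state is true (placeholder heuristic)."""
    true_clades = [name for name, is_member in states.items() if is_member]
    if not true_clades:
        return None
    # Deepest clade wins; the name is only a deterministic tie-break.
    return max(true_clades, key=lambda name: (depth(name, parent), name))


xbb15_states = expand_tip_label("23A", CLADE_PARENT)  # true for 23A and 22F
xbb_states = expand_tip_label("22F", CLADE_PARENT)    # true for 22F only
```

Each binary state would then be reconstructed independently over the tree, and `most_specific_clade` applied to the per-node results to obtain a single annotation.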

emmahodcroft commented 1 year ago

I can see the use-case for this in some things, but for some I'm a little hesitant, as well. For example, I can totally see how we don't want to specify mutations for all Pango lineages, so just reconstructing this is probably fine, and easy. However, for things like Nextstrain clades, I think we probably do prefer when these are specified discretely with mutations so that we know exactly where they should fall. They're not fool-proof, we know that, especially if you pick a bad/weird mut, or have a super-weird seq. However, inferring backwards seems a little more prone to going 'haywire' to me.

That doesn't mean this still wouldn't be a useful feature, just that I think I'd exercise caution in when to implement it - and to have some recommendations perhaps on when others should use it.

emmahodcroft commented 1 year ago

When we have cases where we've used Nextclade to assign tips, we can expect this method to behave pretty well (because the 'tree' knowledge gets baked in earlier, in a way). However, you can imagine people just randomly taking labels from old studies/work, reconstructing, and basically ending up with a mess 😆 However, I recognise we can't protect every user from themselves!