nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

Inherited clade definitions #823

Closed (jameshadfield closed 2 years ago)

jameshadfield commented 2 years ago

Currently, clades are defined independently of one another in the provided TSV, so we often duplicate the mutations of a parent clade. For example, 21L is a descendant of 21M, so we have the following:

21M (Omicron)   nuc     23525   T
21M (Omicron)   nuc     23599   G
21L (Omicron)   nuc     23525   T  ## mutation actually defines 21M
21L (Omicron)   nuc     23599   G  ## mutation actually defines 21M
21L (Omicron)   nuc     24424   T

We should allow clades to be inherited, e.g.:

21M (Omicron)   nuc     23525   T
21M (Omicron)   nuc     23599   G
21L (Omicron)   clade   21M (Omicron)
21L (Omicron)   nuc     24424   T
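To make the proposed format concrete, here is a minimal parsing sketch. It assumes the file is tab-separated and that an inheritance row has exactly three columns with the literal `clade` in the second one, as in the example above. The function name `parse_clade_rows` and the returned structures are hypothetical and are not taken from augur.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input mirroring the rows above, assuming tab-separated columns.
EXAMPLE_TSV = (
    "21M (Omicron)\tnuc\t23525\tT\n"
    "21M (Omicron)\tnuc\t23599\tG\n"
    "21L (Omicron)\tclade\t21M (Omicron)\n"
    "21L (Omicron)\tnuc\t24424\tT\n"
)


def parse_clade_rows(handle):
    """Split rows into each clade's own mutations and its declared parent."""
    mutations = defaultdict(list)  # clade -> [(gene, site, alt), ...]
    parents = {}                   # clade -> parent clade name
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        clade, kind = row[0], row[1]
        if kind == "clade":
            # Inheritance row: the third column names the parent clade.
            if clade in parents:
                raise ValueError(f"{clade!r} declares more than one parent")
            parents[clade] = row[2]
        else:
            # Ordinary row: gene (or 'nuc'), 1-based site, alternate allele.
            mutations[clade].append((kind, int(row[2]), row[3]))
    return dict(mutations), parents


mutations, parents = parse_clade_rows(io.StringIO(EXAMPLE_TSV))
```

Allowing only one parent per clade is a design assumption of this sketch; the proposal above does not say whether multiple inheritance rows should be permitted.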

There are a few considerations here:

Related

Possible solution

There seem to be two possible implementations:

  1. Expand the TSV upon parsing, replacing each reference to a parent clade with the mutations of that parent clade.
  2. Only consider the subtree defined by the parent clade and then find the clade defined by the extra mutations (e.g. 24424T in the example above).
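The first option could be sketched roughly as follows. This is an illustration, not augur's implementation: the function name `expand_inherited` and the two input dictionaries (each clade's own mutations, and each clade's declared parent) are hypothetical. It resolves parents recursively so that chains of inheritance work, and it rejects circular definitions.

```python
def expand_inherited(mutations, parents):
    """Return clade -> full mutation list, with inherited mutations first."""
    expanded = {}

    def resolve(clade, seen):
        if clade in expanded:
            return expanded[clade]
        if clade in seen:
            raise ValueError(f"circular clade inheritance at {clade!r}")
        inherited = []
        parent = parents.get(clade)
        if parent is not None:
            if parent not in mutations and parent not in parents:
                raise ValueError(f"unknown parent clade {parent!r}")
            inherited = resolve(parent, seen | {clade})
        # Drop exact duplicates so legacy files that still repeat the
        # parent's mutations expand to the same result.
        own = [m for m in mutations.get(clade, []) if m not in inherited]
        expanded[clade] = inherited + own
        return expanded[clade]

    for clade in set(mutations) | set(parents):
        resolve(clade, frozenset())
    return expanded


# Hypothetical inputs matching the example above.
own_mutations = {
    "21M (Omicron)": [("nuc", 23525, "T"), ("nuc", 23599, "G")],
    "21L (Omicron)": [("nuc", 24424, "T")],
}
declared_parents = {"21L (Omicron)": "21M (Omicron)"}
full = expand_inherited(own_mutations, declared_parents)
```

The sketch does not handle a child that redefines a parent's site with a different allele; both entries would be kept, and deciding which one wins would need an explicit rule.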

I prefer solution 2, but I don't think the results will be different.
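For comparison, the second option might look like the toy sketch below. It uses a hypothetical `Node` class rather than augur's actual tree structures, and it assumes a clade is rooted at the first node (in preorder) whose accumulated alleles satisfy all required mutations. All names here are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # (gene, site) -> allele gained on the branch leading to this node
    branch_muts: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def find_clade_node(node, required, state=None):
    """Preorder search for the first node whose accumulated alleles
    satisfy every entry in `required`. Returns (node, state) or None."""
    state = {**(state or {}), **node.branch_muts}
    if all(state.get(pos) == alt for pos, alt in required.items()):
        return node, state
    for child in node.children:
        hit = find_clade_node(child, required, state)
        if hit is not None:
            return hit
    return None


def find_inherited_clade(root, parent_required, extra_required):
    """Locate the parent clade first, then search only its subtree
    for the child's extra mutations."""
    parent_hit = find_clade_node(root, parent_required)
    if parent_hit is None:
        return None
    parent_node, parent_state = parent_hit
    child_hit = find_clade_node(parent_node, extra_required, parent_state)
    return child_hit[0] if child_hit else None


# Toy tree: node C carries 24424T but sits outside the 21M subtree.
b = Node("B", {("nuc", 24424): "T"})
a = Node("A", {("nuc", 23525): "T", ("nuc", 23599): "G"}, [b])
c = Node("C", {("nuc", 24424): "T"})
root = Node("root", {}, [c, a])

child = find_inherited_clade(
    root,
    {("nuc", 23525): "T", ("nuc", 23599): "G"},  # 21M
    {("nuc", 24424): "T"},                       # extra mutation for 21L
)
```

In this toy tree the search returns node B, and node C is never considered because it lies outside the parent clade's subtree.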

corneliusroemer commented 2 years ago

I will implement solution 1, since it's straightforward and simplifies the clade.tsv.

We can always switch to solution 2 later.