nextstrain / seasonal-flu

Scripts, config, and snakefiles for seasonal-flu Nextstrain builds
Explore use of IQ-TREE's constraint tree option #79

Open joverlee521 opened 2 years ago

joverlee521 commented 2 years ago

@corneliusroemer experimented with IQ-TREE's constraint tree option to prevent IQ-TREE from putting clades in the wrong place in Nextclade reference trees. This seems like a good way to keep clades correctly placed, especially in 6m builds that may lack context sequences.

One issue brought up in the initial Slack thread:

One issue to sort out would be the delimiter in sequence names. Right now IQ-TREE renames every / to some weird string.
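One possible way around the delimiter problem, sketched here under assumptions, is to swap every sequence name for a neutral placeholder before tree building and map the placeholders back afterwards. The function names and the `TAXON000000` placeholder scheme are hypothetical and not part of the existing workflow. The same mapping would also have to be applied to the names in the constraint tree.

```python
import re

def make_placeholders(names):
    """Map each original name to a delimiter-free placeholder, and back.

    The TAXON###### scheme is a hypothetical choice for illustration.
    """
    to_safe = {name: f"TAXON{i:06d}" for i, name in enumerate(names)}
    to_orig = {safe: name for name, safe in to_safe.items()}
    return to_safe, to_orig

def restore_names(newick, to_orig):
    """Replace every placeholder in a Newick string with its original name."""
    return re.sub(r"TAXON\d{6}", lambda m: to_orig[m.group(0)], newick)

# Example with two made-up strain names containing "/"
names = ["A/Texas/50/2012", "A/Hong_Kong/4801/2014"]
to_safe, to_orig = make_placeholders(names)
tree_with_placeholders = "(TAXON000000:0.1,TAXON000001:0.2);"
restored = restore_names(tree_with_placeholders, to_orig)
```

Because the placeholders contain only letters and digits, the tree builder has nothing to rewrite, and the round trip does not depend on how any particular tool escapes special characters.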

corneliusroemer commented 2 years ago

This is how it's used right now in the SC2 reference tree workflow:

Simply add the constraint tree file path after -g in the tree builder args: https://github.com/neherlab/nextclade_data_workflows/blob/09be86c1718ffab2deed7060c3f7a70c135c530d/sars-cov-2/defaults/parameters.yaml#L22

And this is the hand-coded tree: https://github.com/neherlab/nextclade_data_workflows/blob/feat/gisaid-v2/sars-cov-2/defaults/constraint.nwk
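As a rough illustration of what "add the path after -g" amounts to, the sketch below assembles an `augur tree` call that forwards `-g <constraint tree>` to IQ-TREE through `--tree-builder-args`. The helper names and all file paths are hypothetical, and the actual flu workflow may wire this up differently (for example through a config value rather than a Python helper).

```python
def iqtree_constraint_args(constraint_path, extra_args=""):
    """Build the tree-builder argument string, ending with '-g <path>'."""
    return f"{extra_args} -g {constraint_path}".strip()

def augur_tree_command(alignment, output, constraint_path, extra_args=""):
    """Return an argv list for `augur tree` using IQ-TREE with a constraint tree.

    All paths passed in here are placeholders chosen by the caller.
    """
    return [
        "augur", "tree",
        "--alignment", alignment,
        "--method", "iqtree",
        "--tree-builder-args", iqtree_constraint_args(constraint_path, extra_args),
        "--output", output,
    ]

# Hypothetical file names, for illustration only
cmd = augur_tree_command("aligned.fasta", "tree_raw.nwk", "constraint.nwk")
```

The list form can be handed to `subprocess.run(cmd, check=True)` without shell quoting concerns.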

In the flu case, there are two options:

  1. Either you get (synthetic) prototypical sequences for each clade with constant names, like 2A, 2A.1 etc. (similar to the SC2 workflow), and hand-code a short Newick tree with the right topology
  2. Or you generate a constraint tree from actual sequence names, based on the topology as revealed through clade-hierarchies or hand-coded in a Newick tree that's read in with Biopython's Bio.Phylo

Both approaches should work.
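To make the second option concrete, here is a minimal sketch that builds a constraint tree from a clade hierarchy plus a clade assignment for each sequence. It uses plain string handling rather than Bio.Phylo to stay dependency-free. The input structures (`parent_of`, `tips_by_clade`) and the sequence names are assumptions for illustration, and names are assumed to be already free of characters that need Newick quoting. As far as I understand, IQ-TREE accepts multifurcating constraint trees that cover only a subset of taxa, which is what this produces.

```python
from collections import defaultdict

def constraint_newick(parent_of, tips_by_clade):
    """Build a multifurcating Newick constraint tree.

    parent_of:     clade -> parent clade (None for a root clade)
    tips_by_clade: clade -> sequence names assigned directly to that clade
    """
    children = defaultdict(list)
    roots = []
    for clade, parent in parent_of.items():
        if parent is None:
            roots.append(clade)
        else:
            children[parent].append(clade)

    def subtree(clade):
        # Tips assigned directly to this clade sit next to its subclades.
        parts = sorted(tips_by_clade.get(clade, []))
        for child in sorted(children[clade]):
            sub = subtree(child)
            if sub:  # skip subclades without any sequences
                parts.append(sub)
        if not parts:
            return ""
        if len(parts) == 1:
            return parts[0]  # avoid single-child groups like "(x)"
        return "(" + ",".join(parts) + ")"

    top = [s for s in (subtree(r) for r in sorted(roots)) if s]
    if not top:
        raise ValueError("no sequences assigned to any clade")
    body = top[0] if len(top) == 1 else "(" + ",".join(top) + ")"
    return body + ";"

# Hypothetical hierarchy and sequence names
parent_of = {"2A": None, "2A.1": "2A", "2A.2": "2A"}
tips_by_clade = {"2A": ["s1"], "2A.1": ["s2", "s3"], "2A.2": ["s4", "s5"]}
newick = constraint_newick(parent_of, tips_by_clade)
```

Only subclade membership is constrained here. Sequences assigned directly to a parent clade are left as siblings of its subclades, so IQ-TREE remains free to resolve their placement within that clade.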