nextstrain / seasonal-flu

Scripts, config, and snakefiles for seasonal-flu nextstrain builds

Add representative samples from early clades to "broad" H1N1pdm HA Nextclade dataset #172

Closed huddlej closed 2 months ago

huddlej commented 2 months ago

Description of proposed changes

Extends the min date for the "broad" H1N1pdm HA dataset from 2014 to 2009 and adds manually curated representative strain names for early clades 2, 3, 4, 7, and 8 to the "force-include" list for the Nextclade workflow. These changes allow the broad Nextclade dataset to represent most early clades (except clade 1) such that early sequences can be properly assigned to those clades.

Forcing inclusion of representative strains works around the workflow's filter that drops QC=bad sequences, where QC is based on the more recent Nextclade dataset. Since that dataset lacks early clades, early sequences from those clades map to the newer tree with too many private mutations and get flagged with bad QC. A better approach could be to run Nextclade with the "broad" dataset for each lineage to minimize the number of false-positive bad QC labels, but that is for a future discussion/PR.
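As a rough illustration of the selection logic described above, here is a minimal Python sketch. It is not the workflow's actual implementation: the function `select_strains`, the record fields, and the strain names are all hypothetical, and it only assumes that force-included strains bypass both the min date and the QC filter.

```python
from datetime import date


def select_strains(records, min_date, force_include):
    """Return strain names to keep (hypothetical helper, not from the workflow).

    Strains named in `force_include` are always kept. All other strains
    must be on or after `min_date` and must not have QC status "bad".
    """
    kept = []
    for record in records:
        if record["strain"] in force_include:
            # Force-included strains skip the date and QC filters.
            kept.append(record["strain"])
        elif record["date"] >= min_date and record["qc"] != "bad":
            kept.append(record["strain"])
    return kept


# Made-up records for illustration only.
records = [
    {"strain": "early_clade_rep", "date": date(2010, 3, 1), "qc": "bad"},
    {"strain": "early_other", "date": date(2010, 6, 1), "qc": "bad"},
    {"strain": "recent_good", "date": date(2019, 1, 15), "qc": "good"},
    {"strain": "too_old", "date": date(2008, 5, 1), "qc": "good"},
]

selected = select_strains(
    records,
    min_date=date(2009, 1, 1),
    force_include={"early_clade_rep"},
)
print(selected)
```

In this sketch the early representative strain survives despite its bad QC label, while an early strain that is not on the force-include list is still dropped.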

The following image shows the updated tree with clades 2, 3, 4, 6C, 7, and 8 represented by multiple sequences:

image

After adding clade 1 to the H1 HA definitions, I added representative clade 1 samples to be force-included in the broad H1 HA Nextclade dataset and rebuilt the tree. The updated tree looks like this with clade 1 as the MRCA instead of clade 2:

image

Related issue(s)

Closes #171

Checklist