Description of proposed changes
Extends the minimum date for the "broad" H1N1pdm HA dataset from 2014 to 2009 and adds manually curated representative strain names for early clades 2, 3, 4, 7, and 8 to the "force-include" list for the Nextclade workflow. With these changes, the broad Nextclade dataset represents most early clades (all except clade 1), so early sequences can be properly assigned to those clades.
Force-including representative strains works around the workflow's filter that drops QC=bad sequences, where QC is based on the more recent Nextclade dataset. Because that dataset lacks early clades, early sequences from those clades map to the newer tree with too many private mutations and get flagged with bad QC. A better approach could be to run Nextclade with the "broad" dataset for each lineage to minimize the number of false-positive bad QC labels, but that is for a future discussion/PR.
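To illustrate the workaround, here is a minimal sketch of the selection logic, not the workflow's actual code. The function name, record fields, and strain names are all hypothetical, and it assumes force-included strains bypass both the QC filter and the minimum-date filter.

```python
from datetime import date


def select_strains(records, min_date, force_include):
    """Return names of strains kept for the tree.

    A strain is kept if it is force-included, or if it passes both
    the QC filter and the minimum-date filter. (Assumption: the
    force-include list bypasses every other filter.)
    """
    kept = []
    for rec in records:
        forced = rec["strain"] in force_include
        passes = rec["qc"] != "bad" and rec["date"] >= min_date
        if forced or passes:
            kept.append(rec["strain"])
    return kept


# Hypothetical records; the strain names are placeholders, not real isolates.
records = [
    # Early-clade representative: flagged bad QC against the recent dataset.
    {"strain": "early_rep", "date": date(2009, 6, 1), "qc": "bad"},
    # Early sequence with bad QC that is not on the force-include list.
    {"strain": "early_other", "date": date(2010, 1, 15), "qc": "bad"},
    # Recent sequence with good QC.
    {"strain": "recent_good", "date": date(2019, 3, 2), "qc": "good"},
    # Sequence older than the minimum date.
    {"strain": "too_old", "date": date(2008, 12, 31), "qc": "good"},
]

kept = select_strains(
    records,
    min_date=date(2009, 1, 1),
    force_include={"early_rep"},
)
```

In this sketch only `early_rep` survives among the bad-QC sequences, which is why each early clade needs its own curated representatives on the list.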
The following image shows the updated tree with clades 2, 3, 4, 6C, 7, and 8 represented by multiple sequences:
After adding clade 1 to the H1 HA definitions, I added representative clade 1 samples to the force-include list for the broad H1 HA Nextclade dataset and rebuilt the tree. The updated tree now has clade 1, instead of clade 2, as the MRCA:
Related issue(s)
Closes #171
Checklist