nextstrain / seasonal-flu

Scripts, config, and snakefiles for seasonal-flu nextstrain builds

Nextclade dataset for "broad" H1N1pdm HA misassigns early clade labels #171

Closed huddlej closed 4 months ago

huddlej commented 4 months ago

Current behavior

In a recent project, we have been assigning clade labels to early H1N1pdm HA sequences (circa 2009-2014) with Nextclade using the "broad" H1N1pdm HA dataset (flu_h1n1pdm_ha_broad). In this process, we discovered that sequences from early clades like 6C, 7, or 8 were getting assigned incorrect labels. Specifically, 6C sequences were getting assigned as clade 6.

Looking at the tree for this broad dataset in Nextclade, it is clear that early clades are not well-represented. For example, clades 6C and 7 each have only a single sequence in the current dataset tree, and clade 8 does not appear in the tree at all.

[Image: tree for the broad H1N1pdm HA dataset in Nextclade]

To test the clade misassignment, I downloaded all 6C sequences represented in the current H1N1pdm HA 12y tree, and ran Nextclade with the broad dataset on these sequences. Of the 72 total sequences, 69 (96%) got assigned to clade 6 and only 3 were assigned properly to 6C.
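A tally like the one above can be reproduced with a short script. This is a minimal sketch, assuming Nextclade's tabular (TSV) output with its `clade` column; the helper names `tally_clades` and `summarize` are hypothetical and not part of this repo.

```python
import csv
import io
from collections import Counter


def tally_clades(tsv_text, clade_column="clade"):
    """Count how many sequences were assigned to each clade.

    `tsv_text` is the full text of a Nextclade TSV output file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[clade_column] for row in reader)


def summarize(counts, expected_clade):
    """Return (correct, total, percent_misassigned) for the expected clade."""
    total = sum(counts.values())
    correct = counts.get(expected_clade, 0)
    percent_misassigned = 100 * (total - correct) / total if total else 0.0
    return correct, total, percent_misassigned
```

Reading the TSV produced by Nextclade into a string and calling `summarize(tally_clades(text), "6C")` would give the number of correctly assigned sequences, the total, and the percentage misassigned. For the test described above, that corresponds to 3 correct out of 72, or about 96% misassigned.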

Expected behavior

The "broad" datasets for H1N1pdm HA and NA should have representative samples from the earliest clades starting in 2009. The following earliest historical Nextstrain tree in S3 shows that we should have at least clades 1, 2, 3, 4, and 8 in the broad tree:

[Image: earliest historical Nextstrain tree in S3]

Possible solution

It turns out that the "broad" dataset's subsampling scheme has a min date of 2014, which explains why the early clades are not well-represented. A simple fix would be to change that min date to 2009.
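The effect of the min date can be illustrated with a small sketch. This is not the workflow's actual subsampling code: the function `keep_by_min_date` is hypothetical, the two records are made-up placeholders rather than real strains, and treating "2014" and "2009" as January 1 of those years is an assumption.

```python
from datetime import date


def keep_by_min_date(records, min_date):
    """Keep only records collected on or after `min_date`."""
    return [r for r in records if r["date"] >= min_date]


# Made-up placeholder records for illustration only.
records = [
    {"name": "early_sample", "date": date(2010, 6, 1)},
    {"name": "recent_sample", "date": date(2016, 6, 1)},
]

# Current behavior: a min date of 2014 drops the early sample.
current = keep_by_min_date(records, date(2014, 1, 1))

# Proposed behavior: a min date of 2009 keeps both samples.
proposed = keep_by_min_date(records, date(2009, 1, 1))
```

Any sequence collected before the min date never reaches the subsampling step, so clades that circulated mainly before 2014 can only appear in the tree through the occasional later sample.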

The current long clade definitions for H1N1pdm HA include clades 2, 3, 4, 6, 7, and 8, so we don't have a way to assign clade 1 in the current definitions. The proposed change above would at least improve early clade representation. We could always update the clade definitions or the Nextclade workflow in this repo to assign clade 1 as the clade for the MRCA.