Context
From our work in Nanduri et al., we developed the pathogen-embed tools to project seasonal flu alignments into low-dimensional representations and identify clusters of genetically related sequences. We can use these tools to jointly embed alignments from multiple genes like HA and NA and identify putative reassortment events. The pathogen-embed package is now part of the Nextstrain Docker and Conda environments, so we can easily run these tools from our seasonal flu workflows.
Description
Add rules to the core seasonal flu workflow to annotate HA and NA trees with t-SNE embedding coordinates (tsne_x and tsne_y), computed with pathogen-distance and pathogen-embed, and with labels of the clusters identified by pathogen-cluster (tsne_label). Calculate distances for each gene segment individually, then produce a single t-SNE embedding from all distances and alignments together using the optimal settings from Nanduri et al. Finally, produce clusters using the settings that the same work found optimal for Nextstrain clades.
[ ] Calculate genetic distances per gene alignment with pathogen-distance
[ ] Generate t-SNE embedding with all gene alignments and distances with pathogen-embed
[ ] Generate clusters from t-SNE embedding with pathogen-cluster
[ ] Convert clusters and embedding TSV to node data JSON
[ ] Annotate all gene trees with clusters and embeddings
[ ] Update Auspice config JSONs to include colorings for the cluster label and embedding fields
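As a rough illustration of the last three tasks, here is a minimal Python sketch that converts an embedding and cluster table into a node data structure and registers the new fields as Auspice colorings. It assumes, rather than confirms, that the table is tab-delimited with a `strain` column plus `tsne_x`, `tsne_y`, and `tsne_label` columns. The function names and coloring titles are hypothetical and not part of pathogen-embed or the flu workflow.

```python
import csv
import io
import json


def embedding_tsv_to_node_data(tsv_text, coordinate_fields=("tsne_x", "tsne_y"),
                               label_field="tsne_label", strain_field="strain"):
    """Build a node data dict ({"nodes": {strain: {...}}}) from TSV text.

    The column names are assumptions about the embedding/cluster table.
    """
    nodes = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Coordinates become floats so they can be used as continuous colorings.
        record = {field: float(row[field]) for field in coordinate_fields}
        # The cluster label stays a string so it is treated as a category.
        record[label_field] = row[label_field]
        nodes[row[strain_field]] = record
    return {"nodes": nodes}


def add_embedding_colorings(auspice_config):
    """Append colorings for the embedding fields unless the key already exists."""
    colorings = auspice_config.setdefault("colorings", [])
    existing_keys = {coloring["key"] for coloring in colorings}
    new_colorings = [
        {"key": "tsne_x", "title": "t-SNE x", "type": "continuous"},
        {"key": "tsne_y", "title": "t-SNE y", "type": "continuous"},
        {"key": "tsne_label", "title": "t-SNE cluster", "type": "categorical"},
    ]
    for coloring in new_colorings:
        if coloring["key"] not in existing_keys:
            colorings.append(coloring)
    return auspice_config


if __name__ == "__main__":
    # Made-up example rows, for illustration only.
    example_tsv = (
        "strain\ttsne_x\ttsne_y\ttsne_label\n"
        "sample_1\t1.5\t-2.0\t0\n"
        "sample_2\t0.25\t3.0\t-1\n"
    )
    print(json.dumps(embedding_tsv_to_node_data(example_tsv), indent=2))
    print(json.dumps(add_embedding_colorings({"colorings": []}), indent=2))
```

The same node data JSON could then be passed to the export step for both the HA and NA trees, since the joint embedding gives each strain one set of coordinates and one label regardless of segment.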