nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Improve serotype assignment in Dengue virus DENVx genotypes datasets #70

Open j23414 opened 4 weeks ago

j23414 commented 4 weeks ago

Context

As flagged by @rneher in a Slack message, the clade assignments in the Dengue virus DENVx genotypes datasets could be further improved. For example, for DENV1:

  1. DENV2 samples that align are correctly placed onto the outgroup node and marked as unassigned. (good!)
  2. However, DENV1 samples that don't belong to an annotated genotype are also marked as unassigned, which is arguably incorrect. (This could be improved!) An example is shown below:

image

Description

These samples should be assigned to the DENV1 serotype without a specific genotype, rather than being marked as unassigned. To illustrate this group of samples visually, the aim is to reduce the number of samples in the magenta region of the table:

Screenshot 2024-06-25 at 9 50 29 AM

Possible solution

To ensure accurate serotype assignment while allowing for true-negative genotype assignments, I'm currently planning the following steps:

  1. In the dengue/all tree, identify the amino acid mutations from the dengue/all reconstructed root to the reconstructed root of each serotype.
  2. In each dengue/denv* tree, locate the amino acid mutations from the serotype reconstructed root to the outgroup dengue/all reconstructed root, and correct the coordinates accordingly.
  3. Add the corrected coordinates of the amino acid mutations to each of the clades_genotype_denv*.tsv files, using the serotype name (e.g., DENV1) as the identifier.
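
As a rough illustration of steps 1 and 3, here is a minimal Python sketch, not the actual implementation. It assumes the tree is an Auspice-style nested dict in which each node carries `branch_attrs.mutations` keyed by gene (with `nuc` for nucleotide changes), and that the clades files use the `clade`, `gene`, `site`, `alt` columns read by `augur clades`. The node names (`NODE_DENV1`, etc.), the toy mutations, and the single `site_offset` used for coordinate correction are all hypothetical; the real correction in step 2 may need to be gene-specific.

```python
import re

# Amino acid mutation in "<ref><site><alt>" form, e.g. "S123T".
MUTATION_RE = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def mutations_to_node(node, target_name, path=()):
    """Collect (gene, mutation) pairs on the branches from the root down to
    the node named target_name. Returns None if the node is not found.
    Recursive, so very deep trees may need an iterative version."""
    muts = node.get("branch_attrs", {}).get("mutations", {})
    here = path + tuple(
        (gene, m)
        for gene, gene_muts in muts.items()
        if gene != "nuc"  # keep amino acid mutations only
        for m in gene_muts
    )
    if node.get("name") == target_name:
        return list(here)
    for child in node.get("children", []):
        found = mutations_to_node(child, target_name, here)
        if found is not None:
            return found
    return None


def to_clade_rows(clade, gene_mutations, site_offset=0):
    """Turn (gene, mutation) pairs into (clade, gene, site, alt) rows.
    site_offset is a hypothetical stand-in for the coordinate correction."""
    rows = []
    for gene, mut in gene_mutations:
        match = MUTATION_RE.match(mut)
        if match is None:
            raise ValueError(f"unrecognised mutation: {mut!r}")
        _ref, site, alt = match.groups()
        rows.append((clade, gene, int(site) + site_offset, alt))
    return rows


# Toy tree with made-up node names and mutations, for illustration only.
tree = {
    "name": "ROOT",
    "children": [
        {
            "name": "NODE_DENV1",
            "branch_attrs": {"mutations": {"nuc": ["A10G"], "E": ["S123T"]}},
            "children": [],
        },
        {
            "name": "NODE_DENV2",
            "branch_attrs": {"mutations": {"E": ["K45R"]}},
        },
    ],
}

rows = to_clade_rows("DENV1", mutations_to_node(tree, "NODE_DENV1"))
for row in rows:
    print("\t".join(map(str, row)))
```

Note that this simply concatenates the mutations along the path, so a site that mutates and later reverts would still appear; the real rows would need to be checked against the reconstructed root sequences before being added to the clades_genotype_denv*.tsv files.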

After implementing these changes:

Of course, open to other suggestions or guidance here.