nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Fine tuning the Nextclade all dataset #58

Closed j23414 closed 1 month ago

j23414 commented 1 month ago

Description of proposed changes

When testing the accuracy of the dengue/all (serotype-level) dataset, several people noticed a pattern of false-positive DENV4 classifications. This PR fixes most of them.

The dengue/all dataset was improved by:

  1. Adding some "--penalty-gap-*" alignment parameters to favor contiguous alignments over gappy ones.
  2. Adding a reconstructed root to the dengue/all tree.
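
To illustrate why gap penalties in item 1 push an aligner toward contiguous alignments, here is a minimal sketch of affine gap scoring. It is not Nextclade's implementation, and the penalty values, sequences, and function names (`count_gap_runs`, `gap_cost`) are hypothetical; the values actually used in this PR are not listed here. The sketch assumes each run of gaps pays an opening penalty once and each gap character pays an extension penalty.

```python
def count_gap_runs(aligned: str, gap: str = "-") -> tuple[int, int]:
    """Return (number of gap runs, total gap characters) in one aligned row."""
    runs = 0
    total = 0
    in_gap = False
    for ch in aligned:
        if ch == gap:
            total += 1
            if not in_gap:
                runs += 1  # a new run of gaps starts here
            in_gap = True
        else:
            in_gap = False
    return runs, total


def gap_cost(aligned: str, gap_open: int, gap_extend: int) -> int:
    """Affine gap cost under the assumed convention:
    every gap run pays gap_open once, every gap character pays gap_extend."""
    runs, total = count_gap_runs(aligned)
    return runs * gap_open + total * gap_extend


# Two hypothetical aligned rows, each with six gap characters.
contiguous = "ACGT------ACGT"  # one run of six gaps
gappy = "A-C-G-TA-C-G-T"       # six runs of one gap each

# Illustrative penalty values only.
for gap_open in (0, 6):
    print(
        f"gap_open={gap_open}:",
        "contiguous =", gap_cost(contiguous, gap_open, gap_extend=1),
        "gappy =", gap_cost(gappy, gap_open, gap_extend=1),
    )
```

With no opening penalty the two rows cost the same, so the aligner has no reason to prefer one over the other. Once opening a gap is expensive, the scattered gaps cost far more than the single contiguous one.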

This fix was inspired by feedback from multiple sources and by the mpox codebase.

j23414 commented 1 month ago

Because I'm seeing similar issues with the genotype-level datasets (e.g. denv4), I'm going to either expand the scope of this PR to also fix DENV1-4 or split those fixes out into separate PRs.

I'll start by adding outgroups/reconstructed ancestral sequences for each genotype-level dataset, perhaps as suggested by: https://github.com/nextstrain/nextclade_data/pull/203#issuecomment-2143765789

j23414 commented 1 month ago

Incorporated the edits and summarized them in https://github.com/nextstrain/nextclade_data/pull/203#issuecomment-2147990229. This is ready for review and dataset evaluation.

j23414 commented 1 month ago

The genotype-level datasets require further improvement to meet the desired standards. However, the serotype-level dataset is functioning as expected.

To lock in the progress made on the serotype-level dataset, I propose merging these changes and moving the work on the genotype-level datasets to a new PR.

This keeps completed work separate from work that still needs refinement.