Closed j23414 closed 1 month ago
Because I'm seeing similar issues with the genotype-level datasets (e.g. denv4), I'm going to either expand the scope of this PR to also fix DENV1-4 or split those out into separate PRs.
I'll start by adding outgroups/reconstructed ancestral sequences for each genotype-level dataset, perhaps as suggested by: https://github.com/nextstrain/nextclade_data/pull/203#issuecomment-2143765789
Incorporated the edits and summarized them in: https://github.com/nextstrain/nextclade_data/pull/203#issuecomment-2147990229
This is ready for review and dataset evaluation.
The genotype-level datasets require further improvement to meet the desired standards. However, the serotype-level dataset is functioning as expected.
To solidify the progress made with the serotype-level dataset, I move to merge the changes and shift the focus to enhancing the genotype-level datasets in a new PR.
This approach helps me avoid mixing completed tasks with those that still need refinement.
Description of proposed changes
When testing the dengue/all (serotype-level) dataset for accuracy, multiple people realized there was a trend of false-positive DENV4 classification. This PR mostly fixes that.
The dengue/all dataset was improved by:
all tree.
This fix was inspired by multiple sources of feedback, and the mpox codebase.
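For reviewers evaluating the dataset, here is a minimal sketch of how a false-positive DENV4 rate could be quantified. It assumes you already have pairs of (expected serotype, assigned serotype) for a set of test sequences. The function name `false_positive_rate` and the example pairs are hypothetical and made up for illustration; they are not results from this PR or part of the Nextclade tooling.

```python
def false_positive_rate(records, target="DENV4"):
    """Fraction of non-target sequences that were assigned to `target`.

    `records` is an iterable of (expected, assigned) label pairs.
    This helper is hypothetical and not part of Nextclade.
    """
    records = list(records)
    # Sequences whose expected label is anything other than the target.
    negatives = [(exp, got) for exp, got in records if exp != target]
    if not negatives:
        return 0.0
    # False positives: expected something else, but assigned the target.
    false_positives = sum(1 for _, got in negatives if got == target)
    return false_positives / len(negatives)


# Illustrative, made-up pairs (expected, assigned); not real evaluation data.
example = [
    ("DENV1", "DENV1"),
    ("DENV2", "DENV4"),
    ("DENV3", "DENV3"),
    ("DENV4", "DENV4"),
    ("DENV1", "DENV4"),
]

rate = false_positive_rate(example)
print(f"False-positive DENV4 rate: {rate:.2f}")
```

Running the same calculation on assignments made before and after a dataset change gives a rough way to check whether the false-positive trend has been reduced.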
Related issue(s)
Checklist