nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Nextclade assignment #16

Closed j23414 closed 6 months ago

j23414 commented 9 months ago

Description of proposed changes

This PR continues the work of breaking the dengue ingest changes in PR#6 into more focused and manageable pull requests. Following the merge of copy ingest, this PR focuses on splitting the dengue sequences by serotype (e.g. sequences_denv1.fasta to sequences_denv4.fasta).

Previously for dengue, we had been fetching each serotype separately in this code, leading to redundant fetching and processing of each sequence. Additionally, this method posed the risk of missing sequences in individual serotype builds if NCBI did not annotate the record with the lineage ID.

Therefore, as an alternative, this PR proposes to leverage Nextclade assignment to categorize the records into the major dengue serotypes, and subsequent subtypes.
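
As a rough illustration of the idea (not the code in this PR), splitting records by Nextclade assignment could look like the sketch below. It assumes a Nextclade-style TSV with `seqName` and `clade` columns; the clade labels (`DENV1`, `DENV4`), the sample sequences, and the function names are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (name, sequence) tuples from a FASTA text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the record name.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_assignments(handle, name_col="seqName", clade_col="clade"):
    """Map sequence name -> assigned clade from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[name_col]: row[clade_col] for row in reader}


def split_by_serotype(records, assignments, unassigned="unassigned"):
    """Group records by assignment; records without one go to `unassigned`."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[assignments.get(name) or unassigned].append((name, seq))
    return dict(groups)


# Hypothetical in-memory inputs standing in for the real files.
FASTA = ">A\nACGT\nAC\n>B\nGGCC\n>C\nTTAA\n"
NEXTCLADE_TSV = "seqName\tclade\nA\tDENV1\nB\tDENV4\n"

groups = split_by_serotype(
    read_fasta(io.StringIO(FASTA)),
    read_assignments(io.StringIO(NEXTCLADE_TSV)),
)
```

Keeping an explicit `unassigned` group makes it visible when a record receives no assignment, rather than silently dropping it from every serotype build.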

The primary scope of this PR includes:

Related issue(s)

Checklist

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout nextclade_assignment
nextstrain build ingest results/metadata_denv4.tsv

corneliusroemer commented 8 months ago

Looking at the PR: since you already have working v2 datasets, I think it's reasonable to continue using nextclade2 for now (you can add the 2 to the command name to future-proof).

As the purpose of this PR is solely to split sequences into serotypes, there's no need for v3 features. The transition from v2 to v3 can happen once this is merged, to reduce complexity.

j23414 commented 6 months ago

Rebased after switching to relying on NCBI virus-tax-id to split serotypes.

The subsequent nextclade subtype classification looks much cleaner.
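
For readers unfamiliar with the tax-ID approach, here is a minimal sketch of splitting metadata rows by NCBI taxonomy ID. It is not the code from this branch: the column name `virus_tax_id`, the function name, and the sample rows are assumptions, and the taxon IDs (11053, 11060, 11069, 11070 for dengue virus 1 to 4; 12637 for the general dengue entry) should be checked against NCBI Taxonomy before use.

```python
import csv
import io

# Assumed NCBI taxonomy IDs for the four dengue serotypes (verify before use).
TAXID_TO_SEROTYPE = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def split_metadata_by_taxid(handle, taxid_col="virus_tax_id"):
    """Group TSV metadata rows by serotype using the tax ID column.

    Rows whose tax ID is missing or not serotype-specific are returned
    separately so they can be inspected instead of being lost.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    groups = {serotype: [] for serotype in TAXID_TO_SEROTYPE.values()}
    skipped = []
    for row in reader:
        taxid = (row.get(taxid_col) or "").strip()
        serotype = TAXID_TO_SEROTYPE.get(taxid)
        if serotype is None:
            skipped.append(row)
        else:
            groups[serotype].append(row)
    return groups, skipped


# Hypothetical metadata; X3 carries only the general dengue tax ID.
METADATA = "accession\tvirus_tax_id\nX1\t11053\nX2\t11070\nX3\t12637\n"

groups, skipped = split_metadata_by_taxid(io.StringIO(METADATA))
```

Rows that land in `skipped` are the ones a classifier such as Nextclade would still need to place, since the tax ID alone does not identify their serotype.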

[Image: Screenshot 2024-02-14 at 11 27 22 AM]

j23414 commented 6 months ago

After chatting with people, I'm going to fixup the commits and merge since this branch fixes DENV2/AII (aka DENV2/IV) assignment: https://github.com/nextstrain/dengue/issues/28#issuecomment-1957257715