nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Split by serotype using NCBI virus_tax_id #20

Closed j23414 closed 6 months ago

j23414 commented 7 months ago

Description of proposed changes

Split records by serotype using the virus-tax-id field from NCBI Datasets. Previously, this code fetched each serotype separately, which led to redundant fetching and processing of each sequence. Now that we use NCBI Datasets, each record carries a numeric taxon ID in its virus-tax-id field, which we can use to separate the serotypes.


Changes in this PR include:

  1. Pull virus-tax-id and virus-name from the dengue NCBI dataset
  2. Use the virus-tax-id to infer the ncbi_serotype field of each record
  3. Use the ncbi_serotype field to split sequences and metadata files into serotypes
  4. Update target files in the Snakefile to reflect the new serotype split
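
As a rough illustration of steps 2 and 3, the sketch below maps a virus-tax-id value to a serotype label and groups records by it. It is not the code from this PR: the function names, the `denv1` to `denv4` labels, the `unassigned` bucket, and the record layout are hypothetical, and the taxon IDs in the table are assumed NCBI Taxonomy IDs for the four dengue serotypes that should be checked against NCBI before use.

```python
from collections import defaultdict

# Assumed NCBI Taxonomy IDs for the four dengue serotypes (verify against NCBI).
# The serotype labels on the right are illustrative, not taken from the PR.
SEROTYPE_BY_TAX_ID = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def infer_serotype(virus_tax_id):
    """Return a serotype label for a taxon ID, or '' if it is not a known serotype."""
    return SEROTYPE_BY_TAX_ID.get(str(virus_tax_id).strip(), "")


def split_by_serotype(records):
    """Group metadata records (dicts) by the serotype inferred from 'virus-tax-id'.

    Each record is copied with an added 'ncbi_serotype' field. Records whose
    taxon ID is not in the table go into a hypothetical 'unassigned' group.
    """
    groups = defaultdict(list)
    for record in records:
        serotype = infer_serotype(record.get("virus-tax-id", ""))
        annotated = {**record, "ncbi_serotype": serotype}
        groups[serotype or "unassigned"].append(annotated)
    return dict(groups)


# Example with made-up accessions.
records = [
    {"accession": "A", "virus-tax-id": "11053"},
    {"accession": "B", "virus-tax-id": 11060},
    {"accession": "C", "virus-tax-id": "12637"},
]
groups = split_by_serotype(records)
```

Because every record is fetched once and then routed by its taxon ID, the per-serotype fetch described above is no longer needed; the same grouping key can drive both the sequence and the metadata outputs.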
