reichlab / virus-clade-utils

Create a function to retrieve nextclade dataset metadata for a specific date #12

Open bsweger opened 2 weeks ago

bsweger commented 2 weeks ago

Background

On the date of model input data processing, we want to save some metadata about the current nextclade SARS-CoV-2 dataset (i.e., the most current dataset listed when running nextclade dataset list --name sars-cov-2):

╭────────────────────────┬────────────────────────┬───────────────────────┬────────────────────────╮
│ name                   │ attributes             │ versions              │ capabilities           │
╞════════════════════════╪════════════════════════╪═══════════════════════╪════════════════════════╡
│ nextstrain/sars-cov-2/ │ "name"="SARS-CoV-2"    │ 2024-07-17--12-57-03Z │ clade (44)             │
│ wuhan-hu-1/orfs        │ "reference             │ 2024-07-03--08-29-55Z │ Nextclade_pango (3223) │
│ (shortcuts:            │ accession"="MN908947"  │ 2024-06-13--23-42-47Z │ clade_display (44)     │
│ "sars-cov-2", "nextstr │ "reference name"="Wuha │ 2024-04-25--01-03-07Z │ clade_nextstrain (44)  │
│ ain/sars-cov-2", "next │ n-Hu-1/2019"           │ 2024-04-15--15-08-22Z │ clade_who (13)         │
│ strain/sars-cov-2/wuha │                        │ 2024-02-16--04-00-32Z │ partiallyAliased       │
│ n-hu-1")               │                        │ 2024-01-16--20-31-02Z │ (3223)                 │
│                        │                        │                       │ qc.frameShifts         │
│                        │                        │                       │ qc.missingData         │
│                        │                        │                       │ qc.mixedSites          │
│                        │                        │                       │ qc.privateMutations    │
│                        │                        │                       │ qc.snpClusters         │
│                        │                        │                       │ qc.stopCodons          │
│                        │                        │                       │ mutLabels              │
╰────────────────────────┴────────────────────────┴───────────────────────┴────────────────────────╯

We'll need this information when it's time to score the models (~90 days after submission) to ensure that we use the correct reference tree for the clade assignments. This is also important information for reproducibility (e.g., the nextclade version used to generate the data).
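
To illustrate the "which dataset was current on a given date" lookup, here is a minimal sketch that picks the newest version tag at or before a date. It assumes the tags have already been collected into a list (the values below are copied from the versions column above) and that they follow the YYYY-MM-DD--HH-MM-SSZ pattern shown there. The names parse_version_tag and dataset_version_as_of are hypothetical and not part of virus-clade-utils.

```python
from datetime import datetime, timezone

# Assumed tag layout, inferred from the "versions" column above.
TAG_FORMAT = "%Y-%m-%d--%H-%M-%SZ"


def parse_version_tag(tag: str) -> datetime:
    """Convert a dataset version tag into a timezone-aware UTC datetime."""
    return datetime.strptime(tag, TAG_FORMAT).replace(tzinfo=timezone.utc)


def dataset_version_as_of(tags, as_of):
    """Return the newest tag at or before `as_of`, or None if there is none.

    `as_of` must be timezone-aware so it can be compared with parsed tags.
    """
    candidates = [t for t in tags if parse_version_tag(t) <= as_of]
    if not candidates:
        return None
    return max(candidates, key=parse_version_tag)


# Version tags copied from the table above.
tags = [
    "2024-07-17--12-57-03Z",
    "2024-07-03--08-29-55Z",
    "2024-06-13--23-42-47Z",
    "2024-04-25--01-03-07Z",
    "2024-04-15--15-08-22Z",
    "2024-02-16--04-00-32Z",
    "2024-01-16--20-31-02Z",
]

as_of = datetime(2024, 7, 10, tzinfo=timezone.utc)
print(dataset_version_as_of(tags, as_of))  # 2024-07-03--08-29-55Z
```

Returning None when no tag precedes the date leaves it to the caller to decide whether that is an error.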

Definition of done

Add a function to virus-clade-utils:

bsweger commented 2 weeks ago

metadata_version.json is stored in a versioned S3 bucket (nextstrain-data). To see the dates and versionIds:

aws s3api list-object-versions --bucket nextstrain-data --prefix files/ncov/open/metadata_version.json --no-sign-request

To view a specific versionId of the file:

https://nextstrain-data.s3.amazonaws.com/files/ncov/open/metadata_version.json?versionId=[versionid]
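
Putting the two steps together, the sketch below picks the versionId that was current on a given date and builds the matching URL. It assumes the output of the list-object-versions command above has been loaded as JSON and that its "Versions" entries (each with "VersionId" and "LastModified") are passed in as a list of dicts. The function names are hypothetical, and the sample versionIds and timestamps are made-up placeholders, not real values from the bucket.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# URL taken from the comment above, without the versionId query string.
BASE_URL = "https://nextstrain-data.s3.amazonaws.com/files/ncov/open/metadata_version.json"


def parse_last_modified(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def version_id_as_of(versions, as_of):
    """Return the VersionId of the newest object version at or before `as_of`.

    `as_of` must be timezone-aware. Returns None if no version is old enough.
    """
    eligible = [v for v in versions if parse_last_modified(v["LastModified"]) <= as_of]
    if not eligible:
        return None
    newest = max(eligible, key=lambda v: parse_last_modified(v["LastModified"]))
    return newest["VersionId"]


def versioned_url(version_id: str) -> str:
    """Build the URL for one specific version of the metadata file."""
    return f"{BASE_URL}?{urlencode({'versionId': version_id})}"


# Placeholder data: these IDs and timestamps are invented for illustration.
sample_versions = [
    {"VersionId": "EXAMPLE-ID-3", "LastModified": "2024-07-17T13:05:00+00:00"},
    {"VersionId": "EXAMPLE-ID-2", "LastModified": "2024-07-03T09:00:00+00:00"},
    {"VersionId": "EXAMPLE-ID-1", "LastModified": "2024-06-14T00:10:00+00:00"},
]

as_of = datetime(2024, 7, 10, tzinfo=timezone.utc)
version_id = version_id_as_of(sample_versions, as_of)
print(versioned_url(version_id))
```

Keeping the selection logic free of network calls means it can be tested against saved JSON; fetching the version list itself is left out of the sketch.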