reichlab / cladetime

Python interface for accessing Nextstrain SARS-CoV-2 sequence and clade data
https://cladetime.readthedocs.io
MIT License

Create a function to retrieve Nextclade ncov-ingest pipeline metadata at a point in time #12

Closed bsweger closed 1 month ago

bsweger commented 2 months ago

Background

Note: requires completion of #20.

On the date of model input data processing, we want to save some metadata about the current nextclade SARS-CoV-2 dataset (i.e., the most current dataset listed when running nextclade dataset list --name sars-cov-2):

╭────────────────────────┬────────────────────────┬───────────────────────┬────────────────────────╮
│ name                   │ attributes             │ versions              │ capabilities           │
╞════════════════════════╪════════════════════════╪═══════════════════════╪════════════════════════╡
│ nextstrain/sars-cov-2/ │ "name"="SARS-CoV-2"    │ 2024-07-17--12-57-03Z │ clade (44)             │
│ wuhan-hu-1/orfs        │ "reference             │ 2024-07-03--08-29-55Z │ Nextclade_pango (3223) │
│ (shortcuts:            │ accession"="MN908947"  │ 2024-06-13--23-42-47Z │ clade_display (44)     │
│ "sars-cov-2", "nextstr │ "reference name"="Wuha │ 2024-04-25--01-03-07Z │ clade_nextstrain (44)  │
│ ain/sars-cov-2", "next │ n-Hu-1/2019"           │ 2024-04-15--15-08-22Z │ clade_who (13)         │
│ strain/sars-cov-2/wuha │                        │ 2024-02-16--04-00-32Z │ partiallyAliased       │
│ n-hu-1")               │                        │ 2024-01-16--20-31-02Z │ (3223)                 │
│                        │                        │                       │ qc.frameShifts         │
│                        │                        │                       │ qc.missingData         │
│                        │                        │                       │ qc.mixedSites          │
│                        │                        │                       │ qc.privateMutations    │
│                        │                        │                       │ qc.snpClusters         │
│                        │                        │                       │ qc.stopCodons          │
│                        │                        │                       │ mutLabels              │
╰────────────────────────┴────────────────────────┴───────────────────────┴────────────────────────╯

We'll need this information when it's time to score the models (~90 days after submission) to ensure that we use the correct reference tree for the clade assignments. This is also important information for reproducibility (e.g., the nextclade version used to generate the data).

Definition of done

Update the function created in #20 (which retrieves the latest version of nextclade_metadata.json) so that it can retrieve the metadata as of a point in time.

bsweger commented 2 months ago

nextclade_metadata.json is stored in a versioned S3 bucket. To see the dates and versionIds:

aws s3api list-object-versions --bucket nextstrain-data --prefix files/ncov/open/metadata_version.json --no-sign-request

To view a specific versionId of the file:

https://nextstrain-data.s3.amazonaws.com/files/ncov/open/metadata_version.json?versionId=[versionid]
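
The two steps above (list the versions, then fetch one by versionId) could be combined to find the version that was current on a given date. The following is a minimal sketch, not the function from #20: it assumes the bucket answers anonymous ListObjectVersions requests over plain HTTPS (as the --no-sign-request call above suggests), it reads only the first page of results, and it ignores delete markers. The function names parse_versions, version_as_of, version_url, and fetch_versions are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import quote
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BUCKET_URL = "https://nextstrain-data.s3.amazonaws.com"
KEY = "files/ncov/open/metadata_version.json"
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def parse_versions(xml_data, key=KEY):
    """Return (last_modified, version_id) pairs for `key` from a
    ListObjectVersions XML response (str or bytes)."""
    root = ET.fromstring(xml_data)
    versions = []
    for node in root.findall("s3:Version", S3_NS):
        if node.findtext("s3:Key", namespaces=S3_NS) != key:
            continue  # a prefix query can also match other keys
        stamp = node.findtext("s3:LastModified", namespaces=S3_NS)
        # S3 timestamps look like 2024-07-17T12:57:03.000Z
        modified = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        versions.append((modified, node.findtext("s3:VersionId", namespaces=S3_NS)))
    return versions


def version_as_of(versions, as_of):
    """Return the versionId that was current at `as_of` (a timezone-aware
    datetime), or None if no version existed yet."""
    eligible = [v for v in versions if v[0] <= as_of]
    if not eligible:
        return None
    return max(eligible, key=lambda v: v[0])[1]


def version_url(version_id, key=KEY):
    """Build the URL for one specific version of the file."""
    return f"{BUCKET_URL}/{key}?versionId={quote(version_id, safe='')}"


def fetch_versions(key=KEY):
    """Fetch the first page of versions for `key` (assumption: anonymous
    access is allowed and one page is enough; pagination is not handled)."""
    url = f"{BUCKET_URL}/?versions&prefix={quote(key, safe='')}"
    with urlopen(url) as resp:
        return parse_versions(resp.read(), key)
```

For example, version_as_of(fetch_versions(), datetime(2024, 7, 4, tzinfo=timezone.utc)) would return the versionId of the file as it stood on that date, and version_url() turns that ID into a link of the form shown above.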