Open bsweger opened 2 weeks ago
The original plan was to use the process described in #7 to get a SARS-CoV-2 reference tree at a specific point in time (@elray1 and I worked out that process in consultation with some nextclade folks).
However, those steps rely on the nextclade cli to download a dataset package and extract a reference tree.
We can make that work for pipeline-based processes (e.g., generating target data for the variant nowcast hub). However, now that we've also re-purposed cladetime to support more interactive use cases, relying on the cli becomes an impediment to project setup.
As an alternative to using the nextclade cli, some sleuthing reveals that nextclade datasets can be accessed directly via the following URL pattern:
https://data.clades.nextstrain.org/v3/<dataset name>/<dataset version>/<dataset file>
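In other words, to get the reference tree for the 2024-09-25--21-50-30Z version of the nextstrain/sars-cov-2/wuhan-hu-1/orfs dataset, substitute that dataset name and version into the pattern above, followed by the name of the tree file.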
(we get the correct dataset version from the metadata for nextclade's ingest workflow, for example: https://nextstrain-data.s3.amazonaws.com/files/ncov/open/metadata_version.json?versionId=uYgLvuF9AKtkbyAUz95KmCTojm1QQ7BX)
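As a rough sketch of the URL pattern above, something like the following could build and fetch a dataset file without the CLI. The helper names (`dataset_file_url`, `fetch_json`) are made up for illustration, and `tree.json` is an assumed name for the reference tree file, so double-check it against the dataset's actual contents before relying on it.

```python
import json
from urllib.request import urlopen

# Base of the URL pattern described above.
NEXTCLADE_DATA_BASE = "https://data.clades.nextstrain.org/v3"


def dataset_file_url(dataset_name: str, dataset_version: str, dataset_file: str) -> str:
    """Fill in <dataset name>/<dataset version>/<dataset file> (hypothetical helper)."""
    return "/".join(
        [NEXTCLADE_DATA_BASE, dataset_name.strip("/"), dataset_version, dataset_file]
    )


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Download and parse a JSON file over https (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


tree_url = dataset_file_url(
    "nextstrain/sars-cov-2/wuhan-hu-1/orfs",
    "2024-09-25--21-50-30Z",
    "tree.json",  # ASSUMPTION: file name of the reference tree within the dataset
)
print(tree_url)

# Network call left commented out so the sketch runs offline:
# reference_tree = fetch_json(tree_url)
```

Because this bypasses the CLI, it depends only on the standard library, but it inherits the risk already noted: if the URL layout changes on the nextclade side, the helper breaks.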
@elray1 what do you think? Using direct https requests rather than going through the CLI presumably puts us at risk if nextclade makes breaking changes on their backend. But it would make the cladetime package much more usable.
The other, longer-term option might be asking the nextclade team to make their CLI available via PyPI?
To simplify the process of creating target data, add a method to CladeTime that gets a reference tree based on its tree_as_of attribute.

Definition of Done
Once complete, users should be able to do something like this: