reichlab / cladetime

Documentation
https://cladetime.readthedocs.io
MIT License

Add a CladeTime method to return an "as_of" reference tree #28

Open bsweger opened 2 weeks ago

bsweger commented 2 weeks ago

To simplify the process of creating target data, add a method to CladeTime that gets a reference tree based on its tree_as_of attribute.

Definition of Done

Once complete, users should be able to do something like this:

from virus_clade_utils.cladetime import CladeTime

ct = CladeTime(sequence_as_of="2024-09-01", tree_as_of="2024-07-31")

# get reference tree that was in use as of 2024-07-31:
ct.reference_tree()

bsweger commented 1 day ago

Getting the correct reference tree

The original plan was to use the process described in #7 to get a SARS-CoV-2 reference tree at a specific point in time (@elray1 and I worked out that process in consultation with some nextclade folks).

However, those steps rely on the nextclade CLI to download a dataset package and extract a reference tree.

We can make that work for pipeline-based processes (e.g., generating target data for the variant nowcast hub). However, now that we've also re-purposed cladetime to support more interactive use cases, requiring users to install the CLI becomes an impediment to project setup.

New proposal

As an alternative to using the nextclade CLI, some sleuthing reveals that nextclade datasets can be accessed directly via the following URL pattern:

https://data.clades.nextstrain.org/v3/<dataset name>/<dataset version>/<dataset file>

For example, to get the reference tree for the 2024-09-25--21-50-30Z version of the nextstrain/sars-cov-2/wuhan-hu-1/orfs dataset:

https://data.clades.nextstrain.org/v3/nextstrain/sars-cov-2/wuhan-hu-1/orfs/2024-09-25--21-50-30Z/tree.json

(We get the correct dataset version from the metadata for nextclade's ingest workflow, for example: https://nextstrain-data.s3.amazonaws.com/files/ncov/open/metadata_version.json?versionId=uYgLvuF9AKtkbyAUz95KmCTojm1QQ7BX)
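To make the proposal concrete, here is a minimal sketch of the URL pattern above in Python, using only the standard library. The function names (`build_dataset_file_url`, `fetch_reference_tree`) and the `NEXTCLADE_BASE_URL` constant are hypothetical and not part of cladetime; the sketch also assumes the URL pattern stays stable, which nextclade does not guarantee.

```python
import json
from urllib.request import urlopen

# Base URL taken from the pattern described above (not an officially documented API).
NEXTCLADE_BASE_URL = "https://data.clades.nextstrain.org/v3"


def build_dataset_file_url(dataset_name, dataset_version, dataset_file="tree.json"):
    """Build <base>/<dataset name>/<dataset version>/<dataset file>.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Strip stray slashes so the joined URL has exactly one "/" between parts.
    name = dataset_name.strip("/")
    return f"{NEXTCLADE_BASE_URL}/{name}/{dataset_version}/{dataset_file}"


def fetch_reference_tree(dataset_name, dataset_version, timeout=30):
    """Download tree.json for a dataset version and parse it as JSON.

    Hypothetical helper: raises urllib.error.URLError or HTTPError if the
    request fails, for example if the backend URL layout changes.
    """
    url = build_dataset_file_url(dataset_name, dataset_version)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building the URL for the example dataset version needs no network access.
example_url = build_dataset_file_url(
    "nextstrain/sars-cov-2/wuhan-hu-1/orfs",
    "2024-09-25--21-50-30Z",
)
```

Keeping URL construction separate from the download means the pattern lives in one place, so a backend change on the nextclade side would only require updating `build_dataset_file_url`.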

@elray1 what do you think? Using direct HTTPS requests rather than going through the CLI presumably puts us at risk if nextclade makes breaking changes on their backend. But it would make the cladetime package much more usable.

The other, longer-term option might be asking the nextclade team to make their CLI available via PyPI?