We don't have to download sequence metadata files from S3 before working with them, so this PR adds an attribute to the CladeTime class that exposes a Polars LazyFrame pointing to a Nextstrain sequence metadata file.
Next Step
Once this new feature is merged, we can add code to variant-nowcast-hub that instantiates a CladeTime object and uses the LazyFrame reference to create the location/data information outlined in the issue linked under Background.
Testing
To test this new feature as a code reviewer, you'll need to install virus_clade_utils from this feature branch. Then, from a Python session:
import polars as pl
from virus_clade_utils.cladetime import CladeTime

# Get a CladeTime object for the most recent Nextstrain sequence metadata
ct = CladeTime()

# ct.sequence_metadata is the new attribute (a LazyFrame)
filtered_metadata = (
    ct.sequence_metadata
    .select(["country", "division", "host", "date", "clade_nextstrain"])
    .filter(
        pl.col("country") == "USA"
    )
).collect()
Background
This is the first step towards saving daily sequence counts by location: https://github.com/reichlab/variant-nowcast-hub/issues/50