reichlab / cladetime

Documentation
https://cladetime.readthedocs.io
MIT License

Add a sequence_metadata attribute to CladeTime #27

Closed bsweger closed 2 weeks ago

bsweger commented 2 weeks ago

Background

This is the first step towards saving daily sequence counts by location: https://github.com/reichlab/variant-nowcast-hub/issues/50

We don't have to download sequence metadata files from S3 before working with them, so this PR adds an attribute to the CladeTime class that exposes a Polars LazyFrame pointing to a Nextstrain sequence metadata file.

Next Step

Once this new feature is merged, we can add code to variant-nowcast-hub to instantiate a CladeTime object and use the LazyFrame reference to create the location/date information outlined in the above issue.

Testing

To test this new feature as a code reviewer, you'll need to install virus_clade_utils from this feature branch:

pip install "git+https://github.com/reichlab/virus-clade-utils.git@bsweger/sequence-by-state-date/50"

Then from a Python session:

import polars as pl
from virus_clade_utils.cladetime import CladeTime

# Get a CladeTime object for the most recent Nextstrain sequence metadata
ct = CladeTime()

# ct.sequence_metadata is the new attribute (a LazyFrame)
filtered_metadata = (
    ct.sequence_metadata
    .select(["country", "division", "host", "date", "clade_nextstrain"])
    .filter(
        pl.col("country") == "USA"
    )
).collect()