A package to perform data transformations on hubverse model-output files.
The package contains a ModelOutputHandler class that reads, transforms, and writes a single hubverse-compliant model-output file.
Currently, its primary purpose is to run as an AWS Lambda function that transforms model-output files uploaded to a hub's S3 bucket.
To install this package:
pip install git+https://github.com/hubverse-org/hubverse-transform.git
Sample usage:
from hubverse_transform.model_output import ModelOutputHandler
# to use with a local model-output file
mo = ModelOutputHandler(
'~/code/hubverse-cloud/model-output/UMass-flusion/2023-10-14-UMass-flusion.csv',
'/.'
)
# read the original model-output file into an Arrow table
original_file = mo.read_file()
# add new columns to the original model_output data
transformed_data = mo.add_columns(original_file)
# write transformed data to parquet
# TODO: fix this up for local filesystem (it's currently designed for S3 writes)
# mo.write(transformed_data)
Sample output of the original and transformed data:
In [31]: original_file.take([0,1])
Out[31]:
pyarrow.Table
reference_date: date32[day]
location: string
horizon: int64
target: string
target_end_date: date32[day]
output_type: string
output_type_id: double
value: double
----
reference_date: [[2023-10-14,2023-10-14]]
location: [["01","01"]]
horizon: [[0,0]]
target: [["wk inc flu hosp","wk inc flu hosp"]]
target_end_date: [[2023-10-14,2023-10-14]]
output_type: [["quantile","quantile"]]
output_type_id: [[0.01,0.025]]
value: [[0,1.5810684371620558]]
In [36]: transformed_data.take([0,1])
Out[36]:
pyarrow.Table
reference_date: date32[day]
location: string
horizon: int64
target: string
target_end_date: date32[day]
output_type: string
output_type_id: double
value: double
round_id: string
model_id: string
----
reference_date: [[2023-10-14,2023-10-14]]
location: [["01","01"]]
horizon: [[0,0]]
target: [["wk inc flu hosp","wk inc flu hosp"]]
target_end_date: [[2023-10-14,2023-10-14]]
output_type: [["quantile","quantile"]]
output_type_id: [[0.01,0.025]]
value: [[0,1.5810684371620558]]
round_id: [["2023-10-14","2023-10-14"]]
model_id: [["UMass-flusion","UMass-flusion"]]
...
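The round_id and model_id values in the transformed output above match the two parts of the sample file name (2023-10-14-UMass-flusion.csv). The following is a minimal sketch of how such a split could be done. It assumes file names follow the pattern <YYYY-MM-DD>-<model_id>.<extension>. The function name parse_model_output_name is hypothetical and is not part of the hubverse_transform API, and the package's own logic may differ.

```python
from datetime import date
from pathlib import PurePosixPath


def parse_model_output_name(path: str) -> tuple[str, str]:
    """Split '<YYYY-MM-DD>-<model_id>.<ext>' into (round_id, model_id).

    Hypothetical helper for illustration only; it is not part of
    hubverse_transform.
    """
    # Drop the directories and the extension, e.g. "2023-10-14-UMass-flusion"
    stem = PurePosixPath(path).stem

    # Assumption: the first 10 characters are an ISO date, followed by "-"
    round_id, separator, model_id = stem[:10], stem[10:11], stem[11:]

    # Raises ValueError if round_id is not a valid YYYY-MM-DD date
    date.fromisoformat(round_id)
    if separator != "-" or not model_id:
        raise ValueError(f"unexpected model-output file name: {path}")

    return round_id, model_id


round_id, model_id = parse_model_output_name(
    "~/code/hubverse-cloud/model-output/UMass-flusion/2023-10-14-UMass-flusion.csv"
)
print(round_id, model_id)
```

Because model IDs can themselves contain hyphens, the sketch slices off the fixed-width date rather than splitting on every "-".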
If you'd like to contribute, this section has the setup instructions.
Prerequisites
Python 3.12
Note: There are several options for installing Python on your machine:
A way to manage Python virtual environments
There are many tools for managing Python virtual environments. The setup instructions below use venv, which comes with Python, but if you prefer another virtual environment management tool, feel free to use it.
Setup
Follow the directions below to set this project up on your local machine.
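Make sure your Python version satisfies the requires-python constraint in pyproject.toml.
From the root of the hubverse-transform repository, create a Python virtual environment:
python -m venv .venv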
Activate the virtual environment:
# macOS/Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate
Install the package in editable mode, along with its development dependencies:
pip install -e . && pip install -r requirements/requirements-dev.txt
Verify that everything is working by running the test suite:
pytest
Because we want a robust lockfile to use for reproducible builds, adding dependencies to the project is a multi-step process. Here we use uv to resolve and install the project's dependencies. However, pip-tools will also work (uv is a drop-in replacement for pip-tools and is much faster).
Prerequisites:
Add the new dependency to pyproject.toml (don't be too prescriptive about versions):
Dependencies that hubverse_transform needs in order to run should be added to the dependencies section.
Dependencies needed only for development should be added to the dev section of project.optional-dependencies.
Generate updated requirements files:
uv pip compile pyproject.toml -o requirements/requirements.txt && uv pip compile pyproject.toml --extra dev -o requirements/requirements-dev.txt
Update project dependencies:
Note: This package was originally developed on macOS. If you have trouble installing the dependencies, uv pip sync has a --python-platform flag that can be used to specify the platform.
# note: requirements-dev.txt contains the base requirements AND the dev requirements
#
# using pip
pip install -r requirements/requirements-dev.txt
#
# alternatively, you can use uv to install the dependencies: it is faster and has
# a handy sync option that will clean up unused dependencies
uv pip sync requirements/requirements-dev.txt
Temporary: next step is to deploy updates to the lambda package via GitHub Actions
To package the hubverse_transform code for deployment to the hubverse-transform-model-output AWS Lambda function:
Make sure you have access to the hubverse-assets S3 bucket.
Run the deployment script:
source deploy_lambda.sh
If you need to re-run the hubverse-transform function on model-output files that have already been uploaded to S3, you can use the lambda_retrigger_model_output_add.py script in this repo's faas/ folder.
This manual action should be done with care, but it can be handy if data needs to be re-processed (after a hubverse-transform bug fix, for example). The script works by updating the S3 metadata for every file in the raw/model-output folder of the hub's S3 bucket. Each metadata update then triggers the lambda function that runs when new incoming model-output files are detected.
Note: You will need write access to the hub's S3 bucket to use this script.
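To illustrate the retrigger mechanism described above, here is a minimal sketch, not the actual lambda_retrigger_model_output_add.py script. It assumes a boto3 S3 client is passed in and that copying each object onto itself with replaced metadata is what generates the event the lambda listens for. The function name retrigger_model_output and the retriggered-at metadata key are hypothetical.

```python
from datetime import datetime, timezone


def retrigger_model_output(s3_client, bucket: str, prefix: str = "raw/model-output/") -> list[str]:
    """Rewrite the metadata of every object under `prefix` and return the keys touched.

    `s3_client` is expected to behave like boto3.client("s3").
    Hypothetical sketch; the real script in faas/ may work differently.
    """
    stamp = datetime.now(timezone.utc).isoformat()
    touched = []

    # list_objects_v2 returns at most 1000 keys per call, so paginate
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip "folder" placeholder objects

            # Copy the object onto itself; MetadataDirective="REPLACE" is
            # required for an in-place copy and overwrites any existing
            # user-defined metadata on the object.
            s3_client.copy_object(
                Bucket=bucket,
                Key=key,
                CopySource={"Bucket": bucket, "Key": key},
                Metadata={"retriggered-at": stamp},
                MetadataDirective="REPLACE",
            )
            touched.append(key)

    return touched
```

With real credentials, this could be called as retrigger_model_output(boto3.client("s3"), "<your-hub-bucket>"), where the bucket name is a placeholder. Try it on a narrow prefix first, since every touched file is re-processed.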