nextstrain / forecasts-ncov

SARS-CoV-2 variant growth rates and frequency forecasts
https://nextstrain.org/sars-cov-2/forecasts/

Stream metadata from S3, filter, and compress #92

Closed by huddlej 5 months ago

huddlej commented 5 months ago

Description of proposed changes

Instead of downloading the complete metadata files from S3, uncompressing them to disk, filtering the uncompressed file down to the subset of columns we want (written out as another uncompressed file), and then deleting the full metadata, this commit proposes streaming the original metadata from S3 through the filter step and writing the subset of metadata directly to a compressed file.

This change replaces the vendored download-from-s3 script with aws s3 cp, since the latter can stream data to stdout while the former cannot (it mixes its log messages into stdout).
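For concreteness, a minimal sketch of what the streaming pipeline looks like; the S3 path, column list, and choice of tools (zstd for decompression/compression, tsv-select for column filtering) are illustrative assumptions rather than the exact commands in this workflow:

```bash
# Stream the full metadata from S3 through a column filter into a compressed
# subset, without ever writing the uncompressed metadata to disk.
# NOTE: the bucket path, output path, columns, and tools below are assumptions
# for illustration; the actual rule may differ.
aws s3 cp s3://nextstrain-data/files/ncov/open/metadata.tsv.zst - \
  | zstd -T0 -dcq \
  | tsv-select -H --fields strain,date,country,division,clade_nextstrain \
  | zstd -T0 -cq > data/open_metadata_filtered.tsv.zst
```

Because nothing in the pipe is written to disk except the final compressed subset, peak local storage drops from the size of the full metadata to the size of the filtered output.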

In testing with the "open" ncov dataset, this change avoids storing an 11 GB full metadata file locally and produces a 71 MB compressed subset metadata file in roughly the same amount of processing time.

Related issue

Closes #80


huddlej commented 5 months ago

@joverlee521 This is just a suggestion PR after I tried and failed to run the ingest workflow on my work laptop because I didn't have enough disk space to even process the open data. If this seems like a terrible idea for other reasons I'm naive about, don't hesitate to close this.

huddlej commented 5 months ago

Thanks, @joverlee521! The trial run looks like it ran as expected, so I will merge this. I don't see a notification in the #nextstrain-counts-updates Slack channel, but maybe that's ok?

joverlee521 commented 5 months ago

> I don't see a notification in the #nextstrain-counts-updates Slack channel, but maybe that's ok?

Ah yeah, I used the #scratch channel when I triggered the trial run so the test notifications wouldn't mix in with the "real" ones. I see they came through just fine!