nextstrain / forecasts-ncov

SARS-CoV-2 variant growth rates and frequency forecasts
https://nextstrain.org/sars-cov-2/forecasts/

Reduce disk write of 20+GB metadata file by filtering on the fly #80

Closed corneliusroemer closed 8 months ago

corneliusroemer commented 10 months ago

Context

Ingest runs quite slowly, partly because it writes around 20GB to disk rather than streaming directly into tsv-filter. The relevant rules are here: https://github.com/nextstrain/forecasts-ncov/blob/d051c57cdea7b174e6fcff9e890283a431a67879/ingest/rules/sequence_counts.smk#L6-L33

The two rules could be merged into one that does the filtering on the fly.
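
A minimal sketch of the on-the-fly idea, written in Python rather than the workflow's actual shell commands. Rows are read from a stream, filtered, and written straight to the output, so the full metadata never lands on disk. The function name `filter_tsv_stream` and the `country` column used in the note below are hypothetical and not taken from the repository.

```python
from typing import IO, Collection


def filter_tsv_stream(
    src: IO[str],
    dst: IO[str],
    column: str,
    allowed: Collection[str],
) -> int:
    """Stream TSV rows from src to dst, keeping rows whose `column` is in `allowed`.

    Only one line is held in memory at a time and no intermediate
    file is written. Returns the number of data rows kept.
    """
    header = src.readline()
    if not header:
        return 0  # empty input: nothing to filter

    # Locate the filter column once, from the header row.
    index = header.rstrip("\n").split("\t").index(column)
    dst.write(header)

    kept = 0
    for line in src:
        fields = line.rstrip("\n").split("\t")
        # Skip short/malformed rows rather than raising IndexError.
        if len(fields) > index and fields[index] in allowed:
            dst.write(line)
            kept += 1
    return kept
```

With a decompressor piped into the script's stdin, this could be called as `filter_tsv_stream(sys.stdin, sys.stdout, "country", {"USA"})`, which plays the same role as piping the download directly into tsv-filter.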

joverlee521 commented 10 months ago

This will require updating the vendored download-from-s3 script to support streaming to stdout, so that its output can be piped to tsv-filter.

corneliusroemer commented 10 months ago

Quick look suggests it might work with /dev/stdout?
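
A rough illustration of the `/dev/stdout` idea, assuming a Unix-like system. A tool that only accepts an output path can be handed `/dev/stdout`, and whatever it writes then flows into the pipe instead of a file. The `WRITER` snippet is a hypothetical stand-in for such a tool, not the actual download-from-s3 script.

```python
import subprocess
import sys

# Hypothetical stand-in for a tool that only knows how to write to a path.
WRITER = """
import sys
with open(sys.argv[1], "w") as out:
    out.write("strain\\tcountry\\nA\\tUSA\\n")
"""


def stream_via_dev_stdout() -> str:
    """Run the writer with /dev/stdout as its output path and capture the pipe."""
    result = subprocess.run(
        [sys.executable, "-c", WRITER, "/dev/stdout"],
        capture_output=True,  # stdout of the child is a pipe, not a file
        text=True,
        check=True,
    )
    return result.stdout
```

This only works when the tool writes sequentially to the given path. A tool that seeks within the file, or writes to a temporary file and renames it, would not stream this way.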