nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and GenBank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.

Do we need `uncompress_cog_uk_metadata` and `compress_cog_uk_metadata`? #450


joverlee521 commented 2 weeks ago

Context

A question that came up as I was working on #240: Do we need to uncompress/compress the COG UK metadata during the workflow?

The `transform_genbank_metadata` rule uses the gzipped COG-UK metadata file directly. I do not see any other rule consuming the uncompressed COG-UK metadata as input, so it seems we are uncompressing and recompressing solely to keep a zstd-compressed copy on AWS S3.
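
For context on why no uncompressed copy is needed downstream: common tooling reads gzipped CSVs transparently. A minimal sketch (this is not the actual transform script, and the path is hypothetical) — pandas infers the compression from the `.csv.gz` extension:

```python
import pandas as pd

# pandas infers gzip compression from the .csv.gz extension, so the
# COG-UK metadata can be consumed directly with no uncompress step.
# Hypothetical path, for illustration only.
cog_uk_metadata = pd.read_csv(
    "data/cog_uk/cog_metadata.csv.gz",
    dtype="string",
    na_filter=False,
)
print(cog_uk_metadata.shape)
```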

It's not clear how many resources these jobs actually consume, since we don't have benchmark files (yet!). I'll revisit this question once we have more data from workflow runs.
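
For what it's worth, Snakemake's `benchmark:` directive would give us exactly that data — each run writes a TSV with wall-clock time, max RSS, and I/O for the job. A minimal sketch, assuming the rule looks roughly like this (the input/output paths and shell command are guesses, not the real rule body):

```python
rule uncompress_cog_uk_metadata:
    input:
        "data/cog_uk/cog_metadata.csv.gz",
    output:
        "data/cog_uk/cog_metadata.csv",
    # Snakemake writes per-run resource stats (runtime, max RSS, I/O)
    # to this TSV, which would answer how costly this step really is.
    benchmark:
        "benchmarks/uncompress_cog_uk_metadata.txt"
    shell:
        "gunzip --stdout {input} > {output}"
```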

joverlee521 commented 2 weeks ago

Ah, this might also be a result of our `upload-to-s3` and `download-from-s3` scripts not having an option to skip compression during transfer.
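
If the only goal is the zstd copy on S3, one alternative (a sketch, not what the scripts currently do) would be to recompress the stream directly, so the uncompressed data never hits the disk. Assuming the `zstandard` package; paths are hypothetical:

```python
import gzip
import zstandard


def gzip_to_zstd(src: str, dst: str) -> None:
    """Recompress a gzipped file to zstd by streaming, avoiding a
    separate uncompress step and an uncompressed file on disk."""
    cctx = zstandard.ZstdCompressor()
    with gzip.open(src, "rb") as ifh, open(dst, "wb") as ofh:
        # copy_stream reads decompressed bytes from ifh and writes
        # zstd-compressed bytes to ofh in fixed-size chunks.
        cctx.copy_stream(ifh, ofh)


# Hypothetical paths, for illustration only.
gzip_to_zstd("data/cog_uk/cog_metadata.csv.gz",
             "data/cog_uk/cog_metadata.csv.zst")
```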