nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and GenBank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.

Do we need `uncompress_cog_uk_metadata` and `compress_cog_uk_metadata`? #450


joverlee521 commented 2 weeks ago

Context

A question that came up as I was working on #240: Do we need to uncompress/compress the COG UK metadata during the workflow?

The `transform_genbank_metadata` rule uses the gzipped COG-UK metadata file directly. I do not see any other rule consuming the uncompressed COG-UK metadata as input, so it seems we are uncompressing and recompressing solely to keep a zstd-compressed copy on AWS S3.
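
For context on why no uncompressed copy is needed downstream: common tooling reads gzipped CSVs transparently. A minimal sketch (this is not the actual transform script, and the path is hypothetical) — pandas infers the compression from the `.csv.gz` extension:

```python
import pandas as pd

# pandas infers gzip compression from the .csv.gz extension, so the
# COG-UK metadata can be consumed directly with no uncompress step.
# Hypothetical path, for illustration only.
cog_uk_metadata = pd.read_csv(
    "data/cog_uk/cog_metadata.csv.gz",
    dtype="string",
    na_filter=False,
)
print(cog_uk_metadata.shape)
```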

It's not clear how many resources these jobs actually consume, since we don't have benchmark files (yet!). I'll revisit this question once we have more data from workflow runs.
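
For what it's worth, Snakemake's `benchmark:` directive would give us exactly that data — each run writes a TSV with wall-clock time, max RSS, and I/O for the job. A minimal sketch, assuming the rule looks roughly like this (the input/output paths and shell command are guesses, not the real rule body):

```python
rule uncompress_cog_uk_metadata:
    input:
        "data/cog_uk/cog_metadata.csv.gz",
    output:
        "data/cog_uk/cog_metadata.csv",
    # Snakemake writes per-run resource stats (runtime, max RSS, I/O)
    # to this TSV, which would answer how costly this step really is.
    benchmark:
        "benchmarks/uncompress_cog_uk_metadata.txt"
    shell:
        "gunzip --stdout {input} > {output}"
```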

joverlee521 commented 2 weeks ago

Ah, this might also be a result of our `upload-to-s3` and `download-from-s3` scripts not having an option to skip compression during transfer.
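
If the only goal is the zstd copy on S3, one alternative (a sketch, not what the scripts currently do) would be to recompress the stream directly, so the uncompressed data never hits the disk. Assuming the `zstandard` package; paths are hypothetical:

```python
import gzip
import zstandard


def gzip_to_zstd(src: str, dst: str) -> None:
    """Recompress a gzipped file to zstd by streaming, avoiding a
    separate uncompress step and an uncompressed file on disk."""
    cctx = zstandard.ZstdCompressor()
    with gzip.open(src, "rb") as ifh, open(dst, "wb") as ofh:
        # copy_stream reads decompressed bytes from ifh and writes
        # zstd-compressed bytes to ofh in fixed-size chunks.
        cctx.copy_stream(ifh, ofh)


# Hypothetical paths, for illustration only.
gzip_to_zstd("data/cog_uk/cog_metadata.csv.gz",
             "data/cog_uk/cog_metadata.csv.zst")
```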