nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and GenBank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.

GISAID ingest keeps running out of memory #217

Open eharkins opened 2 years ago

eharkins commented 2 years ago

These workflows keep running out of memory on AWS and getting killed, e.g. https://github.com/nextstrain/ncov-ingest/runs/3764464129?check_suite_focus=true.

This likely happens during the run of https://github.com/nextstrain/ncov-ingest/blob/master/bin/transform-gisaid, since it takes gisaid.ndjson (the raw GISAID full dataset, over 100 GB) as input and performs a series of transformations on it.
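
For context, the core of that transform is meant to be streaming, i.e. constant-memory regardless of how large gisaid.ndjson grows. A minimal sketch of that shape (the field name and transform body below are placeholders, not the actual bin/transform-gisaid logic):

```python
# Minimal sketch of a constant-memory ndjson transform (illustrative only;
# the real logic lives in bin/transform-gisaid).
import json
import sys

def transform(record: dict) -> dict:
    # Placeholder for the real cleaning/renaming steps.
    record["strain"] = record.get("covv_virus_name", "").strip()
    return record

def main() -> None:
    # Reading and writing one record per line keeps memory usage flat,
    # no matter how big the input file is.
    for line in sys.stdin:
        record = json.loads(line)
        print(json.dumps(transform(record)))

if __name__ == "__main__":
    main()
```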

To avoid continually increasing the resources we request for the batch job, here are some ideas:

@tsibley said:

We should also understand why the memory needs have increased even though the core of the ETL is streaming, and maybe also consider running these on the m5 instance family instead of c5 (which could be as small a change as adding m5 instances to the job queue used).
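
A hedged sketch of what that change could look like via boto3, assuming a managed EC2 compute environment; every name, subnet, role, and vCPU limit below is a placeholder, not our actual AWS setup:

```python
# Hypothetical sketch: add an m5-backed compute environment to the existing
# job queue so memory-heavy jobs can land on m5 instead of c5.
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="ncov-ingest-m5",        # placeholder name
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 64,                             # placeholder limit
        "instanceTypes": ["m5"],                    # higher memory:vCPU ratio than c5
        "subnets": ["subnet-aaaaaaaa"],             # placeholder
        "securityGroupIds": ["sg-aaaaaaaa"],        # placeholder
        "instanceRole": "ecsInstanceRole",          # placeholder
    },
    serviceRole="AWSBatchServiceRole",              # placeholder
)

batch.update_job_queue(
    jobQueue="ncov-ingest",                         # placeholder queue name
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "ncov-ingest-c5"},  # existing, placeholder
        {"order": 2, "computeEnvironment": "ncov-ingest-m5"},
    ],
)
```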

@rneher said:

There are some lengthy compression steps that could happen in parallel with the rest of the pipeline; the gzip compression alone takes about 1 hour. Changing the compression of the ndjson to xz -2 already saved a lot.
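
A rough sketch of overlapping that compression with downstream work, by running xz -2 in a background subprocess and only waiting on it when the archive is actually needed; file names and the downstream step are placeholders:

```python
# Rough sketch: compress the ndjson in the background instead of blocking
# the rest of the pipeline on it.
import subprocess

def run_downstream_transforms(path: str) -> None:
    # Placeholder for the rest of the pipeline (transform-gisaid, etc.).
    pass

def main() -> None:
    # xz -2 trades some compression ratio for speed; -T0 uses all cores.
    compress = subprocess.Popen(
        ["xz", "-2", "-T0", "--keep", "--force", "gisaid.ndjson"],
    )

    run_downstream_transforms("gisaid.ndjson")

    # Only wait for the archive right before it would be uploaded to S3.
    compress.wait()
    if compress.returncode != 0:
        raise RuntimeError("xz compression failed")

if __name__ == "__main__":
    main()
```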

eharkins commented 2 years ago

The last increase in the memory we request was earlier this week: https://github.com/nextstrain/ncov-ingest/commit/5db5d2574210d60b4ae6434248af64dd3187a781. Maybe we should raise it again for now while we implement a more scalable solution?