nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and GenBank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.

GISAID ingest keeps running out of memory #217

Open eharkins opened 2 years ago

eharkins commented 2 years ago

These workflows keep running out of memory on AWS and getting killed, e.g. https://github.com/nextstrain/ncov-ingest/runs/3764464129?check_suite_focus=true.

This likely happens during the run of https://github.com/nextstrain/ncov-ingest/blob/master/bin/transform-gisaid, since it takes gisaid.ndjson (the raw GISAID full dataset, over 100 GB) as input and performs a series of transformations on it.
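
For context, the core of that transform is meant to be streaming, i.e. constant-memory regardless of how large gisaid.ndjson grows. A minimal sketch of that shape (the field name and transform body below are placeholders, not the actual bin/transform-gisaid logic):

```python
# Minimal sketch of a constant-memory ndjson transform (illustrative only;
# the real logic lives in bin/transform-gisaid).
import json
import sys

def transform(record: dict) -> dict:
    # Placeholder for the real cleaning/renaming steps.
    record["strain"] = record.get("covv_virus_name", "").strip()
    return record

def main() -> None:
    # Reading and writing one record per line keeps memory usage flat,
    # no matter how big the input file is.
    for line in sys.stdin:
        record = json.loads(line)
        print(json.dumps(transform(record)))

if __name__ == "__main__":
    main()
```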

To avoid continually increasing the resources we request for the batch job, here are some ideas:

@tsibley said:

We should also understand why the memory needs have increased even though the core of the ETL is streaming, and maybe also consider running these on the m5 instance family instead of c5 (which could be as small a change as adding m5 instances to the job queue used).
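
A hedged sketch of what that change could look like via boto3, assuming a managed EC2 compute environment; every name, subnet, role, and vCPU limit below is a placeholder, not our actual AWS setup:

```python
# Hypothetical sketch: add an m5-backed compute environment to the existing
# job queue so memory-heavy jobs can land on m5 instead of c5.
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="ncov-ingest-m5",        # placeholder name
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 64,                             # placeholder limit
        "instanceTypes": ["m5"],                    # higher memory:vCPU ratio than c5
        "subnets": ["subnet-aaaaaaaa"],             # placeholder
        "securityGroupIds": ["sg-aaaaaaaa"],        # placeholder
        "instanceRole": "ecsInstanceRole",          # placeholder
    },
    serviceRole="AWSBatchServiceRole",              # placeholder
)

batch.update_job_queue(
    jobQueue="ncov-ingest",                         # placeholder queue name
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "ncov-ingest-c5"},  # existing, placeholder
        {"order": 2, "computeEnvironment": "ncov-ingest-m5"},
    ],
)
```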

@rneher said:

There are some lengthy compression steps that could happen in parallel with the rest of the pipeline; the gzip compression alone takes about 1 hour. Changing the compression of the ndjson to xz -2 already saved a lot.
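
A rough sketch of overlapping that compression with downstream work, by running xz -2 in a background subprocess and only waiting on it when the archive is actually needed; file names and the downstream step are placeholders:

```python
# Rough sketch: compress the ndjson in the background instead of blocking
# the rest of the pipeline on it.
import subprocess

def run_downstream_transforms(path: str) -> None:
    # Placeholder for the rest of the pipeline (transform-gisaid, etc.).
    pass

def main() -> None:
    # xz -2 trades some compression ratio for speed; -T0 uses all cores.
    compress = subprocess.Popen(
        ["xz", "-2", "-T0", "--keep", "--force", "gisaid.ndjson"],
    )

    run_downstream_transforms("gisaid.ndjson")

    # Only wait for the archive right before it would be uploaded to S3.
    compress.wait()
    if compress.returncode != 0:
        raise RuntimeError("xz compression failed")

if __name__ == "__main__":
    main()
```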

eharkins commented 2 years ago

The last increase in the memory we request was earlier this week: https://github.com/nextstrain/ncov-ingest/commit/5db5d2574210d60b4ae6434248af64dd3187a781. Maybe we should raise it again for now while we implement a more scalable solution?