nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and GenBank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Bump CPUs for fetch-and-ingest workflows #447

Closed · joverlee521 closed this 3 weeks ago

joverlee521 commented 3 weeks ago

I've only been bumping the memory, not the CPUs, for the fetch-and-ingest workflows. Might as well use all the compute that we are paying for. GenBank should be using c5.9xlarge and GISAID should be using c5.12xlarge, so I'm bumping the CPUs to match those instances.¹

Maybe this will magically help https://github.com/nextstrain/ncov-ingest/issues/446?

¹ https://aws.amazon.com/ec2/instance-types/c5/
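For illustration, a minimal sketch of the intended matching, assuming the workflows end up invoking `nextstrain build` with the AWS Batch runtime and the `--cpus` override discussed below; the exact workflow files and invocations in this repo may differ.

```sh
# Sketch only: match the requested vCPUs to the instance type each workflow targets.
# c5.9xlarge  = 36 vCPUs → GenBank fetch-and-ingest
# c5.12xlarge = 48 vCPUs → GISAID fetch-and-ingest
nextstrain build --aws-batch --cpus 36 .   # GenBank
nextstrain build --aws-batch --cpus 48 .   # GISAID
```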


joverlee521 commented 3 weeks ago

Last night's run finished under 12h 🎉 I'm going to dig into the logs a little bit, but at least this is an improvement, so I'm merging this ahead of today's run.

corneliusroemer commented 3 weeks ago

We could probably scrap those CPU limits altogether. All they do is make Snakemake restrict how many jobs run in parallel.

This has obvious downsides and only rare benefits.

The kernel's CPU scheduler figures out how to give every job some share of compute when oversubscribed. We don't really need Snakemake to schedule pessimistically.
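As a hedged illustration of the limit being discussed (generic Snakemake behavior, not this repo's exact invocation): Snakemake only starts a job when its declared `threads` fit within the `--cores` budget, so the budget directly caps parallelism.

```sh
# With many pending jobs that each declare "threads: 4":
snakemake --cores 8    # at most 2 such jobs run concurrently
snakemake --cores 48   # up to 12 run concurrently, matching a c5.12xlarge's 48 vCPUs
```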

joverlee521 commented 3 weeks ago

> We could probably scrap those CPU limits altogether.

For the AWS Batch runtime, the `--cpus` option is used to override the default `nextstrain-job` job definition, which requests only 4 CPUs.
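A sketch of that distinction, assuming the flag placement and 4-CPU default described in this thread (details may vary by CLI version):

```sh
# With the AWS Batch runtime, --cpus overrides the resources on the submitted
# job rather than only capping Snakemake's local scheduler.
nextstrain build --aws-batch .             # default nextstrain-job definition: 4 vCPUs
nextstrain build --aws-batch --cpus 48 .   # override: request 48 vCPUs from Batch
```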

tsibley commented 3 weeks ago

> The CPU scheduler figures out ways to give all jobs some share when oversubscribed.

Yes, progress will still be made under oversubscription, but the wall-clock time of the whole workflow will increase, sometimes substantially, depending on the kind of workload: for example, a long-running job on the critical path that has to share cores with many short jobs delays everything scheduled after it. It's still better not to oversubscribe when you can avoid it.