nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

`available_cpu_cores()` is not quite right inside a CPU-limited container #1058

Closed · tsibley closed this 2 years ago

tsibley commented 2 years ago

Current Behavior

On the host:

$ python3 -c 'import augur.utils, os; print(augur.utils.available_cpu_cores(), len(os.sched_getaffinity(0)), os.cpu_count())'
8 8 8

In a CPU-limited container:

$ docker run --rm -it --cpus 4 nextstrain/base python3 -c 'import augur.utils, os; print(augur.utils.available_cpu_cores(), len(os.sched_getaffinity(0)), os.cpu_count())'
8 8 8

Given the --cpus 4 option, one would naively expect the output to be 4 4 4.

This discrepancy was spotted in actual usage by @corneliusroemer, in the context of an AWS Batch job with --cpus 8 that detected 96 cores for tree building. My diagnosis was:

Ah, good spotting. I think I know what happened here. I'd consider this a bug in Augur's CPU count detection. The job would have had access to only 8 CPUs worth of time (via --cpus 8), but was likely running on a 96 CPU host alongside other jobs. So I bet Augur's detecting the host's CPUs instead of what's available to it. I'm not sure if that's because Augur's call to os.sched_getaffinity(0)

https://github.com/nextstrain/augur/blob/40040eb00ca9e4a45ad42569a551ae9df8b3b155/augur/utils.py#L441-L444

is returning the "wrong" number (my guess is this is the case) or if it's failing and Augur falls back to os.cpu_count().
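For reference, the detection pattern described above boils down to something like the following minimal sketch (my paraphrase, not the exact code at the permalink): try the scheduler affinity first, then fall back to the total CPU count.

import os

def available_cpu_cores(fallback: int = 1) -> int:
    """Sketch of affinity-based CPU detection with a cpu_count() fallback."""
    try:
        # Number of CPUs this process may be scheduled on. Docker's --cpus
        # flag limits CPU *time* via a cgroup quota and leaves the affinity
        # mask untouched, so inside such a container this still reports the
        # host's core count.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity() doesn't exist on all platforms (e.g. macOS);
        # os.cpu_count() may return None, hence the fallback.
        return os.cpu_count() or fallback

That would explain the behavior above: the affinity call succeeds, but it answers a different question than the one posed by a quota-based CPU limit.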

Expected behavior

Not sure. Because these container limits constrain CPU time (via a quota) rather than restricting which CPU cores are visible, the reported numbers aren't technically wrong. But they're certainly wrong in practice: they will likely lead to oversubscribed processes and CPU contention.
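To make "CPU time is limited" concrete: with docker run --cpus, the limit lives in the cgroup CPU quota, not in the affinity mask, so the only place to read it is the cgroup filesystem. A rough sketch of what such detection could look like, assuming Linux with the usual cgroup v1/v2 mount points (illustrative only, not something Augur does today):

import math
from pathlib import Path
from typing import Optional

def cgroup_cpu_limit() -> Optional[float]:
    """Return the effective CPU limit implied by the cgroup quota, if any."""
    # cgroup v2: one file containing "<quota> <period>" or "max <period>".
    cpu_max = Path("/sys/fs/cgroup/cpu.max")
    if cpu_max.exists():
        quota, period = cpu_max.read_text().split()
        return None if quota == "max" else int(quota) / int(period)

    # cgroup v1: quota and period in separate files; quota is -1 when unset.
    quota_file = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
    period_file = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    if quota_file.exists() and period_file.exists():
        quota = int(quota_file.read_text())
        if quota > 0:
            return quota / int(period_file.read_text())

    return None

# Under `docker run --cpus 4` this yields 4.0; flooring it gives a usable
# process count (at least 1).
limit = cgroup_cpu_limit()
print(max(1, math.floor(limit)) if limit is not None else "no quota")

Even this is only a heuristic: a quota worth 4 CPUs of time doesn't pin the process to 4 cores, but it's a better sizing hint than the host's core count.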

Possible solution

Not sure about general solutions.

Workflows should, whenever possible, pass CPU counts around explicitly instead of relying on auto-detection (and if auto-detection is used, it should happen only at the very outermost workflow layer); a sketch of that pattern follows.
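As a hypothetical illustration of that recommendation (the wrapper script and file paths are invented for the example; only the augur tree flags are real): resolve the CPU count once at the entry point and hand it down explicitly.

import argparse
import os
import subprocess

def detect_cpus(fallback: int = 1) -> int:
    # Auto-detection happens only here, at the outermost layer, and only
    # when no explicit count was given.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or fallback

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--nthreads", type=int, default=None,
                        help="CPU count to use; auto-detected if omitted")
    args = parser.parse_args()
    nthreads = args.nthreads if args.nthreads is not None else detect_cpus()

    # Downstream steps receive the resolved count explicitly instead of
    # re-detecting cores inside a CPU-limited container. Paths are examples.
    subprocess.run(
        ["augur", "tree",
         "--alignment", "results/aligned.fasta",
         "--output", "results/tree.nwk",
         "--nthreads", str(nthreads)],
        check=True,
    )

if __name__ == "__main__":
    main()

On AWS Batch, for instance, the --nthreads value would then come from the job definition's vCPU setting rather than from detection inside the container.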

Related issues

tsibley commented 2 years ago

The discussion during triage was that, barring a robust way to detect limitations on CPU time, there's likely no way to address this inside Augur itself. Closing as "can't fix" for now. We can use this issue to collect references/examples of the problem cropping up (e.g. the monkeypox workflow fix above), and re-open if we ever find a way to fix it.