nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

`available_cpu_cores()` is not quite right inside a CPU-limited container #1058

Closed · tsibley closed this 2 years ago

tsibley commented 2 years ago

Current Behavior

On the host:

$ python3 -c 'import augur.utils, os; print(augur.utils.available_cpu_cores(), len(os.sched_getaffinity(0)), os.cpu_count())'
8 8 8

In a CPU-limited container:

$ docker run --rm -it --cpus 4 nextstrain/base python3 -c 'import augur.utils, os; print(augur.utils.available_cpu_cores(), len(os.sched_getaffinity(0)), os.cpu_count())'
8 8 8

Given the --cpus 4 option, one would naively expect the output to be 4 4 4.

This discrepancy was spotted in actual usage by @corneliusroemer, in the context of an AWS Batch job with --cpus 8 that detected 96 cores for tree building. My diagnosis was:

Ah, good spotting. I think I know what happened here. I'd consider this a bug in Augur's CPU count detection. The job would have had access to only 8 CPUs worth of time (via --cpus 8), but was likely running on a 96 CPU host alongside other jobs. So I bet Augur's detecting the host's CPUs instead of what's available to it. I'm not sure if that's because Augur's call to os.sched_getaffinity(0)

https://github.com/nextstrain/augur/blob/40040eb00ca9e4a45ad42569a551ae9df8b3b155/augur/utils.py#L441-L444

is returning the "wrong" number (my guess is this is the case) or if it's failing and Augur falls back to os.cpu_count().
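For reference, the detection pattern described above boils down to something like the following minimal sketch (my paraphrase, not the exact code at the permalink): try the scheduler affinity first, then fall back to the total CPU count.

import os

def available_cpu_cores(fallback: int = 1) -> int:
    """Sketch of affinity-based CPU detection with a cpu_count() fallback."""
    try:
        # Number of CPUs this process may be scheduled on. Docker's --cpus
        # flag limits CPU *time* via a cgroup quota and leaves the affinity
        # mask untouched, so inside such a container this still reports the
        # host's core count.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity() doesn't exist on all platforms (e.g. macOS);
        # os.cpu_count() may return None, hence the fallback.
        return os.cpu_count() or fallback

That would explain the behavior above: the affinity call succeeds, but it answers a different question than the one posed by a quota-based CPU limit.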

Expected behavior

Not sure. Because these container limits constrain CPU time (via a quota) rather than restricting which CPU cores are visible, the reported numbers aren't technically wrong. But they're certainly wrong in practice: they will likely lead to oversubscribed processes and CPU contention.
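To make "CPU time is limited" concrete: with docker run --cpus, the limit lives in the cgroup CPU quota, not in the affinity mask, so the only place to read it is the cgroup filesystem. A rough sketch of what such detection could look like, assuming Linux with the usual cgroup v1/v2 mount points (illustrative only, not something Augur does today):

import math
from pathlib import Path
from typing import Optional

def cgroup_cpu_limit() -> Optional[float]:
    """Return the effective CPU limit implied by the cgroup quota, if any."""
    # cgroup v2: one file containing "<quota> <period>" or "max <period>".
    cpu_max = Path("/sys/fs/cgroup/cpu.max")
    if cpu_max.exists():
        quota, period = cpu_max.read_text().split()
        return None if quota == "max" else int(quota) / int(period)

    # cgroup v1: quota and period in separate files; quota is -1 when unset.
    quota_file = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
    period_file = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    if quota_file.exists() and period_file.exists():
        quota = int(quota_file.read_text())
        if quota > 0:
            return quota / int(period_file.read_text())

    return None

# Under `docker run --cpus 4` this yields 4.0; flooring it gives a usable
# process count (at least 1).
limit = cgroup_cpu_limit()
print(max(1, math.floor(limit)) if limit is not None else "no quota")

Even this is only a heuristic: a quota worth 4 CPUs of time doesn't pin the process to 4 cores, but it's a better sizing hint than the host's core count.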

Possible solution

Not sure about general solutions.

Workflows should, whenever possible, pass CPU counts around explicitly instead of relying on auto-detection (and if auto-detection is used, it should happen only at the very outermost workflow layer); a sketch of that pattern follows.
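As a hypothetical illustration of that recommendation (the wrapper script and file paths are invented for the example; only the augur tree flags are real): resolve the CPU count once at the entry point and hand it down explicitly.

import argparse
import os
import subprocess

def detect_cpus(fallback: int = 1) -> int:
    # Auto-detection happens only here, at the outermost layer, and only
    # when no explicit count was given.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or fallback

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--nthreads", type=int, default=None,
                        help="CPU count to use; auto-detected if omitted")
    args = parser.parse_args()
    nthreads = args.nthreads if args.nthreads is not None else detect_cpus()

    # Downstream steps receive the resolved count explicitly instead of
    # re-detecting cores inside a CPU-limited container. Paths are examples.
    subprocess.run(
        ["augur", "tree",
         "--alignment", "results/aligned.fasta",
         "--output", "results/tree.nwk",
         "--nthreads", str(nthreads)],
        check=True,
    )

if __name__ == "__main__":
    main()

On AWS Batch, for instance, the --nthreads value would then come from the job definition's vCPU setting rather than from detection inside the container.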

Related issues

tsibley commented 2 years ago

The discussion during triage was that, barring a robust way to detect limitations on CPU time, there's likely no way to address this inside Augur itself. Closing as "can't fix" for now. We can use this issue to collect references/examples of the problem cropping up (e.g. the monkeypox workflow fix above), and re-open if we ever find a way to fix it.