Open tsibley opened 2 years ago
> I expect this will cause oversubscription for the actual CPU time available to the Batch job, but it won't always be so because Snakemake is dealing in the number of processes active at one time within the container and the Batch job's CPU limitation is dealing in cumulative CPU time spent in a period…

Relevant for thinking about this mismatch: https://danluu.com/cgroup-throttling/
(Dan Luu's analyses/write-ups are generally great, and I'd seen and read this one before, but it came back up again for me over the weekend, and I recalled that it applied to this issue.)
> Require `--cpus` for the AWS Batch runner. Downside is this is a breaking change and erodes the runtime portability properties of `nextstrain build`'s interface.

This is my preference.
Hmm. Even if the other solution:

> Explicitly inform Snakemake of the Batch job definition's CPU allocation (`vcpus`), since that's what will apply in the absence of `--cpus`. Downside is this breaks boundaries between the `build` command and the Batch runner: the former is what informs Snakemake but the latter is what knows about Batch job definitions.

is possible without breaking boundaries (i.e. by redrawing boundaries)?
I'd be inclined to redraw the boundaries so that we can cleanly use the default from Batch for `--cpus`.
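Redrawing the boundaries might mean the Batch runner resolves the job definition's vCPU allocation itself and feeds it to the `build` command as the `--cpus` default. A hedged sketch of the extraction step, based on the shape of AWS Batch's `DescribeJobDefinitions` response — the function name is hypothetical and this is not actual nextstrain/cli code:

```python
def vcpus_from_job_definition(defn: dict):
    """Return the vCPU count from a Batch job definition dict, or None.

    Newer job definitions express CPUs in containerProperties
    .resourceRequirements (type "VCPU", value as a string); older ones
    use the deprecated top-level containerProperties.vcpus field.
    """
    props = defn.get("containerProperties", {})
    for req in props.get("resourceRequirements", []):
        if req.get("type") == "VCPU":
            return int(float(req["value"]))
    return props.get("vcpus")
```

The runner would obtain `defn` from boto3's `batch.describe_job_definitions(...)` and could then pass the result through to Snakemake as its `--cores` value.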
Current Behavior
In the absence of a `--cpus` option provided to `nextstrain build`, the AWS Batch runner uses the job definition's default CPU allocation (`vcpus`), which is not detectable by Snakemake. Instead, Snakemake detects and uses the number of CPUs on the entire Batch node.

I expect this will cause oversubscription of the actual CPU time available to the Batch job, but it won't always be so, because Snakemake is dealing in the number of processes active at one time within the container while the Batch job's CPU limitation is dealing in cumulative CPU time spent in a period (expressed relative to a single pegged CPU in that same period) across the whole container process tree. So I think oversubscription will mostly be an issue when all the Snakemake processes peg their CPUs at the same time (e.g. alignment, tree building).
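The quota that actually throttles the job is visible from inside the container via the cgroup interface. A sketch of how the effective CPU count could be derived from it instead of from `os.cpu_count()` — assuming the conventional cgroup v2 (`cpu.max`) and v1 (`cpu.cfs_quota_us`/`cpu.cfs_period_us`) paths; this is illustrative, not what Snakemake or the CLI does today:

```python
import os

def parse_cpu_max(text: str):
    """Parse cgroup v2 cpu.max contents: '<quota> <period>' or 'max <period>'.

    Returns the CPU limit as a float, or None when unthrottled.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)

def cgroup_cpu_limit():
    """Read this process's CPU quota, trying cgroup v2 then v1.

    Paths are the conventional mount points; returns None when no quota
    applies or the files aren't readable (e.g. outside a container).
    """
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:  # cgroup v2
            return parse_cpu_max(f.read())
    except OSError:
        pass
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:  # cgroup v1
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        return None if quota < 0 else quota / period  # -1 means no limit
    except OSError:
        return None

def effective_cpus():
    """CPU count a scheduler should plan for: the cgroup quota when one
    is set, otherwise the host count that os.cpu_count() reports."""
    limit = cgroup_cpu_limit()
    host = os.cpu_count() or 1
    return host if limit is None else max(1, min(host, int(limit)))
```

As the Dan Luu piece linked above discusses, even this is approximate: the quota is over cumulative CPU time per period, so a burst of threads can still hit throttling below the nominal CPU count.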
Expected behavior
Snakemake is limited to the same number of CPUs that are nominally provisioned by the Batch job definition when `--cpus` isn't provided.

How to reproduce
Run this Snakefile under various conditions on AWS Batch and observe that `workflow.cores` reports the number of CPUs of the underlying Batch node if you pass `--cores all`.
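The Snakefile referenced above isn't reproduced in this report; as a stand-in, a minimal Snakefile that just reports the core count Snakemake resolved might look like the following (illustrative only, not the original):

```
# Minimal illustrative Snakefile (not the one from the original report).
# Run with `snakemake --cores all` and compare the printed value against
# the Batch job definition's vcpus.
rule report_cores:
    run:
        print(f"workflow.cores = {workflow.cores}")
```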
Possible solution

1. Explicitly inform Snakemake of the Batch job definition's CPU allocation (`vcpus`), since that's what will apply in the absence of `--cpus`. Downside is this breaks boundaries between the `build` command and the Batch runner: the former is what informs Snakemake but the latter is what knows about Batch job definitions.

2. Require `--cpus` for the AWS Batch runner. Downside is this is a breaking change and erodes the runtime portability properties of `nextstrain build`'s interface.

Workaround
Always specify `--cpus` for `nextstrain build` when using AWS Batch, as this both reserves CPU time for the Batch job and informs Snakemake of the correct `--cores` number.

Additional context
Realized this was happening when implementing #179.
Solution possibly related to https://github.com/nextstrain/cli/issues/144. It at least will touch the same code.