drpatelh closed 4 days ago
I think what happens here is that, since the task does not even start, it enters this branch and is reported as a non-recoverable (retryable) error.
Retry is already supported by setting process.errorStrategy = 'retry'
😘
» nextflow run pditommaso/hello -c awsbatch.config -ansi-log false
N E X T F L O W ~ version 24.05.0-edge
Launching `https://github.com/pditommaso/hello` [berserk_goldstine] DSL2 - revision: 5df6c0103b [master]
[38/6154b3] Submitted process > sayHello
[38/6154b3] NOTE: Process `sayHello` failed -- Execution is retried (1)
[1b/33cb9b] Re-submitted process > sayHello
ERROR ~ Error executing process > 'sayHello'
Caused by:
Task failed to start - CannotPullContainerError: Error response from daemon: unauthorized: access to the requested resource is not authorized
Command executed:
echo "Hello world!"
sleep 0
exit 0
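For reference, the retry setting mentioned above could be written in a config file roughly as follows. This is a minimal sketch, not the awsbatch.config used in the run above (its contents are not shown here), and the maxRetries value is an illustrative assumption.

```groovy
// Sketch of a Nextflow config enabling task retries.
process {
    errorStrategy = 'retry'   // re-submit a failed task instead of stopping the run
    maxRetries    = 3         // assumed value; by default only one retry is allowed
}
```

With the default limit, the log above is what one would expect: one re-submission, then the run stops with the original error.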
New feature
AWS Batch, by default, will attempt to schedule multiple tasks on the same VM. This is great from a cost/time perspective but can have some unwanted side effects.
We consistently see CannotPullContainerError errors, which we believe result from exhausting the boot disk on the VM (default: 30 GB) by attempting to source the Docker containers for all scheduled tasks.
The last 5 pipeline errors we see that were submitted on subsequent days:
An example of a failed pipeline can be found in the community/showcase Workspace on Seqera Platform: https://cloud.seqera.io/orgs/community/workspaces/showcase/watch/3EdFagC3x2Q6yK
I have uploaded the .nextflow.log file here for convenience: nf-3EdFagC3x2Q6yK.log
Usage scenario
Running NF pipelines on AWS Batch will pack multiple tasks in the same instance by default.
Suggest implementation
I believe the current NF retry strategy will not work to catch this particular error because the task doesn't generate an exit code and fails to start.
It would be great to automatically retry a task submission by catching the CannotPullContainerError error. If this error is too generic to catch, we could also look more specifically for "no space left on device".
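Until something like this exists, a config-level workaround might be a dynamic errorStrategy that retries a bounded number of times with a growing pause. The sketch below is only an assumption about what could help: it does not inspect the failure reason (neither CannotPullContainerError nor "no space left on device"), and the attempt limit and delays are arbitrary.

```groovy
// Sketch: bounded retries with a pause that doubles on each attempt.
process {
    errorStrategy = {
        if (task.attempt <= 3) {
            // wait before re-submitting; the delay doubles with each attempt
            sleep(Math.pow(2, task.attempt) * 1000 as long)
            return 'retry'
        }
        return 'terminate'   // give up once the attempt limit is exceeded
    }
    maxRetries = 3           // assumed limit, matching the check above
}
```

Because the closure retries on any failure, genuine task errors would be retried too. Matching on the specific error, as proposed above, would avoid that.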