Automatic retry with CannotPullContainerError

drpatelh commented 1 week ago

New feature

AWS Batch, by default, will attempt to schedule multiple tasks on the same VM. This is great from a cost/time perspective but can have some unwanted side effects.

We consistently see CannotPullContainerError errors, which we believe result from exhausting the boot disk on the VM (default: 30 GB) by attempting to source the Docker containers for all scheduled tasks.

The last 5 pipeline errors we have seen, from runs submitted on successive days:

Caused by:
  Task failed to start - CannotPullContainerError: write /var/lib/docker/tmp/GetImageBlob887491687: no space left on device

Caused by:
  Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/x86_64-conda-linux-gnu/sysroot/usr/lib64/locale/locale-archive.tmpl: no space left on device

Caused by:
  Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/bin/ripples-fast: no space left on device

Caused by:
  Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/share/doc/gettext/examples/hello-c++-kde/admin/am_edit: no space left on device

Caused by:
  Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/lib/libopenblasp-r0.3.10.so: no space left on device

An example of a failed pipeline can be found in the community/showcase Workspace on Seqera Platform: https://cloud.seqera.io/orgs/community/workspaces/showcase/watch/3EdFagC3x2Q6yK

I have uploaded the .nextflow.log file here for convenience: nf-3EdFagC3x2Q6yK.log

Usage scenario

When running Nextflow pipelines on AWS Batch, multiple tasks are packed onto the same instance by default.

Suggest implementation

I believe the current NF retry strategy will not work to catch this particular error because the task doesn't generate an exit code and fails to start.

It would be great to automatically retry a task submission when it fails with CannotPullContainerError. If this error is too generic to catch, we could also match more specifically on the "no space left on device" message.
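As a rough illustration of the matching suggested here, the check could look something like the Groovy sketch below. The helper name isRetryablePullError is hypothetical and is not part of Nextflow or the nf-amazon plugin; the only assumption is that the failure reason reported by AWS Batch is available as a plain string, as in the "Caused by" messages above.

```groovy
// Hypothetical helper, not an existing Nextflow API: decides whether a
// "task failed to start" reason looks like disk exhaustion during an image pull.
boolean isRetryablePullError(String reason) {
    if (reason == null) return false
    // Require both markers so that other pull failures (for example an
    // authorization error) are not retried blindly.
    return reason.contains('CannotPullContainerError') &&
           reason.contains('no space left on device')
}

println isRetryablePullError('Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/bin/ripples-fast: no space left on device')
```

Matching on both substrings keeps the check narrow; whether a retry would actually land on an instance with free disk space is a separate question that this sketch does not address.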

pditommaso commented 1 week ago

I think what happens here is that, since the task does not even start, it enters this branch and is reported as a non-recoverable (i.e. non-retryable) error.

https://github.com/nextflow-io/nextflow/blob/284a66063767b67504aba814d661375eae6fab2f/plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy#L270-L271

pditommaso commented 4 days ago

Retry is already supported by setting process.errorStrategy = 'retry' 😘

» nextflow  run pditommaso/hello -c awsbatch.config -ansi-log false
N E X T F L O W  ~  version 24.05.0-edge
Launching `https://github.com/pditommaso/hello` [berserk_goldstine] DSL2 - revision: 5df6c0103b [master]
[38/6154b3] Submitted process > sayHello
[38/6154b3] NOTE: Process `sayHello` failed -- Execution is retried (1)
[1b/33cb9b] Re-submitted process > sayHello
ERROR ~ Error executing process > 'sayHello'

Caused by:
  Task failed to start - CannotPullContainerError: Error response from daemon: unauthorized: access to the requested resource is not authorized

Command executed:

  echo "Hello world!"
  sleep 0
  exit 0
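
For reference, the retry setting mentioned in this comment could be written in a Nextflow config file roughly as follows. The contents of awsbatch.config are not shown in the thread, so this is only a sketch; the maxRetries value is an assumption chosen for illustration.

```groovy
// Sketch of a Nextflow config fragment; not the actual awsbatch.config used above.
process {
    // Re-submit a task when it fails instead of terminating the run
    errorStrategy = 'retry'
    // Assumed value for illustration: upper bound on re-submissions per task
    maxRetries    = 3
}
```

In the run above, the task is re-submitted once and then fails again with the same authorization error, which a retry cannot resolve.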
