nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

AWS batch "too many requests" errors are not retriable #3078

Open jwarwick-delfi opened 1 year ago

jwarwick-delfi commented 1 year ago

Hello. Regarding this line and the similar logic in the functions around it:

https://github.com/nextflow-io/nextflow/blob/b099d430aa06b350ff63b6f3ae291dd72f49c779/plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy#L767

We have seen quite a few TooManyRequests exceptions when Batch is under high load. They should be recoverable / retryable, but nextflow crashes out because the error code is 429.

Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 4ba5587d-f670-4c0e-986f-a24046298d69; Proxy: null)

bentsherman commented 1 year ago

To my understanding, the AWS Java SDK (used by Nextflow) automatically retries requests with 429 error code:

https://docs.aws.amazon.com/general/latest/gr/api-retries.html

So if you are getting this error from Nextflow then it means that the request has already been retried at least a few times. I don't think Nextflow currently allows these retry settings to be changed via nextflow.config, so it would be good to add those settings so that you can experiment with them.

For now, you can adjust executor.submitRateLimit and executor.pollInterval so that Nextflow calls the AWS Batch API less frequently.
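
As a rough sketch of that workaround, the two settings could go into a small config override that is passed to the run. The file name `batch-throttle.config` and the values are illustrative assumptions, not recommendations; the right numbers depend on how many Nextflow instances share the account's Batch API quota.

```shell
# Write a config override that makes Nextflow call the AWS Batch API less often.
# Values are illustrative assumptions, not tuned recommendations.
cat > batch-throttle.config <<'EOF'
executor {
    // submit at most 20 jobs per minute
    submitRateLimit = '20/1min'
    // poll job status every 30 seconds
    pollInterval = '30 sec'
}
EOF

# Then pass it to the run (main.nf is a hypothetical pipeline script):
#   nextflow run main.nf -c batch-throttle.config
```

This only lowers the request rate of each individual Nextflow instance; it does not coordinate between instances running at the same time.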

jwarwick-delfi commented 1 year ago

Thanks, it would be good to have more flexibility here. We are running hundreds of nextflow instances simultaneously, so the rate limiting is only getting us so far.

bentsherman commented 1 year ago

Some notes after looking into this issue:

So I think we can implement some or all of the following options:

I don't see any way to provide a numerical value for the jitter. Also not sure if we should support the retry mode.

bentsherman commented 10 months ago

Nextflow has a config option, aws.batch.retryMode (docs), whose default mode respects rate-limiting responses. However, this setting is only applied to the AWS CLI commands run by tasks, not to the AWS SDK used by Nextflow itself.

For Nextflow, you should set AWS_RETRY_MODE=standard in your launch environment. Let me know if that helps.
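
A minimal sketch of what that could look like in the shell that launches Nextflow. Only `AWS_RETRY_MODE=standard` comes from the suggestion above; `AWS_MAX_ATTEMPTS` is the AWS SDK's general setting for the attempt count and is added here as an assumption, with an arbitrary value. Whether either variable takes effect depends on the SDK version bundled with your Nextflow release.

```shell
# Set in the environment that launches Nextflow, not in nextflow.config.
export AWS_RETRY_MODE=standard

# Assumption, not part of the suggestion above: raise the SDK's attempt count.
# The value 10 is arbitrary.
export AWS_MAX_ATTEMPTS=10

# Then launch as usual (main.nf is a hypothetical pipeline script):
#   nextflow run main.nf
```

The variables have to be exported before the `nextflow` command starts, since the SDK reads them from the environment of the Nextflow process.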