nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Allow configuration of additional Azure retry options for connection to blob storage in nf-azure #5097

Closed apetkau closed 3 months ago

apetkau commented 3 months ago

New feature

When using Azure Batch in Nextflow, there currently exist the azure.retryPolicy.* options for configuring the behaviour of Azure API request retries: https://www.nextflow.io/docs/latest/config.html#config-azure

These options are applied within some of the retry policy code in the nf-azure plugin (e.g., in AzBatchTaskHandler).

However, these options do not appear to affect the retry policies for connections to blob storage, which means the defaults from the Azure SDK are used. We have encountered TimeoutExceptions when attempting to download data from blob storage, as described in https://github.com/nextflow-io/nextflow/issues/5067

This new feature request is to enable configuration of some of the retry policies for Azure blob storage connections in addition to other Azure API requests from Nextflow.

Usage scenario

The main usage scenario is to enable adjustment of the Azure blob storage retry policies to help address connection issues (in particular, https://github.com/nextflow-io/nextflow/issues/5067). These could be exposed as Nextflow configuration options in the azure scope.
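For reference, the existing azure.retryPolicy.* options are set in nextflow.config like this (the values shown are illustrative, not the defaults; see the linked configuration docs):

```groovy
azure {
    retryPolicy {
        maxAttempts = 10    // maximum number of retry attempts
        delay = '250ms'     // initial delay between retries
        maxDelay = '90s'    // upper bound on the retry delay
        jitter = 0.25       // random jitter factor applied to the delay
    }
}
```

This feature request is essentially about having these (or similar) options also apply to the blob storage client.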

Suggested implementation

Currently, the AzHelper class is used to create and configure different instances of BlobServiceClient:

https://github.com/nextflow-io/nextflow/blob/f3a86def48d79a77770fec2b71f3fdc396d75dc9/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzHelper.groovy#L215-L218

This code can be adapted to configure custom retry policies for blob storage by adding a retryOptions() call to the builder chain:

return new BlobServiceClientBuilder()
        .credential(credential)
        .endpoint(endpoint)
        // New: configure custom retry options instead of the SDK defaults
        .retryOptions(new RequestRetryOptions(RetryPolicyType.EXPONENTIAL, maxTries, tryTimeoutInSeconds, null, null, null))
        .buildClient()

This makes use of an instance of RequestRetryOptions from the Azure Java SDK. The values passed to it can be derived from the Nextflow configuration.
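A minimal sketch of how the configuration values might be threaded through to the client builder. The helper method and the config accessors (retryConfig(), maxAttempts(), tryTimeout()) are hypothetical names for illustration, not the actual nf-azure API:

```groovy
import com.azure.storage.blob.BlobServiceClient
import com.azure.storage.blob.BlobServiceClientBuilder
import com.azure.storage.common.policy.RequestRetryOptions
import com.azure.storage.common.policy.RetryPolicyType

// Hypothetical helper: build a blob client whose retry behaviour comes from
// the Nextflow azure.retryPolicy.* configuration rather than the SDK defaults
static BlobServiceClient createBlobServiceClient(credential, String endpoint, config) {
    // Read retry settings from the Nextflow config (accessor names are illustrative)
    final Integer maxTries = config.retryConfig().maxAttempts()
    final Integer tryTimeoutInSeconds = config.retryConfig().tryTimeout()

    // Six-arg constructor: policy type, max tries, per-try timeout,
    // retry delay, max retry delay, secondary host (nulls fall back to defaults)
    final retryOpts = new RequestRetryOptions(
            RetryPolicyType.EXPONENTIAL,
            maxTries,
            tryTimeoutInSeconds,
            null, null, null)

    return new BlobServiceClientBuilder()
            .credential(credential)
            .endpoint(endpoint)
            .retryOptions(retryOpts)
            .buildClient()
}
```

Passing null for the delay arguments keeps the SDK's default backoff behaviour while only overriding the try count and per-try timeout.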

One thing I am unsure about with regard to the Azure SDK is this note from the RequestRetryOptions documentation:

tryTimeoutInSeconds - Optional. Specifies the maximum time allowed before a request is cancelled and assumed failed, default is Integer#MAX_VALUE s. This value should be based on the bandwidth available to the host machine and proximity to the Storage service, a good starting point may be 60 seconds per MB of anticipated payload size.

In my tests in https://github.com/nextflow-io/nextflow/issues/5067 I set this value to 60 seconds, which appeared to solve our timeouts when downloading the .exitcode file at the end of a process run on Azure. However, the documentation above suggests the value should scale with the anticipated payload size, so I'm not sure whether the suggested implementation should keep the default of Integer.MAX_VALUE, given that payload sizes vary. On the other hand, would keeping the default mean a hung request is effectively never retried, since the per-try timeout of Integer.MAX_VALUE seconds would never be reached? I'm still a bit uncertain about how this all works.
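To make the interaction concrete: as I understand it, tryTimeoutInSeconds caps a single attempt, while maxTries and the delay settings govern how attempts are spaced. A rough sketch of an exponential backoff schedule (illustrative only, not the SDK's exact formula):

```groovy
// Illustrative exponential backoff: attempt n waits roughly
// baseDelay * 2^(n-1) before retrying, capped at maxDelay
def baseDelayMs = 250
def maxDelayMs = 90_000
(1..5).each { attempt ->
    def delayMs = Math.min((long) (baseDelayMs * Math.pow(2, attempt - 1)), maxDelayMs)
    println "attempt ${attempt}: wait ${delayMs} ms before retrying"
}
```

With tryTimeoutInSeconds left at Integer.MAX_VALUE, a stalled download would tie up its attempt indefinitely, so this schedule would never kick in for hung (as opposed to failed) requests.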

pditommaso commented 3 months ago

I've made #5098 to address this issue. Regarding tryTimeoutInSeconds, I believe it's preferable to keep the default setting.

apetkau commented 3 months ago

@pditommaso thanks so much

pditommaso commented 3 months ago

You are welcome