New feature
When using Azure Batch in Nextflow, the azure.retryPolicy.* configuration options already exist for controlling the behaviour of Azure API request retries: https://www.nextflow.io/docs/latest/config.html#config-azure. These options are applied within some of the retry policy code in the nf-azure plugin (e.g., in AzBatchTaskHandler).
However, these options do not appear to affect the retry policies for connections to blob storage, so the Azure SDK defaults are used instead. We have encountered TimeoutExceptions when attempting to download data from blob storage, as described in https://github.com/nextflow-io/nextflow/issues/5067
This feature request is to make the retry policies for Azure blob storage connections configurable, in addition to the other Azure API requests made from Nextflow.
Usage scenario
The main usage scenario is to enable adjustment of Azure blob storage retry policies to help address connection issues (in particular, https://github.com/nextflow-io/nextflow/issues/5067). These could be exposed as Nextflow configuration options in the azure scope, either by:
Re-using the existing azure.retryPolicy.* options, but also applying them to Azure blob storage connections
Or, creating a second set of azure.storage.retryPolicy.* options just for configuring blob storage retry policies
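As a sketch, the second approach might look like this in nextflow.config. Note that these azure.storage.retryPolicy.* option names do not exist yet and are purely illustrative; they mirror the existing azure.retryPolicy.* settings (maxAttempts, delay, maxDelay):

```groovy
// Hypothetical configuration sketch -- the azure.storage.retryPolicy.*
// scope does not exist yet; names mirror the existing azure.retryPolicy.*
azure {
    storage {
        retryPolicy {
            maxAttempts = 10    // maximum number of tries per request
            delay = '250ms'     // initial delay between retries
            maxDelay = '90s'    // upper bound on the retry delay
        }
    }
}
```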
Suggested implementation
Currently, the AzHelper class is used to create and configure different instances of BlobServiceClient: https://github.com/nextflow-io/nextflow/blob/f3a86def48d79a77770fec2b71f3fdc396d75dc9/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzHelper.groovy#L215-L218
This code can be adapted to configure custom retry policies for blob storage by adding a retryOptions() call to the builder:
import com.azure.storage.blob.BlobServiceClientBuilder
import com.azure.storage.common.policy.RequestRetryOptions
import com.azure.storage.common.policy.RetryPolicyType

return new BlobServiceClientBuilder()
    .credential(credential)
    .endpoint(endpoint)
    // New: apply a custom retry policy to blob storage requests
    .retryOptions(new RequestRetryOptions(RetryPolicyType.EXPONENTIAL, maxTries, tryTimeoutInSeconds, null, null, null))
    .buildClient()
This makes use of the RequestRetryOptions class in the Azure Java SDK. The values to pass can be derived from the Nextflow configuration.
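To illustrate how maxTries and the retry delays would interact, here is a sketch of a generic capped exponential backoff schedule of the kind RetryPolicyType.EXPONENTIAL implements. This is not the Azure SDK's internal code, and the SDK's exact formula may differ slightly:

```java
// Sketch of a capped exponential backoff schedule -- illustrative only,
// not the Azure SDK's actual retry implementation.
public class BackoffSketch {
    // Delay before attempt n (1-based): base * 2^(n-1), capped at maxMs
    static long delayMs(int attempt, long baseMs, long maxMs) {
        long d = baseMs * (1L << (attempt - 1));
        return Math.min(d, maxMs);
    }

    public static void main(String[] args) {
        int maxTries = 5;
        long baseMs = 250, maxMs = 4000;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                    + delayMs(attempt, baseMs, maxMs) + " ms");
        }
    }
}
```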
One thing I am unsure about with regards to the Azure SDK is this bit of text from the RequestRetryOptions:
tryTimeoutInSeconds - Optional. Specified the maximum time allowed before a request is cancelled and assumed failed, default is Integer#MAX_VALUE s. This value should be based on the bandwidth available to the host machine and proximity to the Storage service, a good starting point may be 60 seconds per MB of anticipated payload size.
In my tests in https://github.com/nextflow-io/nextflow/issues/5067 I set this value to 60 seconds and it seemed to solve our problem with timeouts when downloading the .exitcode file at the end of a process run in Azure. However, the above documentation suggests this should be adjusted based on the payload size. So, I'm not sure if it's best for the suggested implementation above to leave this as the default Integer.MAX_VALUE since the payload size may be variable. However, would this mean that retrying the connection would not occur then (since it's never going to hit the Integer.MAX_VALUE seconds)? I'm still a bit uncertain about how this all works.
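For what it's worth, my reading of the documentation is that tryTimeoutInSeconds bounds a single attempt, and an attempt that exceeds it is treated as a failed try that can still be retried (up to maxTries), so lowering it should make retries happen sooner rather than prevent them. A rough sketch of that semantics, as I understand it (this is not the SDK's actual code):

```java
import java.util.concurrent.*;

// Rough sketch of "per-try timeout + retry" semantics: each attempt gets
// its own timeout, and a timed-out attempt counts as a failure that is
// retried. Illustrative only -- not Azure SDK code.
public class RetryWithTryTimeout {
    static <T> T retry(Callable<T> op, int maxTries, long tryTimeoutMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int attempt = 1; attempt <= maxTries; attempt++) {
                Future<T> f = pool.submit(op);
                try {
                    // Bound this single attempt by the per-try timeout
                    return f.get(tryTimeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    f.cancel(true);            // cancel the slow attempt...
                    if (attempt == maxTries)   // ...and retry, unless out of tries
                        throw e;
                }
            }
            throw new IllegalStateException("unreachable");
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // First two attempts are slow and time out; the third is fast
        String r = retry(() -> {
            if (++calls[0] < 3) Thread.sleep(5000);
            return "ok";
        }, 4, 200);
        System.out.println(r + " after " + calls[0] + " attempts");
    }
}
```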