nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io

[Azure Batch]: TimeoutException when Nextflow attempts to read files following task completion for large-scale analyses in Azure #5067

Open apetkau opened 2 weeks ago

apetkau commented 2 weeks ago

Bug report

We are using Nextflow with Azure Batch to process collections of microbial genomes (whole-genome sequence data). We have begun testing larger collections of genomes and have been encountering issues where some tasks run by Nextflow fail with a java.util.concurrent.TimeoutException. This primarily occurs when attempting to read a task's .exitcode file from blob storage, which causes Nextflow to fail the task and report Integer.MAX_VALUE as the exit code. For example (see below for more context around the error message):

Jun... [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Cannot read exit status for task: `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` | java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun... [TaskFinalizer-10] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...); work-dir=az://...
  error [nextflow.exception.ProcessFailedException]: Process `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` terminated for an unknown reason -- Likely it has been terminated by the external system

This behavior occurs only when scaling up to many genomes, and only impacts some of our runs. It also seems to hit random tasks/processes (in the case above it is QUAST, but it occurs for other processes across the full pipeline). The pipeline we are using is https://github.com/phac-nml/mikrokondo/. I have observed it in other pipelines, but much less frequently. I believe it shows up in this pipeline because it does a lot of processing and may take up to 2 hours to process a genome.

Expected behavior and actual behavior

I would expect all of our pipeline executions to complete successfully and for there to be no TimeoutExceptions when reading outputs of a task (e.g., the .exitcode file in Azure blob storage).

Steps to reproduce the problem

As this is an issue that occurs mainly with large-scale analysis of genomes within Azure, and does not happen every time, it is a bit more difficult to provide a specific set of steps to reproduce the issue. However, here is a rough sketch:

  1. Use pipeline: https://github.com/phac-nml/mikrokondo (version 0.2.1)
  2. Run multiple instances of this pipeline using Azure Batch, perhaps 10 to 20, one instance per microbial genome (whole-genome sequences); a minimal configuration sketch is shown after this list.
    1. At the rate I have observed, ~10% of the runs fail with the exception below.
    2. It is important to use real-world sized genomes (e.g., from NCBI). The specific data I am testing with is a selection of Illumina paired-end sequence reads of Salmonella genomes from BioProject PRJNA305824.
  3. Monitor the runs. Roughly 10% will fail with the timeout exceptions below, and a failure may take 2 hours or more to appear. I have attempted to test on smaller datasets that run quickly, but it's much harder to reproduce failures in that case.
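
For reference, here is a minimal sketch of the Nextflow configuration for this kind of Azure Batch run (account names, keys, region, and work directory are placeholders, not our actual settings):

    // nextflow.config (placeholder values)
    process.executor = 'azurebatch'
    workDir = 'az://my-container/work'

    azure {
        storage {
            accountName = '<storage account name>'
            accountKey  = '<storage account key>'
        }
        batch {
            location     = '<region>'
            accountName  = '<batch account name>'
            accountKey   = '<batch account key>'
            autoPoolMode = true
        }
    }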

Program output

I am unfortunately unable to share the full nextflow.log, but here is the relevant section.

Jun... [reactor-http-epoll-2] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
    at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
    at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
Jun... [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Cannot read exit status for task: `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` | java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
Jun... [pool-11-thread-1] INFO  c.m.a.adal4j.AuthenticationAuthority - [Correlation ID: ...] Instance discovery was successful
Jun... [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 18; name: MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...); status: COMPLETED; exit: -; error: -; workDir: az://...]
Jun... [TaskFinalizer-10] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...); work-dir=az://...
  error [nextflow.exception.ProcessFailedException]: Process `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` terminated for an unknown reason -- Likely it has been terminated by the external system

What I have observed is that the exception below prints out 3 times, and then on the fourth occurrence the task fails:

Jun... [reactor-http-epoll-2] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
    at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
    at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)

Note that the default timeout is 60,000 milliseconds (1 minute), but we modified it to be 600,000 milliseconds (10 minutes) by adjusting the AZURE_REQUEST_RESPONSE_TIMEOUT environment variable as described here: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/core/azure-core/README.md#http-timeouts. This didn't fix the problem, though; it just increased the time before failure.

Environment

Additional context

We have spent a bit of time trying to identify the cause of this issue, but haven't been able to isolate and address it. However, we have some additional information on the locations in code where this is happening. I'm providing this information in case it is useful, but please feel free to skip the below (I might be wrong in some of this too).

1. TimeoutException failing a task

The TimeoutException that finally triggers the task failure is logged in the following line:

Jun... [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Cannot read exit status for task: `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` | java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.

This occurs in this section of Nextflow code (the AzBatchTaskHandler):

https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy#L170-L178

That is, it's attempting to read the exit status via the exitFile.text property. The exitFile is of type AzPath, which implements Path.
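
Paraphrasing the linked section (this is a reconstruction based on the log message, not the verbatim source), the handler does roughly the following, which is also where the Integer.MAX_VALUE exit code comes from:

    protected Integer readExitFile() {
        try {
            exitFile.text as Integer
        }
        catch( Exception e ) {
            // produces the "Cannot read exit status" debug message shown in the log above
            log.debug "[AZURE BATCH] Cannot read exit status for task: `$task.name` | ${e.message}"
            return Integer.MAX_VALUE
        }
    }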

2. Reading from Path.text in Azure

I am not as familiar with how Groovy ultimately handles Path.text to read and return the contents of a file as text, but for Azure paths in Nextflow I think it ultimately runs this bit of code in AzFileSystem to open a stream to the file in blob storage:

https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy#L201-L217
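
As far as I can tell, the resolution chain for exitFile.text is roughly the following (a sketch; the az:// path only resolves once the nf-azure file system provider has been initialised by Nextflow):

    // Groovy's Path.text is provided by NioExtensions.getText(Path), which goes through:
    //   NioExtensions.getText(path)
    //     -> Files.newBufferedReader(path)
    //       -> Files.newInputStream(path) / FileSystemProvider.newInputStream(path)
    //         -> AzFileSystemProvider.newByteChannel(path)
    //           -> AzFileSystem.newReadableByteChannel(path)
    //             -> BlobClient.getProperties() and BlobClient.openInputStream()
    def exitFile = java.nio.file.Paths.get(new URI('az://my-container/work/xx/yyyy/.exitcode'))
    def exitStatus = exitFile.text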

3. BlobClient.openInputStream()

Line 208 above runs client.openInputStream(). The client object is of type BlobClient, which extends BlobClientBase, and openInputStream() is defined in the Azure Java SDK as follows:

    /**
     * Opens a blob input stream to download the blob.
     *
     * @return An <code>InputStream</code> object that represents the stream to use for reading from the blob.
     * @throws BlobStorageException If a storage service error occurred.
     */
    public BlobInputStream openInputStream() {
        return openInputStream((BlobRange) null, null);
    }

https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/storage/azure-storage-blob/src/main/java/com/azure/storage/blob/specialized/BlobClientBase.java#L292-L300

4. Azure Java SDK HTTP Pipeline

The Azure Java SDK seems to use an HTTP Pipeline for defining different steps to take when handling an API request to Azure: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#http-pipeline

In particular, there are configurable retry policies that are part of the Azure SDK: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#common-http-pipeline-policies

5. Azure/HTTP Pipeline retry policies

This brings us to the other part of the exception in the Nextflow log files:

Jun... [reactor-http-epoll-2] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
    at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
    at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)

This appears to be triggered by this code:

    void responseTimedOut(ChannelHandlerContext ctx) {
        if (!closed) {
            disposeResponseTimeoutWatcher();
            ctx.fireExceptionCaught(
                new TimeoutException("Channel response timed out after " + responseTimeoutMillis + " milliseconds."));
            ctx.close();
            closed = true;
        }
    }

https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/core/azure-core-http-netty/src/main/java/com/azure/core/http/netty/implementation/AzureSdkHandler.java#L199-L207

6. Nextflow retry policies

I do know there are azure.retryPolicy.* parameters that can be adjusted in the Nextflow config: https://www.nextflow.io/docs/latest/config.html#config-azure. We have tried adjusting them, but I'm guessing that when reading the .exitcode file from Azure blob storage these policies aren't applied, so the defaults configured within the Azure SDK are used instead. I'm not sure whether this is expected to cause any issues.
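
For reference, this is the kind of configuration we tried (the values here are illustrative, not the exact ones we used):

    // nextflow.config -- azure.retryPolicy settings from the Nextflow docs;
    // they appear to apply to Batch API calls rather than to blob storage reads
    azure {
        retryPolicy {
            maxAttempts = 10
            delay = '500ms'
            maxDelay = '120s'
        }
    }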

Also, some of the above description of the code might be incorrect. I have been running up against my own lack of knowledge of how everything works, which makes it hard to debug.

apetkau commented 2 weeks ago

I forgot to add that while most of the above TimeoutExceptions occur when attempting to read the .exitcode file, sometimes they have occurred when attempting to get the task trace record. Below is an exception from a different run than the one the exception above came from:

Jun... [TaskFinalizer-1] DEBUG nextflow.processor.TaskHandler - Unable to get task trace record -- cause: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
    at reactor.core.Exceptions.propagate(Exceptions.java:396)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:98)
    at reactor.core.publisher.Mono.block(Mono.java:1742)
    at com.azure.storage.common.implementation.StorageImplUtils.blockWithOptionalTimeout(StorageImplUtils.java:147)
    at com.azure.storage.blob.specialized.BlobClientBase.getPropertiesWithResponse(BlobClientBase.java:1379)
    at com.azure.storage.blob.specialized.BlobClientBase.getProperties(BlobClientBase.java:1348)
    at nextflow.cloud.azure.nio.AzFileSystem.newReadableByteChannel(AzFileSystem.groovy:207)
    at nextflow.cloud.azure.nio.AzFileSystemProvider.newByteChannel(AzFileSystemProvider.groovy:407)
    at java.base/java.nio.file.Files.newByteChannel(Files.java:380)
    at java.base/java.nio.file.Files.newByteChannel(Files.java:432)
    at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422)
    at java.base/java.nio.file.Files.newInputStream(Files.java:160)
    at java.base/java.nio.file.Files.newBufferedReader(Files.java:2922)
    at org.apache.groovy.nio.extensions.NioExtensions.newReader(NioExtensions.java:1436)
    at org.apache.groovy.nio.extensions.NioExtensions.getText(NioExtensions.java:400)
    at nextflow.trace.TraceRecord.parseTraceFile(TraceRecord.groovy:422)
    at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
    at nextflow.processor.TaskHandler.getTraceRecord(TaskHandler.groovy:217)
    at nextflow.cloud.azure.batch.AzBatchTaskHandler.getTraceRecord(AzBatchTaskHandler.groovy:189)
    at nextflow.processor.TaskHandler.safeTraceRecord(TaskHandler.groovy:160)
    at nextflow.Session.notifyTaskComplete(Session.groovy:1044)
    at nextflow.processor.TaskPollingMonitor.finalizeTask(TaskPollingMonitor.groovy:679)
    at jdk.internal.reflect.GeneratedMethodAccessor223.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
    at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
    at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
    at nextflow.processor.TaskPollingMonitor$_checkTaskStatus_lambda8.doCall(TaskPollingMonitor.groovy:666)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
    Suppressed: java.lang.Exception: #block terminated with an error
        at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:100)
        ... 37 common frames omitted
Caused by: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
    at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
    at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 common frames omitted
    Suppressed: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
        at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
        at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)
        ... 4 common frames omitted
    Suppressed: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
        at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
        at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)
        ... 4 common frames omitted
    Suppressed: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
        ... 12 common frames omitted

Here there is more information in the stack trace, including confirmation that Path.text does ultimately trigger AzFileSystem.newReadableByteChannel():

    at nextflow.cloud.azure.nio.AzFileSystem.newReadableByteChannel(AzFileSystem.groovy:207)
    at nextflow.cloud.azure.nio.AzFileSystemProvider.newByteChannel(AzFileSystemProvider.groovy:407)
    at java.base/java.nio.file.Files.newByteChannel(Files.java:380)
    at java.base/java.nio.file.Files.newByteChannel(Files.java:432)
    at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422)
    at java.base/java.nio.file.Files.newInputStream(Files.java:160)
    at java.base/java.nio.file.Files.newBufferedReader(Files.java:2922)
    at org.apache.groovy.nio.extensions.NioExtensions.newReader(NioExtensions.java:1436)
    at org.apache.groovy.nio.extensions.NioExtensions.getText(NioExtensions.java:400)
    at nextflow.trace.TraceRecord.parseTraceFile(TraceRecord.groovy:422)

That is, here is TraceRecord.parseTraceFile():

https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy#L420-L424

And here is AzFileSystem.newReadableByteChannel():

https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy#L201-L207

In this case, the exception appears to occur when getting the properties of the file (which might be why the stack trace is different).

pditommaso commented 1 week ago

Thanks for the detailed report. You are right, there's no retry on the Nextflow side, since requests are expected to be retried by the underlying Azure SDK.

It looks more like an issue in the SDK implementation. Not sure how to address it.

apetkau commented 1 week ago

Thanks so much for the response @pditommaso 😄

We are still working out the exact cause of the issue and a solution for it. Here is a bit of an update in case what we've tried is useful to others.

Increased Azure HTTP Logging

We increased the Azure HTTP logging by setting AZURE_HTTP_LOG_DETAIL_LEVEL=HEADERS (as described in https://learn.microsoft.com/en-us/azure/developer/java/sdk/logging-overview#configure-http-logging-with-an-environment-variable). This gives us a bit more detail in the Nextflow logs:

Jun-18 17:52:30.142 [Task monitor] DEBUG n.cloud.azure.batch.AzBatchService - Azure TooManyRequests response error - attempt: 1; reason: java.net.SocketTimeoutException: timeout
Jun-18 17:54:30.429 [Task monitor] DEBUG n.cloud.azure.batch.AzBatchService - Azure TooManyRequests response error - attempt: 2; reason: java.net.SocketTimeoutException: timeout
Jun-18 17:54:30.532 [Task monitor] INFO  c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP request","method":"HEAD","url":"[...].exitcode","tryCount":1, ...}
Jun-18 18:04:30.577 [reactor-http-epoll-1] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun-18 18:04:30.605 [reactor-http-epoll-1] WARN  c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP FAILED","exception":"Channel response timed out after 600000 milliseconds.", ...}
Jun-18 18:04:34.609 [parallel-3] INFO  c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP request","method":"HEAD","url":"[...].exitcode","tryCount":2}
Jun-18 18:14:34.712 [reactor-http-epoll-4] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun-18 18:14:34.714 [reactor-http-epoll-4] WARN  c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP FAILED","exception":"Channel response timed out after 600000 milliseconds.", ...}
Jun-18 18:14:46.716 [parallel-4] INFO  c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP request","method":"HEAD","url":"[...].exitcode","tryCount":3, ...}
Jun-18 18:24:46.732 [reactor-http-epoll-2] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun-18 18:24:46.734 [reactor-http-epoll-2] WARN  c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP FAILED","exception":"Channel response timed out after 600000 milliseconds.", ...}
Jun-18 18:25:14.736 [parallel-1] INFO  c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP request","method":"HEAD","url":"[...].exitcode","tryCount":4, ...}
Jun-18 18:35:14.837 [reactor-http-epoll-3] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun-18 18:35:14.838 [reactor-http-epoll-3] WARN  c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP FAILED","exception":"Channel response timed out after 600000 milliseconds.", ...}
Jun-18 18:35:14.844 [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Cannot read exit status for task: `PHACNML_IRIDANEXT:IRIDANEXT:CUSTOM_DUMPSOFTWAREVERSIONS (1)` | java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
Jun-18 18:35:14.992 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 6; name: PHACNML_IRIDANEXT:IRIDANEXT:CUSTOM_DUMPSOFTWAREVERSIONS (1); status: COMPLETED; exit: -; error: -; workDir: az://...]
Jun-18 18:35:15.024 [TaskFinalizer-6] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=PHACNML_IRIDANEXT:IRIDANEXT:CUSTOM_DUMPSOFTWAREVERSIONS (1); work-dir=az://...
  error [nextflow.exception.ProcessFailedException]: Process `PHACNML_IRIDANEXT:IRIDANEXT:CUSTOM_DUMPSOFTWAREVERSIONS (1)` terminated for an unknown reason -- Likely it has been terminated by the external system

This suggests the timeout is happening when running BlobClient.getProperties() in order to get the blob file size (line 207 below):

https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy#L201-L209
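
Paraphrasing those lines (not verbatim; the way the BlobClient is obtained from the path is approximated), the read path is roughly:

    // inside AzFileSystem.newReadableByteChannel(), approximately:
    final BlobClient client = blobClientFor(path)           // paraphrased helper
    final long size = client.getProperties().getBlobSize()  // <-- the HEAD request that times out in our runs
    final stream = client.openInputStream()                 // only reached once getProperties() returns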

When we look at the recorded requests for the .exitcode file on the Azure Blob storage side, we aren't seeing any of these requests, as if something is preventing the request from reaching Azure (or preventing Azure from recording it).

Also, the TooManyRequests message at the top of this log entry is printed as part of the Nextflow Azure retry policies:

https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy#L917-L924

But whatever request that was, it makes it through after a few attempts.

Increased number of retries

We are currently attempting to increase the number of retries (by adjusting the AZURE_REQUEST_RETRY_COUNT environment variable), rather than the timeout for each retry, to see if this helps.

pditommaso commented 1 week ago

The Azure Blob SDK library was updated in the latest Nextflow Edge version, 24.05.0-edge. You may want to give it a try.

Another possibility would be to use the Batch task exit code, via the Batch API, to determine the task status, but I was not able to make it work:

https://github.com/nextflow-io/nextflow/blob/970906737a6c207f8045483badff729b21cd1c42/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy#L118-L118

apetkau commented 1 week ago

Thanks so much @pditommaso 😄 .

We were actually able to get it to work, but it required some changes to the nf-azure plugin to allow us to increase the maximum number of retries in the BlobServiceClient.

Trying to increase the maximum number of tries via the AZURE_REQUEST_RETRY_COUNT environment variable didn't work; it remained at 4 (I believe this comes from the default value set here: https://github.com/Azure/azure-sdk-for-java/blob/a17d6744deb0dd6501e62ca8356c839799423d79/sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/policy/RequestRetryOptions.java#L101).

However, by modifying the BlobServiceClientBuilder code that is part of nf-azure, we were able to adjust the retry options for blob connections. The builder is configured in this bit of code in nf-azure:

https://github.com/nextflow-io/nextflow/blob/f3a86def48d79a77770fec2b71f3fdc396d75dc9/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzHelper.groovy#L215-L218

If this is changed as below to add a RequestRetryOptions (from com.azure.storage.common.policy) to the builder configuration, then the retry options for blob connections can be adjusted:

        return new BlobServiceClientBuilder()
                .credential(credential)
                .endpoint(endpoint)
                .retryOptions(new RequestRetryOptions(RetryPolicyType.EXPONENTIAL, maxTries, tryTimeoutInSeconds, null, null, null))
                .buildClient()

You can see all the changes I made here https://github.com/apetkau/nextflow/compare/v24.04.2...nf-azure-1.6.0-nmlpatch0 (everything else is additional logging statements to help us debug).

I am wondering whether nf-azure could be updated to allow configuring these retry options for the BlobServiceClient? Would it make the most sense to pass in the existing values from the Nextflow Azure config here (azure.retryPolicy.maxAttempts and so on)?
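
For illustration, here is a rough sketch of what that wiring could look like (the method signature and the way the settings are passed in are assumptions on my part, not the actual nf-azure API):

    import com.azure.storage.blob.BlobServiceClient
    import com.azure.storage.blob.BlobServiceClientBuilder
    import com.azure.storage.common.policy.RequestRetryOptions
    import com.azure.storage.common.policy.RetryPolicyType

    // hypothetical: maxAttempts could come from azure.retryPolicy.maxAttempts;
    // tryTimeoutSec has no direct equivalent today and might need a new config option
    BlobServiceClient buildBlobServiceClient(credential, String endpoint, int maxAttempts, Integer tryTimeoutSec) {
        new BlobServiceClientBuilder()
                .credential(credential)
                .endpoint(endpoint)
                .retryOptions(new RequestRetryOptions(RetryPolicyType.EXPONENTIAL,
                        maxAttempts, tryTimeoutSec, null, null, null))
                .buildClient()
    }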

Please let me know if you'd prefer I made a new feature request issue for this.

pditommaso commented 1 week ago

That's great feedback, thank you so much. Tagging @bentsherman and @adamrtalbot for visibility.

vsmalladi commented 1 day ago

@apetkau if you put in a feature request, we can see if we can help implement it.

apetkau commented 14 hours ago

Thanks @vsmalladi, I've made a feature request here: https://github.com/nextflow-io/nextflow/issues/5097