apetkau opened 2 weeks ago
I forgot to add: while most of the above `TimeoutException`s occur when attempting to read the `.exitcode` file, sometimes they have occurred when attempting to get the task trace record. Below is an exception from another run (a different run from the exception I provided above):
```
Jun... [TaskFinalizer-1] DEBUG nextflow.processor.TaskHandler - Unable to get task trace record -- cause: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
at reactor.core.Exceptions.propagate(Exceptions.java:396)
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:98)
at reactor.core.publisher.Mono.block(Mono.java:1742)
at com.azure.storage.common.implementation.StorageImplUtils.blockWithOptionalTimeout(StorageImplUtils.java:147)
at com.azure.storage.blob.specialized.BlobClientBase.getPropertiesWithResponse(BlobClientBase.java:1379)
at com.azure.storage.blob.specialized.BlobClientBase.getProperties(BlobClientBase.java:1348)
at nextflow.cloud.azure.nio.AzFileSystem.newReadableByteChannel(AzFileSystem.groovy:207)
at nextflow.cloud.azure.nio.AzFileSystemProvider.newByteChannel(AzFileSystemProvider.groovy:407)
at java.base/java.nio.file.Files.newByteChannel(Files.java:380)
at java.base/java.nio.file.Files.newByteChannel(Files.java:432)
at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422)
at java.base/java.nio.file.Files.newInputStream(Files.java:160)
at java.base/java.nio.file.Files.newBufferedReader(Files.java:2922)
at org.apache.groovy.nio.extensions.NioExtensions.newReader(NioExtensions.java:1436)
at org.apache.groovy.nio.extensions.NioExtensions.getText(NioExtensions.java:400)
at nextflow.trace.TraceRecord.parseTraceFile(TraceRecord.groovy:422)
at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
at nextflow.processor.TaskHandler.getTraceRecord(TaskHandler.groovy:217)
at nextflow.cloud.azure.batch.AzBatchTaskHandler.getTraceRecord(AzBatchTaskHandler.groovy:189)
at nextflow.processor.TaskHandler.safeTraceRecord(TaskHandler.groovy:160)
at nextflow.Session.notifyTaskComplete(Session.groovy:1044)
at nextflow.processor.TaskPollingMonitor.finalizeTask(TaskPollingMonitor.groovy:679)
at jdk.internal.reflect.GeneratedMethodAccessor223.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
at nextflow.processor.TaskPollingMonitor$_checkTaskStatus_lambda8.doCall(TaskPollingMonitor.groovy:666)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Suppressed: java.lang.Exception: #block terminated with an error
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:100)
... 37 common frames omitted
Caused by: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
Suppressed: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)
... 4 common frames omitted
Suppressed: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)
... 4 common frames omitted
Suppressed: java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
... 12 common frames omitted
```
Here, there's more information in the stack trace, including that `Path.text` does ultimately trigger the code in `AzFileSystem.newReadableByteChannel()`:
```
at nextflow.cloud.azure.nio.AzFileSystem.newReadableByteChannel(AzFileSystem.groovy:207)
at nextflow.cloud.azure.nio.AzFileSystemProvider.newByteChannel(AzFileSystemProvider.groovy:407)
at java.base/java.nio.file.Files.newByteChannel(Files.java:380)
at java.base/java.nio.file.Files.newByteChannel(Files.java:432)
at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422)
at java.base/java.nio.file.Files.newInputStream(Files.java:160)
at java.base/java.nio.file.Files.newBufferedReader(Files.java:2922)
at org.apache.groovy.nio.extensions.NioExtensions.newReader(NioExtensions.java:1436)
at org.apache.groovy.nio.extensions.NioExtensions.getText(NioExtensions.java:400)
at nextflow.trace.TraceRecord.parseTraceFile(TraceRecord.groovy:422)
```
That is, the read starts in `TraceRecord.parseTraceFile()` and ultimately ends up in `AzFileSystem.newReadableByteChannel()`.
In this case, the exception appears to occur when getting the properties of the file (which might be why the stack trace is different).
Thanks for the detailed reporting. You are right, there's no retry on the Nextflow side since it's expected to be retried by the underlying Azure SDK.
It looks like more of an issue in the SDK implementation. Not sure how to address it.
Thanks so much for the response @pditommaso 😄
We are still working out the exact cause of the issue and a solution for it. Here's a bit of an update, in case what we've tried is useful to others.
We increased the Azure HTTP logging by setting `AZURE_HTTP_LOG_DETAIL_LEVEL=HEADERS` (as described in https://learn.microsoft.com/en-us/azure/developer/java/sdk/logging-overview#configure-http-logging-with-an-environment-variable). This gives us a bit more detail in the Nextflow logs:
```
Jun-18 17:52:30.142 [Task monitor] DEBUG n.cloud.azure.batch.AzBatchService - Azure TooManyRequests response error - attempt: 1; reason: java.net.SocketTimeoutException: timeout
Jun-18 17:54:30.429 [Task monitor] DEBUG n.cloud.azure.batch.AzBatchService - Azure TooManyRequests response error - attempt: 2; reason: java.net.SocketTimeoutException: timeout
Jun-18 17:54:30.532 [Task monitor] INFO c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP request","method":"HEAD","url":"[...].exitcode","tryCount":1, ...}
Jun-18 18:04:30.577 [reactor-http-epoll-1] WARN r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun-18 18:04:30.605 [reactor-http-epoll-1] WARN c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP FAILED","exception":"Channel response timed out after 600000 milliseconds.", ...}
Jun-18 18:04:34.609 [parallel-3] INFO c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP request","method":"HEAD","url":"[...].exitcode","tryCount":2}
Jun-18 18:14:34.712 [reactor-http-epoll-4] WARN r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun-18 18:14:34.714 [reactor-http-epoll-4] WARN c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP FAILED","exception":"Channel response timed out after 600000 milliseconds.", ...}
Jun-18 18:14:46.716 [parallel-4] INFO c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP request","method":"HEAD","url":"[...].exitcode","tryCount":3, ...}
Jun-18 18:24:46.732 [reactor-http-epoll-2] WARN r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun-18 18:24:46.734 [reactor-http-epoll-2] WARN c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP FAILED","exception":"Channel response timed out after 600000 milliseconds.", ...}
Jun-18 18:25:14.736 [parallel-1] INFO c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP request","method":"HEAD","url":"[...].exitcode","tryCount":4, ...}
Jun-18 18:35:14.837 [reactor-http-epoll-3] WARN r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun-18 18:35:14.838 [reactor-http-epoll-3] WARN c.a.s.b.i.B.getProperties - {"az.sdk.message":"HTTP FAILED","exception":"Channel response timed out after 600000 milliseconds.", ...}
Jun-18 18:35:14.844 [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Cannot read exit status for task: `PHACNML_IRIDANEXT:IRIDANEXT:CUSTOM_DUMPSOFTWAREVERSIONS (1)` | java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
Jun-18 18:35:14.992 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 6; name: PHACNML_IRIDANEXT:IRIDANEXT:CUSTOM_DUMPSOFTWAREVERSIONS (1); status: COMPLETED; exit: -; error: -; workDir: az://...]
Jun-18 18:35:15.024 [TaskFinalizer-6] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=PHACNML_IRIDANEXT:IRIDANEXT:CUSTOM_DUMPSOFTWAREVERSIONS (1); work-dir=az://...
error [nextflow.exception.ProcessFailedException]: Process `PHACNML_IRIDANEXT:IRIDANEXT:CUSTOM_DUMPSOFTWAREVERSIONS (1)` terminated for an unknown reason -- Likely it has been terminated by the external system
```
This suggests the timeout is happening when running `BlobClient.getProperties()` in order to get the blob file size (line 207 of `AzFileSystem.groovy`, linked in the original report below).
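A rough sketch of that section, paraphrased from the linked source (names and structure approximate, reconstructed from the stack trace rather than copied verbatim):

```groovy
// Paraphrased sketch of AzFileSystem.newReadableByteChannel() -- approximate;
// see the permalink in the original report below for the exact code
protected newReadableByteChannel(AzPath path) {
    final client = path.blobClient()
    // ~line 207: fetch the blob properties to get its size -- this is the
    // getProperties() call that times out after 600000 ms in our runs
    final size = client.getProperties().getBlobSize()
    // ~line 208: open the download stream that backs the byte channel
    final stream = client.openInputStream()
    // ... wrap `stream` and `size` in a SeekableByteChannel implementation ...
}
```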
When we look into the recorded requests to the `.exitcode` file on the Azure Blob storage end, we aren't seeing any of these requests. It's as if something is preventing the request from reaching Azure (or preventing Azure from recording it).
Also, the `TooManyRequests` message at the top of this log entry is printed as part of the Nextflow Azure retry policies (in `AzBatchService`). But whatever request this was, it made it through after a few attempts.
We are currently attempting to increase the number of retries (by adjusting the `AZURE_REQUEST_RETRY_COUNT` environment variable), rather than the timeout for each retry, to see if this helps.
The Azure Blob SDK library was updated in the latest Nextflow Edge version, 24.05.0-edge. You may want to give it a try.
Another possibility would be to use the Batch task exit code, via the Batch API, to determine the task status. But I was not able to make it work.
Thanks so much @pditommaso 😄.
We were actually able to get it to work, but it required some changes to the `nf-azure` plugin to allow us to increase the maximum number of retries in the `BlobServiceClient`.
Trying to increase the maximum number of tries via the `AZURE_REQUEST_RETRY_COUNT` environment variable didn't work; it remained at 4 (I believe this comes from the default value set here: https://github.com/Azure/azure-sdk-for-java/blob/a17d6744deb0dd6501e62ca8356c839799423d79/sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/policy/RequestRetryOptions.java#L101).
However, by modifying the `BlobServiceClientBuilder` code that is part of `nf-azure`, we were able to adjust the retry options for blob connections. This is configured where `nf-azure` builds the `BlobServiceClient`. If that code is changed as below, adding a `RequestRetryOptions` to the configuration, then the retry options for connections can be adjusted:
```groovy
import com.azure.storage.blob.BlobServiceClientBuilder
import com.azure.storage.common.policy.RequestRetryOptions
import com.azure.storage.common.policy.RetryPolicyType

// Build the blob client with explicit retry options instead of the SDK defaults
return new BlobServiceClientBuilder()
        .credential(credential)
        .endpoint(endpoint)
        .retryOptions(new RequestRetryOptions(RetryPolicyType.EXPONENTIAL, maxTries, tryTimeoutInSeconds, null, null, null))
        .buildClient()
```
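For reference, the six-argument `RequestRetryOptions` constructor is `(retryPolicyType, maxTries, tryTimeoutInSeconds, retryDelayInMs, maxRetryDelayInMs, secondaryHost)`; passing `null` for the last three keeps the SDK defaults for those values.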
You can see all the changes I made here https://github.com/apetkau/nextflow/compare/v24.04.2...nf-azure-1.6.0-nmlpatch0 (everything else is additional logging statements to help us debug).
I am wondering if `nf-azure` could be updated to include the ability to configure these retry options for the `BlobServiceClient`? Would it make the most sense to pass in the existing values from the Nextflow config for Azure here (`azure.retryPolicy.maxAttempts` and so on)?
Please let me know if you'd prefer I made a new feature request issue for this.
That's great feedback. Thank you so much. Tagging @bentsherman and @adamrtalbot for visibility.
@apetkau, if you put in a feature request, we can see if we can help implement it.
Thanks @vsmalladi, I've made a feature request here: https://github.com/nextflow-io/nextflow/issues/5097
Bug report
We are using Nextflow with Azure Batch to process collections of microbial genomes (whole-genome sequence data). We have begun testing out processing larger collections of genomes and have been encountering issues with some of the tasks run by Nextflow, which cause Nextflow to fail the task with a `java.util.concurrent.TimeoutException`. This primarily occurs when attempting to read the `.exitcode` file for a task from blob storage, which causes Nextflow to fail the task and return `Integer.MAX_VALUE` as the exit code. For example, see below for more context in the error message.

This behavior occurs only when scaling up to many genomes, and only impacts some of our runs. It also seems to impact random tasks/processes (in the above case, it is when running `QUAST`, but it occurs for random processes in the full pipeline). The pipeline we are using is https://github.com/phac-nml/mikrokondo/. I have observed it in other pipelines, but much less frequently. I believe it occurs in this pipeline since it does a lot of processing and may take up to 2 hours to process a genome.

Expected behavior and actual behavior
I would expect all of our pipeline executions to complete successfully and for there to be no `TimeoutException`s when reading the outputs of a task (e.g., the `.exitcode` file in Azure blob storage).

Steps to reproduce the problem
As this is an issue that occurs mainly with large-scale analysis of genomes within Azure, and does not happen every time, it is a bit more difficult to provide a specific set of steps to reproduce the issue. However, here is a rough sketch:
Program output
I am unfortunately unable to share the full `nextflow.log`, but here is the relevant section.
What I have observed is that the below exception prints out 3 times, and then on the fourth time the task fails:
Note that the default timeout is 60,000 milliseconds (1 minute), but we modified it to be 600,000 milliseconds (10 minutes) by adjusting the `AZURE_REQUEST_RESPONSE_TIMEOUT` environment variable, as described here: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/core/azure-core/README.md#http-timeouts. This didn't fix the problem though; it just increased the time before failure.

Environment
Additional context
We have spent a bit of time trying to identify the cause of this issue, but haven't been able to isolate and address it. However, we have some additional information on the locations in code where this is happening. I'm providing this information in case it is useful, but please feel free to skip the below (I might be wrong in some of this too).
1. TimeoutException failing a task
The `TimeoutException` that finally triggers failing a task is logged in the following line (the `Cannot read exit status for task` message shown in the logs above). This occurs in this section of Nextflow code (the `AzBatchTaskHandler`):
https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy#L170-L178
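For readers without the embed, the linked section boils down to roughly the following (a paraphrase, not a verbatim copy):

```groovy
// Paraphrased from the linked AzBatchTaskHandler.groovy lines: read the task
// exit status from the .exitcode blob, falling back to Integer.MAX_VALUE when
// the read fails (e.g. on the TimeoutException seen in the logs above)
protected int readExitFile() {
    try {
        exitFile.text as Integer
    }
    catch( Exception e ) {
        log.debug "[AZURE BATCH] Cannot read exit status for task: `$task.name` | ${e.message}"
        Integer.MAX_VALUE
    }
}
```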
That is, it's attempting to read the exit status via the `exitFile.text` property. The `exitFile` is of type `AzPath`, which is a `Path`.

2. Reading from Path.text in Azure
I am not as familiar with how Groovy ultimately handles `Path.text` for reading and returning the contents of a file as text, but in Nextflow with Azure I think this ultimately runs this bit of code in `AzFileSystem` to open a stream to the file in blob storage:
to open up a stream to the file in blob storage:https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy#L201-L217
3. BlobClient.openInputStream()
Line 208 above runs `client.openInputStream()`. The `client` object is of type `BlobClient`, which extends `BlobClientBase`, and `openInputStream()` is defined here in the Azure Java SDK:
is defined here in the Azure Java SDK:https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/storage/azure-storage-blob/src/main/java/com/azure/storage/blob/specialized/BlobClientBase.java#L292-L300
4. Azure Java SDK HTTP Pipeline
The Azure Java SDK seems to use an HTTP Pipeline for defining different steps to take when handling an API request to Azure: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#http-pipeline
In particular, there are configurable retry policies that are part of the Azure SDK: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#common-http-pipeline-policies
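For example, azure-core's generic retry configuration looks like this (a sketch with illustrative values; note the storage clients use the storage-specific `RequestRetryOptions` shown earlier in this thread rather than this generic mechanism):

```groovy
import com.azure.core.http.policy.ExponentialBackoffOptions
import com.azure.core.http.policy.RetryOptions
import java.time.Duration

// Generic azure-core retry settings; client builders that accept RetryOptions
// fold this into their HTTP pipeline as a retry policy
def retryOptions = new RetryOptions(
        new ExponentialBackoffOptions()
                .setMaxRetries(8)                       // illustrative
                .setBaseDelay(Duration.ofSeconds(1))    // illustrative
                .setMaxDelay(Duration.ofSeconds(60)))   // illustrative
```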
5. Azure/HTTP Pipeline retry policies
This brings us to the other part of the exception in the Nextflow log files (the `Channel response timed out after 600000 milliseconds` messages from `AzureSdkHandler`). This appears to be triggered by this code:
https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/core/azure-core-http-netty/src/main/java/com/azure/core/http/netty/implementation/AzureSdkHandler.java#L199-L207
6. Nextflow retry policies
I do know there are `azure.retryPolicy.*` parameters that can be adjusted in the Nextflow config: https://www.nextflow.io/docs/latest/config.html#config-azure (see the sketch below). We have tried adjusting them, but I'm guessing that when reading the `.exitcode` file from Azure blob storage these policies aren't being applied, so it falls back to the default retry policies configured within the Azure SDK. I'm not sure if this is expected to cause any issues?

Also, some of the above description of the code might be incorrect. I have been running up against my own lack of knowledge of how everything works, which makes it hard to debug.
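For reference, the documented `azure.retryPolicy` options mentioned above look like this in a Nextflow config (values are illustrative, not recommendations):

```groovy
// nextflow.config -- azure.retryPolicy settings from the Nextflow config docs
azure {
    retryPolicy {
        maxAttempts = 10       // maximum number of retry attempts
        delay       = '500ms'  // initial retry delay
        maxDelay    = '90s'    // cap on the retry delay
        jitter      = 0.25     // jitter factor applied to the delay
    }
}
```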