nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.77k stars 630 forks source link

Exception in TaskProcessor may be handled incorrectly #3277

Open bentsherman opened 2 years ago

bentsherman commented 2 years ago

Bug report

I received a log with the following errors:

Oct-02 16:01:22.712 [FileTransfer-5] DEBUG nextflow.file.FilePorter - Copying foreign file [TOWER_API_PATH]/samplesheet.csv to work dir: [AZURE_BLOB_PATH]/samplesheet.csv
Oct-02 16:01:23.179 [FileTransfer-5] WARN  nextflow.file.FilePorter - Unable to stage foreign file: [TOWER_API_PATH]/samplesheet.csv (try 3) -- Cause: Unable to access path: [TOWER_API_PATH]/samplesheet.csv
java.io.IOException: Unable to access path: [TOWER_API_PATH]/samplesheet.csv
    at nextflow.file.http.XFileSystemProvider.readAttributes(XFileSystemProvider.groovy:457)
    at java.base/java.nio.file.Files.readAttributes(Files.java:1851)
    at nextflow.file.CopyMoveHelper.copyToForeignTarget(CopyMoveHelper.java:156)
    at nextflow.file.FileHelper.copyPath(FileHelper.groovy:946)
    at nextflow.file.FilePorter$FileTransfer.stageForeignFile0(FilePorter.groovy:299)
    at nextflow.file.FilePorter$FileTransfer.stageForeignFile(FilePorter.groovy:266)
    at nextflow.file.FilePorter$FileTransfer.run(FilePorter.groovy:252)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Oct-02 16:01:23.509 [Actor Thread 8] INFO  nextflow.processor.TaskProcessor - [null] NOTE: Can't stage file [TOWER_API_PATH]/samplesheet.csv -- reason: Unable to access path: [TOWER_API_PATH]/samplesheet.csv -- Execution is retried (1)
Oct-02 16:01:23.511 [pool-5-thread-2] ERROR nextflow.processor.TaskProcessor - Unable to re-submit task `NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:CONCAT_H5AD`
java.lang.NullPointerException: Cannot invoke "com.google.common.hash.HashCode.asBytes()" because "hash" is null
    at nextflow.processor.TaskProcessor.checkCachedOrLaunchTask(TaskProcessor.groovy:747)
        ...
    at nextflow.processor.TaskProcessor$_checkErrorStrategy_closure19.doCall(TaskProcessor.groovy:1086)
    at nextflow.processor.TaskProcessor$_checkErrorStrategy_closure19.doCall(TaskProcessor.groovy)
        ...
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Oct-02 16:01:23.518 [pool-5-thread-2] DEBUG nextflow.Session - Session aborted -- Cause: Cannot invoke "com.google.common.hash.HashCode.asBytes()" because "hash" is null

Basically, the task processor attempts to stage a foreign file, which fails, and then it attempts to launch a task, and then fails because the task hash is null.

I think the problem is that the task processor's catch-all error handler does not correctly handle this type of error.

Here is where the task is created, prepared, and launched:

    final protected void invokeTask( Object[] args ) {
        // ...

        // create the task run instance
        final task = createTaskRun(params)

        // ...

        // stage foreign input files (fails here)
        makeTaskContextStage2(task, secondPass, count)

        // ...

        // launch the task (should not happen if file staging failed)
        def hash = createTaskHashKey(task)
        checkCachedOrLaunchTask(task, hash, resumable)
    }

This invokeTask is invoked with a TaskProcessorInterceptor:

        def interceptor = new TaskProcessorInterceptor(opInputs, singleton)
        def params = [inputs: opInputs, maxForks: session.poolSize, listeners: [interceptor] ]
        def invoke = new InvokeTaskAdapter(this, opInputs.size())
        session.allOperators << (operator = new DataflowOperator(group, params, invoke))

Ths TaskProcessorInterceptor contains a method onException which triggers the error strategy via resumeOrDie:

        boolean onException(final DataflowProcessor processor, final Throwable error) {
            // return `true` to terminate the dataflow processor
            handleException( error, currentTask.get() )
        }

It seems this handler is design to catch errors from tasks that were launched but failed. However, I think it is also catching this error with the file staging. In this case it should always terminate instead of potentially retrying, otherwise it will attempt to retry an task that isn't fully constructed.

cc @pditommaso

Environment

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

drernie commented 1 year ago

I seem to be having a similar problem:

Sep-04 17:59:43.831 [FileTransfer-1] DEBUG nextflow.file.FilePorter - Copying foreign file quilt+s3://quilt-example#package=examples%2fhurdat to work dir: /Users/quilt/Documents/GitHub/nf-quilt/work/stage-4b764ded-2b62-43a3-9453-2e24b14dfdc5/d5/bf26e57d4415db26b48dbcb4417bbc/quilt-example#package=examples%2fhurdat
Sep-04 17:59:44.078 [FileTransfer-1] DEBUG nextflow.quilt.nio.QuiltPath - QuiltPath.localPath.shouldInstall[true]: quilt-example#package=examples%2fhurdat&path=data%2fatlantic-storms.csv
Sep-04 17:59:44.078 [FileTransfer-1] DEBUG nextflow.quilt.jep.QuiltPackage - QuiltPackage.install: already installed: /var/folders/rr/hp1w0hxd07lgq1y8k9dmnrwr0000gq/T/QuiltPackage5183074415959940109/QuiltPackage.quilt_example_examples_hurdat
Sep-04 17:59:44.078 [FileTransfer-1] DEBUG nextflow.quilt.nio.QuiltPath - QuiltPath.localPath.shouldInstall[false]: quilt-example#package=examples%2fhurdat&path=data%2fatlantic-storms.csv
Sep-04 17:59:44.078 [FileTransfer-1] DEBUG n.quilt.nio.QuiltFileSystemProvider - Creating `newByteChannel`: quilt-example#package=examples%2fhurdat&path=data%2fatlantic-storms.csv <- []
Sep-04 17:59:44.078 [FileTransfer-1] DEBUG nextflow.quilt.nio.QuiltPath - QuiltPath.localPath.shouldInstall[false]: quilt-example#package=examples%2fhurdat&path=data%2fatlantic-storms.csv
Sep-04 17:59:44.079 [FileTransfer-1] DEBUG n.quilt.nio.QuiltFileSystemProvider -    Ensure installed: /var/folders/rr/hp1w0hxd07lgq1y8k9dmnrwr0000gq/T/QuiltPackage5183074415959940109/QuiltPackage.quilt_example_examples_hurdat/data/atlantic-storms.csv
Sep-04 17:59:44.079 [FileTransfer-1] DEBUG nextflow.quilt.jep.QuiltPackage - QuiltPackage.install: already installed: /var/folders/rr/hp1w0hxd07lgq1y8k9dmnrwr0000gq/T/QuiltPackage5183074415959940109/QuiltPackage.quilt_example_examples_hurdat
Sep-04 17:59:44.079 [FileTransfer-1] DEBUG n.quilt.nio.QuiltFileSystemProvider -    Reading from: /var/folders/rr/hp1w0hxd07lgq1y8k9dmnrwr0000gq/T/QuiltPackage5183074415959940109/QuiltPackage.quilt_example_examples_hurdat/data/atlantic-storms.csv
Sep-04 17:59:44.089 [Actor Thread 5] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=transfer (1); work-dir=null
  error [nextflow.exception.ProcessStageException]: Can't stage file quilt+s3://quilt-example#package=examples%2fhurdat -- file does not exist
Sep-04 17:59:44.104 [Actor Thread 5] ERROR nextflow.processor.TaskProcessor - Error executing process > 'transfer (1)'

Caused by:
  Can't stage file quilt+s3://quilt-example#package=examples%2fhurdat -- file does not exist

I've confirmed that newByteChannel does return properly (and the path does exist), so I'm not clear on what's going on.

pditommaso commented 1 year ago

Can you please share the complete error stack trace?

drernie commented 1 year ago

nextflow.log There's no stack track per se. Just the debug logs.

Sep-05 13:43:34.469 [main] DEBUG nextflow.Session - Igniting dataflow network (2)
Sep-05 13:43:34.479 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > transfer
Sep-05 13:43:34.479 [PathVisitor-1] DEBUG nextflow.file.FileHelper - Creating a file system instance for provider: QuiltFileSystemProvider
Sep-05 13:43:34.480 [PathVisitor-1] DEBUG n.quilt.nio.QuiltFileSystemProvider - QuiltFileSystemProvider.getPath`[quilt+s3://quilt-example#package=examples/hurdat]
Sep-05 13:43:34.480 [main] DEBUG nextflow.script.ScriptRunner - > Awaiting termination 
Sep-05 13:43:34.480 [main] DEBUG nextflow.Session - Session await
Sep-05 13:43:34.536 [Actor Thread 5] DEBUG nextflow.quilt.jep.QuiltPackage - QuiltPackage.quilt_example_examples_hurdat: Do NOT pre-install every package (use list API instead)
Sep-05 13:43:34.536 [Actor Thread 5] INFO  nextflow.quilt.jep.QuiltPackage - installing examples/hurdat from quilt-example...
Sep-05 13:43:37.607 [Actor Thread 5] INFO  s.a.a.t.s.p.LoggingTransferListener - Transfer initiated...
Sep-05 13:43:37.644 [Actor Thread 5] INFO  s.a.a.t.s.p.LoggingTransferListener - Transfer initiated...
Sep-05 13:43:37.646 [Actor Thread 5] INFO  s.a.a.t.s.p.LoggingTransferListener - Transfer initiated...
Sep-05 13:43:37.648 [Actor Thread 5] INFO  s.a.a.t.s.p.LoggingTransferListener - Transfer initiated...
Sep-05 13:43:38.201 [AwsEventLoop 6] INFO  s.a.a.t.s.p.LoggingTransferListener - |====================| 100.0%
Sep-05 13:43:38.207 [sdk-async-response-2-0] INFO  s.a.a.t.s.p.LoggingTransferListener - Transfer complete!
Sep-05 13:43:38.240 [AwsEventLoop 5] INFO  s.a.a.t.s.p.LoggingTransferListener - |====================| 100.0%
Sep-05 13:43:38.242 [sdk-async-response-2-1] INFO  s.a.a.t.s.p.LoggingTransferListener - Transfer complete!
Sep-05 13:43:38.314 [AwsEventLoop 3] INFO  s.a.a.t.s.p.LoggingTransferListener - |====================| 100.0%
Sep-05 13:43:38.314 [sdk-async-response-2-2] INFO  s.a.a.t.s.p.LoggingTransferListener - Transfer complete!
Sep-05 13:43:39.830 [AwsEventLoop 4] INFO  s.a.a.t.s.p.LoggingTransferListener - |====================| 100.0%
Sep-05 13:43:39.843 [sdk-async-response-2-3] INFO  s.a.a.t.s.p.LoggingTransferListener - Transfer complete!
Sep-05 13:43:39.844 [Actor Thread 5] INFO  nextflow.quilt.jep.QuiltPackage - done
Sep-05 13:43:39.872 [FileTransfer-1] DEBUG nextflow.file.FilePorter - Copying foreign file quilt+s3://quilt-example#package=examples%2fhurdat to work dir: /Users/quilt/Documents/GitHub/nf-quilt/work/stage-a2996f23-58a3-45ae-a7fa-be35c661b9a4/0b/36d858ee5bd1f81c562742c10902fd/quilt-example#package=examples%2fhurdat
Sep-05 13:43:40.079 [FileTransfer-1] DEBUG nextflow.quilt.jep.QuiltPackage - QuiltPackage.install: already installed: /var/folders/rr/hp1w0hxd07lgq1y8k9dmnrwr0000gq/T/QuiltPackage7609402337477599622/QuiltPackage.quilt_example_examples_hurdat
Sep-05 13:43:40.080 [FileTransfer-1] DEBUG nextflow.quilt.jep.QuiltPackage - QuiltPackage.install: already installed: /var/folders/rr/hp1w0hxd07lgq1y8k9dmnrwr0000gq/T/QuiltPackage7609402337477599622/QuiltPackage.quilt_example_examples_hurdat
Sep-05 13:43:40.080 [FileTransfer-1] DEBUG n.quilt.nio.QuiltFileSystemProvider -    Reading from: /var/folders/rr/hp1w0hxd07lgq1y8k9dmnrwr0000gq/T/QuiltPackage7609402337477599622/QuiltPackage.quilt_example_examples_hurdat/data/atlantic-storms.csv
Sep-05 13:43:40.080 [FileTransfer-1] DEBUG n.quilt.nio.QuiltFileSystemProvider - FileChannel.open: sun.nio.ch.FileChannelImpl@55543023
Sep-05 13:43:40.084 [Actor Thread 5] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=transfer (1); work-dir=null
  error [nextflow.exception.ProcessStageException]: Can't stage file quilt+s3://quilt-example#package=examples%2fhurdat -- file does not exist
Sep-05 13:43:40.103 [Actor Thread 5] ERROR nextflow.processor.TaskProcessor - Error executing process > 'transfer (1)'

Caused by:
  Can't stage file quilt+s3://quilt-example#package=examples%2fhurdat -- file does not exist

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Sep-05 13:43:40.106 [Actor Thread 5] DEBUG nextflow.Session - Session aborted -- Cause: Can't stage file quilt+s3://quilt-example#package=examples%2fhurdat -- file does not exist
Sep-05 13:43:40.119 [Actor Thread 5] DEBUG nextflow.Session - The following nodes are still active:
[process] transfer
  status=ACTIVE
  port 0: (queue) closed; channel: x
  port 1: (cntrl) -     ; channel: $

Sep-05 13:43:40.121 [main] DEBUG nextflow.Session - Session await > all processes finished
Sep-05 13:43:40.121 [main] DEBUG nextflow.Session - Session await > all barriers passed
Sep-05 13:43:40.122 [Actor Thread 3] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=transfer; work-dir=null
  error [java.lang.InterruptedException]: java.lang.InterruptedException
Sep-05 13:43:40.129 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=0; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=0ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=0; peakCpus=0; peakMemory=0; ]
drernie commented 1 year ago

FYI, reproduced the issue as far back as version 22.12.0-edge. The only thing we changed was NOT pre-installing all the files, so maybe there's a code path I'm missing where we need to do just-in-time installation. Will try looking into the source code...

drernie commented 1 year ago

So, the error seems to happen (and get caught without a stack trace) from CopyMoveHelper.copyToForeignTarget

https://github.com/ojdkbuild/lookaside_java-1.8.0-openjdk/blob/master/jdk/src/share/classes/java/nio/file/CopyMoveHelper.java#L100

Which calls:

Maybe I can unit test those to reproduce the error...

bentsherman commented 1 year ago

You can also add a e.printStackTrace() here before the exception is thrown:

https://github.com/nextflow-io/nextflow/blob/a88a4245dc19f36957d7b38e6071639cc5803db2/modules/nextflow/src/main/groovy/nextflow/file/FilePorter.groovy#L330

drernie commented 1 year ago

Ah, thanks!

nextflow.exception.ProcessStageException: Can't stage file quilt+s3://quilt-example#package=examples%2fhurdat -- file does not exist
    at nextflow.file.FilePorter$FileTransfer.stageForeignFile(FilePorter.groovy:332)
    at nextflow.file.FilePorter$FileTransfer.run(FilePorter.groovy:301)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.nio.file.NoSuchFileException: /Users/quilt/Documents/GitHub/nf-quilt/work/stage-845ccbaa-3d21-43db-a1ec-c2dba631302d/14/617f6197b1ce64b02d272966333aaa/quilt-example#package=examples%2fhurdat/data/atlantic-storms.csv
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218)
    at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484)
    at java.base/java.nio.file.Files.newOutputStream(Files.java:228)
    at java.base/java.nio.file.Files.copy(Files.java:3160)
    at nextflow.file.CopyMoveHelper.copyFile(CopyMoveHelper.java:90)
    at nextflow.file.CopyMoveHelper$1.visitFile(CopyMoveHelper.java:134)
    at nextflow.file.CopyMoveHelper$1.visitFile(CopyMoveHelper.java:108)
    at java.base/java.nio.file.Files.walkFileTree(Files.java:2811)
    at nextflow.file.CopyMoveHelper.copyDirectory(CopyMoveHelper.java:140)
    at nextflow.file.CopyMoveHelper.copyToForeignTarget(CopyMoveHelper.java:169)
    at nextflow.file.FileHelper.copyPath(FileHelper.groovy:961)
    at nextflow.file.FilePorter$FileTransfer.stageForeignFile0(FilePorter.groovy:354)
drernie commented 1 year ago

Amusingly, it looks like we broke "CopyMoveHelper#copyToForeignTarget" -- which this comment promised to prevent!

class QuiltFileAttributesView implements BasicFileAttributeView {
    //....
    /**
     * This API is implemented is not supported but instead of throwing an exception just do nothing
     * to not break the method {@link java.nio.file.CopyMoveHelper#copyToForeignTarget(...)}
     *
     * @param lastModifiedTime
     * @param lastAccessTime
     * @param createTime
     * @throws IOException
     */
    @Override
    void setTimes(FileTime lastModifiedTime, FileTime lastAccessTime, FileTime createTime) throws IOException {
    // TBD
    }

}
drernie commented 1 year ago

Anyway, I still think the Exception is being handled sub-optimally, but I have opened an issue on nf-quilt to work on the underlying bug: https://github.com/quiltdata/nf-quilt/issues/124