nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Workflow option `failOnIgnore` causes workflow to hang #5291

Closed: kgalens closed this issue 1 month ago

kgalens commented 1 month ago

Bug report

Expected behavior and actual behavior

When the `ignore` errorStrategy is combined with the workflow option `failOnIgnore`, the pipeline hangs when a task fails instead of terminating.

Steps to reproduce the problem

workflow.nf

process process1 {
    input:
    val sample_id

    output:
    val sample_id, emit: sample_ids

    script:
    """
    if [[ $sample_id == "SAMP1" ]]; then
        exit 2
    fi
    ls -lah .*
    """
}

process process2 {
    input:
    val ready

    output:
    stdout

    script:
    """
    ls -lah .*
    """

}

workflow {
    input_channel = channel.of("SAMP1", "SAMP2", "SAMP3")
    process1(input_channel)
    process2(process1.out.sample_ids.collect())
}

Nextflow Config

workflow {
    failOnIgnore = true
}
process {
  errorStrategy = 'ignore'
}

I would expect that the workflow would complete with a non-zero exit status.

Program output


 N E X T F L O W   ~  version 24.05.0-edge

Launching `/path/to/workflows/nextflow/hello_world/main.nf` [infallible_cajal] DSL2 - revision: fe2c285334

executor >  local (3)
[09/916c3a] process1 (2) [100%] 3 of 3, failed: 1 ✔
[-        ] process2     [  0%] 0 of 1
[f2/61fcf1] NOTE: Process `process1 (1)` terminated with an error exit status (2) -- Error is ignored

At this point the run hangs and never finishes.
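To tell a hang apart from the expected non-zero exit, the run can be wrapped in a time limit. The following is a minimal sketch, not part of the original report: `classify_run` is a hypothetical helper, and it assumes GNU coreutils `timeout` is available (it exits with status 124 when the limit is reached). The file names in the commented usage line are assumptions based on the names above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a time limit and report whether it
# succeeded, failed with a non-zero status, or had to be killed (a hang).
# Assumes GNU coreutils `timeout`, which exits with 124 on timeout.
classify_run() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang: timed out after ${limit}s"
    elif [ "$status" -eq 0 ]; then
        echo "success: exit 0"
    else
        echo "failed: exit ${status}"
    fi
    return "$status"
}

# Usage (assumes `nextflow` is on PATH and the config is saved as nextflow.config):
#   classify_run 300 nextflow run workflow.nf -c nextflow.config
```

With the behavior described in this report, such a wrapper would be expected to print the "hang" line, whereas the expected behavior would produce the "failed" line. A command that itself exits with 124 would be misreported as a hang, so the sketch is only a rough check.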


adamrtalbot commented 1 month ago

I can reproduce the issue on 24.08.0-edge:

> /usr/local/bin/nextflow-24.08.0-edge run .
N E X T F L O W  ~  version 24.08.0-edge
Launching `./main.nf` [focused_blackwell] DSL2 - revision: bc82ab126c
[af/ce2eeb] Submitted process > process1 (3)
[f7/ada6f2] Submitted process > process1 (1)
[12/e82491] Submitted process > process1 (2)
[f7/ada6f2] NOTE: Process `process1 (1)` terminated with an error exit status (2) -- Error is ignored

(hangs forever)

adamrtalbot commented 1 month ago

failOnError.nextflow.log

bentsherman commented 1 month ago

It looks like process1 completes, then the process2 task is scheduled, but never run:

Sep-09 20:52:35.981 [Actor Thread 2] TRACE nextflow.processor.TaskProcessor - Invoking task > process2 with params=id=4; index=1; values=[[SAMP2, SAMP3], true]
Sep-09 20:52:35.981 [Actor Thread 12] TRACE nextflow.processor.TaskProcessor - <process2> Process state changed to: StateObj[submitted: 1; completed: 0; poisoned: false ] -- finished: false
Sep-09 20:52:35.981 [Actor Thread 11] TRACE nextflow.processor.TaskProcessor - <process2> Control message arrived $ => groovyx.gpars.dataflow.operator.PoisonPill@e3b762d
Sep-09 20:52:35.982 [Actor Thread 11] TRACE nextflow.processor.TaskProcessor - <process2> Poison pill arrived; port: 1
Sep-09 20:52:35.982 [Actor Thread 2] TRACE nextflow.processor.TaskContext - Binding names for 'process2' > []
Sep-09 20:52:35.983 [Actor Thread 12] TRACE nextflow.processor.StateObj - <process2> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
Sep-09 20:52:35.983 [Actor Thread 12] TRACE nextflow.processor.TaskProcessor - <process2> Process state changed to: StateObj[submitted: 1; completed: 0; poisoned: true ] -- finished: false
Sep-09 20:52:35.986 [Actor Thread 2] TRACE nextflow.processor.TaskProcessor - [process2] Store dir not set -- return false
Sep-09 20:52:35.989 [Actor Thread 2] TRACE nextflow.processor.TaskProcessor - [process2] Cacheable folder=null -- exists=false; try=1; shouldTryCache=false; entry=null
Sep-09 20:52:35.991 [Actor Thread 2] TRACE nextflow.processor.TaskProcessor - [process2] actual run folder: /home/bent/projects/sketches/work/d3/ba17a52e118f36fd05c1434927dd8a
Sep-09 20:52:35.995 [Actor Thread 2] TRACE n.processor.TaskPollingMonitor - Scheduled task > TaskHandler[id: 4; name: process2; status: NEW; exit: -; error: -; workDir: /home/bent/projects/sketches/work/d3/ba17a52e118f36fd05c1434927dd8a]
Sep-09 20:52:35.996 [Actor Thread 2] TRACE nextflow.processor.TaskProcessor - <process2> After run
Sep-09 20:52:35.996 [Actor Thread 11] TRACE nextflow.processor.TaskProcessor - <process2> After stop
Sep-09 20:52:36.036 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 0 (iteration: 9)
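An excerpt like the one above can be pulled out of a full log with a plain text filter. This is a rough sketch under stated assumptions: `process_events` is a hypothetical helper, the log is assumed to be Nextflow's default `.nextflow.log` in the launch directory, and trace-level logging is assumed to have been enabled for the run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every log line that mentions the given process
# name, plus the task polling monitor's own messages.
# The second argument defaults to .nextflow.log (assumed location).
process_events() {
    local name="$1"
    local log="${2:-.nextflow.log}"
    # -F treats both patterns as fixed strings, not regular expressions.
    grep -F -e "$name" -e "TaskPollingMonitor" "$log"
}

# Usage (log path is an assumption):
#   process_events process2 .nextflow.log
```

Because the match is a plain substring, a name such as `process1` would also match `process10`; a stricter pattern would be needed for pipelines with similarly named processes.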

In fact, if I comment out process2 then the run finishes. Strange that it only happens with failOnIgnore.

Right now I suspect a race condition in the task polling monitor that prevents it from submitting the task even though it should be able to.

bentsherman commented 1 month ago

Bingo:

https://github.com/nextflow-io/nextflow/blob/6e866ae81ff3bf8a9729e9dbaa9dd89afcb81a4b/modules/nextflow/src/main/groovy/nextflow/processor/TaskPollingMonitor.groovy#L586-L590

https://github.com/nextflow-io/nextflow/blob/6e866ae81ff3bf8a9729e9dbaa9dd89afcb81a4b/modules/nextflow/src/main/groovy/nextflow/Session.groovy#L829