nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Fair directive - downstream processes stop unexpectedly #3862

Open jbanusco opened 1 year ago

jbanusco commented 1 year ago

Bug report

I have a workflow that runs a data processing pipeline that I would like to apply to several subjects. I have problems with the cache, since some subjects that are already processed are being re-processed every time I run the workflow, so I tried the fair directive. The fair directive seems to solve this issue, but I noticed a possible bug.

Expected behavior and actual behavior

In an HPC cluster [with slurm and local executor]: if I have 10 subjects and subjects 4-6 fail, the pipeline keeps going downstream for subjects 1-3, but not for subjects 7-10. It seems like all the processes that come after an error are stopped. Given that I use 'ignore' as the error strategy, I expected all successful processes to keep going downstream.

Locally on my workstation: I noticed that the behavior is different from the HPC cluster. Here all the downstream processes are stopped.

Steps to reproduce the problem

You can find the code to reproduce the error in this repository: https://github.com/jbanusco/Fair_DummyTest_Bug. Run ./create_files.sh to generate some dummy files and then run workflow.nf.
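For reference, here is a hypothetical minimal sketch of what such a workflow might look like (process names are taken from the log below; the actual repository may differ, and the second downstream process DummyProcess2 is omitted for brevity):

```nextflow
// Hypothetical reconstruction of the reproducer: ReadFiles fails on purpose
// for subjects 4-6, and every process combines `fair` with errorStrategy 'ignore'.
nextflow.enable.dsl = 2

params.data = "$projectDir/data"

process ReadFiles {
    tag "${subject}_ReadFile"
    fair true
    errorStrategy 'ignore'

    input:
    tuple val(subject), path(subject_file)

    output:
    tuple val(subject), path("${subject}.read")

    script:
    """
    # simulate a failure for subjects 4-6
    if echo '${subject}' | grep -qE 'sub-00[456]'; then exit 1; fi
    cp ${subject_file} '${subject}.read'
    """
}

process DummyProcess {
    tag "${subject}_Dummy"
    fair true
    errorStrategy 'ignore'

    input:
    tuple val(subject), path(previous)

    output:
    tuple val(subject), path("${subject}.dummy")

    script:
    """
    cp ${previous} '${subject}.dummy'
    """
}

workflow {
    subjects = Channel
        .fromPath("${params.data}/sub-*.txt")
        .map { f -> [ f.name, f ] }
        .view()

    DummyProcess( ReadFiles(subjects) )
}
```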

Program output

HPC:

N E X T F L O W  ~  version 23.03.0-edge
Launching `workflow.nf` [stupefied_goldwasser] DSL2 - revision: a4288bc1db
[sub-001.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-001.txt]
[sub-002.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-002.txt]
[sub-003.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-003.txt]
[sub-004.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-004.txt]
[sub-005.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-005.txt]
[sub-006.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-006.txt]
[sub-007.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-007.txt]
[sub-008.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-008.txt]
[sub-009.txt, /home/ja1659/Cardiac_Pipeline/old/fair_minimum_error/data/sub-009.txt]
[29/dfbb1a] Submitted process > ReadFiles (sub-001.txt_ReadFile)
[71/7a1a3f] Submitted process > ReadFiles (sub-004.txt_ReadFile)
[15/6ef76a] Submitted process > ReadFiles (sub-008.txt_ReadFile)
[ed/a2ce06] Submitted process > ReadFiles (sub-009.txt_ReadFile)
[07/259c01] Submitted process > ReadFiles (sub-007.txt_ReadFile)
[dd/e09a9c] Submitted process > ReadFiles (sub-002.txt_ReadFile)
[b0/6f1529] Submitted process > ReadFiles (sub-005.txt_ReadFile)
[e7/27e4ee] Submitted process > ReadFiles (sub-003.txt_ReadFile)
[cc/795c55] Submitted process > ReadFiles (sub-006.txt_ReadFile)
[71/7a1a3f] NOTE: Process `ReadFiles (sub-004.txt_ReadFile)` terminated with an error exit status (1) -- Error is ignored
[b0/6f1529] NOTE: Process `ReadFiles (sub-005.txt_ReadFile)` terminated with an error exit status (1) -- Error is ignored
[cc/795c55] NOTE: Process `ReadFiles (sub-006.txt_ReadFile)` terminated with an error exit status (1) -- Error is ignored
[10/285748] Submitted process > DummyProcess (sub-003.txt_Dummy)
[ab/130b40] Submitted process > DummyProcess (sub-001.txt_Dummy)
[79/f6d23c] Submitted process > DummyProcess (sub-002.txt_Dummy)
[1f/ba113c] Submitted process > DummyProcess2 (sub-001.txt_Dummy)
[95/5787f1] Submitted process > DummyProcess2 (sub-003.txt_Dummy)
[38/54e5fa] Submitted process > DummyProcess2 (sub-002.txt_Dummy)

Local:

N E X T F L O W  ~  version 23.03.0-edge
Launching `workflow.nf` [intergalactic_wilson] DSL2 - revision: a4288bc1db
executor >  local (9)
[71/0ed1b8] process > ReadFiles (sub-008.txt_ReadFile) [100%] 9 of 9, failed: 3 ✔
[-        ] process > DummyProcess                     -
[-        ] process > DummyProcess2                    -
[sub-005.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-005.txt]
[sub-001.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-001.txt]
[sub-002.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-002.txt]
[sub-007.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-007.txt]
[sub-006.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-006.txt]
[sub-008.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-008.txt]
[sub-009.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-009.txt]
[sub-003.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-003.txt]
[sub-004.txt, /home/jaume/Desktop/Code/Cardiac_Pipeline/old/fair_minimum_error/data/sub-004.txt]
[00/eca23f] NOTE: Process `ReadFiles (sub-004.txt_ReadFile)` terminated with an error exit status (1) -- Error is ignored
[2d/eaa659] NOTE: Process `ReadFiles (sub-005.txt_ReadFile)` terminated with an error exit status (1) -- Error is ignored
[36/38de2d] NOTE: Process `ReadFiles (sub-006.txt_ReadFile)` terminated with an error exit status (1) -- Error is ignored

Environment

bentsherman commented 1 year ago

This actually makes sense considering how fair and ignore work.

The fair directive ensures that outputs are emitted in the order in which the corresponding inputs were received, so it waits for each output to be produced in order. Since the ignore error strategy simply discards errors, using the two together means that any error will cause the pipeline to hang forever, because fair ends up waiting for an output that will never arrive.

Basically, the fair directive needs to know how to skip outputs that will never be produced because they failed.

By the way, I am curious as to why the fair directive is helping you with cache issues. Why would the order in which outputs are emitted by a process affect the task executions of downstream processes? I suspect you may have written your pipeline in a way that causes non-deterministic behavior. Unless you are explicitly generating random inputs at some point, your pipeline should execute the exact same tasks given the same inputs.
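For example (purely illustrative, not necessarily what this pipeline does), pairing a 'configuration' channel with process outputs by emission order instead of by key is one common way to end up with different task inputs on every run:

```nextflow
// Hypothetical channels, keyed by subject id. Upstream tasks can finish in any
// order, so order-based pairing may match a subject's result with another
// subject's config on the next run, which changes downstream inputs and
// defeats -resume caching. Key-based join is stable.
workflow {
    results = Channel.of( ['sub-001', 'result-001.txt'],
                          ['sub-002', 'result-002.txt'] )

    configs = Channel.of( ['sub-001', 'config-001.json'],
                          ['sub-002', 'config-002.json'] )

    // deterministic: match items on the first element (the subject id)
    results.join(configs).view()    // [sub-001, result-001.txt, config-001.json], ...

    // non-deterministic (avoid): merge pairs items purely by arrival order
    // results.merge(configs).view()
}
```

With join, each subject is always paired with its own config, so resumed runs see identical inputs; with order-based pairing, the inputs depend on task completion order.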

jbanusco commented 1 year ago

Ah, I see! Thank you very much for the clarification, I thought by default the fair directive would skip these outputs.

Regarding the cache issues, you are right. The problem was that the behavior was non-deterministic, although I don't fully understand why. I was creating an initial channel with the 'configuration' files for the processing of each subject. Then, each process generated a new output file indicating whether that processing step was successful or not. The input of each step/process in the pipeline was the configuration file plus the file from the previous process. So I started joining the channel of configuration files with the output of each process based on the subject id, but I was still getting some issues.

Right now I just emit the configuration file as an additional output of each process, so I don't have to join the outputs with the 'configuration' channel anymore. This seems to work and prevents the non-deterministic behavior. Hope I was clear!
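For illustration, a hypothetical sketch of that pattern (process names, file naming convention, and the trivial cat commands are invented): each step takes the subject's config file as an input and re-emits it next to its result, so the next step receives both without any join against a separate 'configuration' channel.

```nextflow
nextflow.enable.dsl = 2

params.data = "$projectDir/data"

process StepOne {
    tag "${subject}"

    input:
    tuple val(subject), path(config), path(data)

    output:
    // includeInputs lets the staged input config file be re-emitted as an output
    tuple val(subject), path(config, includeInputs: true), path("${subject}.step1")

    script:
    """
    cat ${config} ${data} > '${subject}.step1'
    """
}

process StepTwo {
    tag "${subject}"

    input:
    tuple val(subject), path(config), path(previous)

    output:
    tuple val(subject), path(config, includeInputs: true), path("${subject}.step2")

    script:
    """
    cat ${config} ${previous} > '${subject}.step2'
    """
}

workflow {
    // one config + one data file per subject, matched up front by naming convention
    inputs = Channel
        .fromPath("${params.data}/sub-*.txt")
        .map { f -> [ f.baseName, file("${params.data}/${f.baseName}.json"), f ] }

    StepTwo( StepOne(inputs) )
}
```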

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.