nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.78k stars 634 forks source link

Validation of task outputs #3711

Open robsyme opened 1 year ago

robsyme commented 1 year ago

Fail tasks where stageOut fails

This is similar to existing issue https://github.com/nextflow-io/nextflow/issues/3372, but instead of being concerned with the publishDir step (run work directory to publish output location), this issue is concerned with the output step (task work directory to run work directory)

Usage scenario

When task outputs are copied from the task working directory to the run work directory using the aws cli, the aws s3 cp process can fail. Nextflow ignores this failure with a || true, which means that in the event of failure, the downstream tasks expecting this input will fail. This makes debugging harder than it should be, as users will be looking for the error in the downstream task rather than in the task that actually failed.

Suggest implementation

I would recommend that we remove the || true on L187 here:

https://github.com/nextflow-io/nextflow/blob/2a224258cdd75321f8b3a239c93fdd688c7dc370/modules/nextflow/src/main/groovy/nextflow/executor/SimpleFileCopyStrategy.groovy#L184-L190

In addition, it may be helpful to provide a more descriptive warning in the event of an upload failure, but I'm not 100% sure how best to communicate that to the Nextflow process.

pditommaso commented 1 year ago

This is somehow related to this comment. @mribeirodantas tried removing || true but reported no difference.

Some Bash hacker may want to give it another try

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jchorl commented 3 months ago

I've hit this as well, when a cp fails for any reason and it silently fails to stage out. Even worse if the output is optional, because nextflow can't differentiate between not-produced and failed-to-stage-out. Seeing the linked issues here would indicate it's somewhat important.

This is somehow related to this comment. @mribeirodantas tried removing || true but reported no difference.

Some Bash hacker may want to give it another try

Do you recall why? in theory if set -e is set and a cp operation fails, the exit code should be non-zero and it should fail.

$ cp doesntexist foo
cp: cannot stat 'doesntexist': No such file or directory
$ echo $?
1
pditommaso commented 2 months ago

Here there's a tentative solution but it was aborted

jorgee commented 2 months ago

I think the @pditommaso solution https://github.com/nextflow-io/nextflow/pull/3858 (add set-e before nxf_unstage()) and the change to wait all nxf_parallel processes https://github.com/nextflow-io/nextflow/pull/4050 should detect the exit codes. The statement wait $p returns the exit code and the set -e is failing the process when the wait returns something different to 0.

I have tested with Azure forcing an error in the stage out. It is detecting the failure and setting the exit code of the azcopy process. I need to check with the other clouds.

Another issue that I see is the print of the errors. Stage-out errors are in the .command.log but not in the .command.err, and it is not printed when there is an error in a process execution.