nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.69k stars 620 forks source link

Defer error exit status to end of pipeline run #4446

Closed adamrtalbot closed 3 months ago

adamrtalbot commented 10 months ago

really go as far as possible in running the pipeline, but then exit 1 at the very end if any errors were ignored.

This is a pretty useful feature. Imagine the scenario where you have 96 independent samples, if 1 fails and it brings the whole pipeline crashing down that's annoying, but if you use errorStrategy = 'ignore' then you have to write your own solution. We could have process.errorStrategy = 'catch'. This might be a separate ticket though.

Having some form of introspection on tasks in the workflow.onComplete block would also help for similar reasons and be more generally useful.

Originally posted by @adamrtalbot in https://github.com/nextflow-io/nextflow/issues/4365#issuecomment-1759191011

pditommaso commented 10 months ago

Is it not finish doing this?

ewels commented 10 months ago

No, finish allows the currently running jobs to complete. This would tell Nextflow to continue submitting new jobs for as long as it can before dying.

The use case is for automated runs. If there is one bad sample out of hundreds, the run will fail early and need to wait until it can be manually restarted. Then the whole pipeline needs to run through, which is slow. With this option, the other samples would all process so when the pipeline is resumed without the bad sample it would complete very quickly.

pditommaso commented 10 months ago

yeah, you are right. I was trying to be lazy 😆

bentsherman commented 10 months ago

We could either change ignore to return a non-zero exit code or add a new strategy like ignoreThenFail, which one would you guys prefer?

pinin4fjords commented 10 months ago

I think ignnoreThenFail is the more generally understandable solution.

adamrtalbot commented 10 months ago

Agreed.

adamrtalbot commented 10 months ago

If you want to be fancy, we could call it defer but I'd like a non-native speaker to check that makes sense.

ewels commented 10 months ago

I generally prefer verbose and clear, for me ignoreThenFail is the winner in this list 👆🏻 defer is more likely to send me scurrying to the docs.

boothms commented 7 months ago

This is the solution that I was looking for. Is it still under consideration ?

ghost commented 7 months ago

I would like to second @boothms message. This would greatly improve the runs with multiple samples where at least one process for one sample might fail. It would be great if the progress of the execution could continue for the rest of the healthy samples.

ghost commented 7 months ago

@bentsherman , thank you so much for jumping on this. I have a quick question about the implementation.

If one sets the errorStrategy 'ignoreThenFail', that implies that the task will never be retried, correct?

Would there be a place where these 2 are not mutually exclusive? Try the maxRetries, and then if it fails, still have the possibility to defer the exit status to the end of the pipeline?

bentsherman commented 7 months ago

You could do that with a closure:

process.errorStrategy = { task.attempt < 3 ? 'retry' : 'ignoreThenFail' }
ghost commented 7 months ago

This is great. Thank you once again.

I'm really looking forward to testing this.

ghost commented 6 months ago

Hi folks, I saw that the PR is almost done. Do you have an estimate on when this might be merged?

adamrtalbot commented 5 months ago

Relevant community discussion: https://community.seqera.io/t/handling-single-sample-failures/618/3

pditommaso commented 3 months ago

So ignoreThenFail cannot be used because errorStrategy cannot be longer than 10 chars.

Two choices:

  1. find a shorted name
  2. turn this setting into a nextflow config option (not an errorStrategy) e.g. nextflow.failOnIgnoredErrors = true
pinin4fjords commented 3 months ago

Where does the 10 character limit derive from?

But from conversations with users, I think the errorStrategy is the most intuitive thing, the thing people expect to find.

ignoreFail? Though that sounds a bit like we're ignoring a failure.

adamrtalbot commented 3 months ago

My other suggestion was defer, which isn't as clear but is shorter than 10 characters.

Like Jon, I have to ask where does the 10 character limit come from?

pditommaso commented 3 months ago

where does the 10 character limit come from

The platform db schema

pinin4fjords commented 3 months ago

deferFail?

FriederikeHanssen commented 3 months ago

delayFail ?

pinin4fjords commented 3 months ago

I asked ChatGPT:

pditommaso commented 3 months ago

and the winner is failAtEnd 😄

bentsherman commented 3 months ago

I think I'm coming around to ignoreFail and just rely the language server to provide an inline explanation (i.e. hover hint, code completion) if the user isn't clear on the meaning

pditommaso commented 3 months ago

I'm still not convinced about the best to proceed with this. Apart from the strategy name length, I'd like to avoid introducing a new error strategy for doing a similar thing to ignore, above all because this is going to impact some logic in Platform about how error are reported, etc.

I've made a variation in which the user is expected to use the string ignoreThenFail as error strategy, however behind the scenes is mapped to ignore, but still causing the failure on completion.

https://github.com/nextflow-io/nextflow/pull/5066

In this way, the changes is transparent from the point of view of Platform. However I fear that it can be difficult to determine why it failed because tasks will report ignore as retry action

pditommaso commented 3 months ago

I've made two other implementation for this feature that prevent the propagation of a new error strategy literal.

PR #5066 implements a pseudo-error strategy named ignoreThenFail, which is replaced behind the scenes with ignore and triggers the expected behaviour. However it sounds a bit hacky and not transparent the behavioural.

PR #5071 implements the same logic as a nextflow config setting.

bentsherman commented 3 months ago

I think I prefer #5071. When the workflow fails due to failOnIgnore, we can say as much in the error message so that it's clear why the workflow failed