kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] allow jobs to fail #7330

Open yarnabrina opened 2 years ago

yarnabrina commented 2 years ago

Feature Area

/area backend

What feature would you like to see?

Allow failure of jobs: if an operation fails, do not fail the pipeline. Allow the pipeline to continue to the next stage, which may then fail if its prerequisites are not met.

What is the use case or pain point?

In machine learning pipelines, it is fairly common to train multiple models, or different configurations of the same model, often on a subset of the training data. After these are trained, they are usually compared on some metric, the best model is chosen, and that model is retrained on the entire training data to produce the final trained model.

If someone uses kfp.dsl.ParallelFor to run the different models, a failure in one of them causes the entire pipeline to fail, and the successfully trained models are lost. Even if the next stage, the one that compares on the metric, supports comparing only the available (i.e. successful) models, the pipeline failure costs the time spent training them, as one has to restart from scratch. With the requested feature, failed operations would display a warning (maybe ⚠️) and the pipeline would continue to the final training step. Depending on whether that step supports comparing a subset of all models, it would either proceed as if the failed models were not there, or fail at that point.
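
A minimal sketch of the pattern described above, assuming the KFP SDK v2 and its dsl.Collected fan-in; the train and pick_best components are hypothetical placeholders:

```python
from kfp import dsl

@dsl.component
def train(config: str) -> str:
    # hypothetical: train one candidate model; may fail for some configs
    return f"model-{config}"

@dsl.component
def pick_best(models: list) -> str:
    # hypothetical: compare the candidates on some metric, return the winner
    return models[0]

@dsl.pipeline(name="model-selection")
def model_selection(configs: list = ["model-a", "model-b", "model-c"]):
    with dsl.ParallelFor(items=configs) as config:
        trained = train(config=config)
    # today, a single failed train task fails the whole run,
    # so pick_best never sees the candidates that did succeed
    pick_best(models=dsl.Collected(trained.output))
```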

Very similar functionality is available in a few CI tools. For example, GitLab CI has allow_failure, Travis CI has allow_failures, etc.

Is there a workaround currently?

It is possible to do very broad top-level exception handling to suppress failures. However, in this way the fact that a step failed is buried in the logs and not displayed in the pipeline dashboard. In scheduled pipelines, where no one really goes through the logs of all "successful" runs, these failures will go unnoticed.
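
For illustration, the suppression workaround amounts to something like this inside a step's entrypoint; run_training is a hypothetical stand-in for the real step logic:

```python
def run_training() -> None:
    # hypothetical training logic; may raise on failure
    raise RuntimeError("training diverged")

def main() -> None:
    try:
        run_training()
    except Exception as exc:
        # the failure is only visible in this log line; the process
        # still exits 0, so the dashboard shows the step as successful
        print(f"training failed, continuing anyway: {exc}")

if __name__ == "__main__":
    main()
```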


Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

chensun commented 2 years ago

Making it configurable sounds fine to me.

A possible workaround for now, albeit a hacky one: you could put your training into an exit handler task. In that case, it would run regardless of whether the upstream tasks succeed or fail.
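
A minimal sketch of that workaround, assuming KFP SDK v2-style components; train_candidate and final_training are hypothetical:

```python
from kfp import dsl

@dsl.component
def train_candidate(config: str) -> str:
    # hypothetical training step; may fail for some configs
    return f"trained-{config}"

@dsl.component
def final_training():
    # the exit task: compare whatever candidates are available
    # and train the final model
    print("comparing available models and training the final one")

@dsl.pipeline(name="allow-failure-via-exit-handler")
def train_with_exit_handler():
    exit_task = final_training()
    # everything inside the handler may succeed or fail;
    # exit_task runs either way once the wrapped tasks finish
    with dsl.ExitHandler(exit_task):
        train_candidate(config="model-a")
        train_candidate(config="model-b")
```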

yarnabrina commented 2 years ago

Hi @chensun, thanks for taking this. It would be very nice to have.

Regarding the workaround you suggested, can you please give some more details? I did find kfp.dsl.ExitHandler before, but there doesn't seem to be much documentation or many examples online, and in my attempts I failed to attach an exit task to a few specific operations; the compiler complained that it has to be global for the whole pipeline.

marrrcin commented 2 years ago

This is related to https://github.com/kubeflow/pipelines/issues/6749

A general workaround is to always return status code 0 from the pipeline steps and instead return some output (e.g. a string OK or FAIL), which can then be chained with dsl.Condition to decide whether to continue that branch of the ParallelFor.

This workaround does not affect the overall pipeline status, so there are no warning ⚠️ signs / red statuses - everything is green ✅.
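
A minimal sketch of that workaround, assuming the KFP SDK v2; train_safely and evaluate are hypothetical components:

```python
from kfp import dsl

@dsl.component
def train_safely(config: str) -> str:
    # catch everything and report status as an output
    # instead of exiting non-zero
    try:
        # ... real training would go here ...
        return "OK"
    except Exception:
        return "FAIL"

@dsl.component
def evaluate(config: str):
    print(f"evaluating the model trained from {config}")

@dsl.pipeline(name="status-string-workaround")
def status_string_pipeline(configs: list = ["model-a", "model-b"]):
    with dsl.ParallelFor(items=configs) as config:
        status = train_safely(config=config)
        # only continue this branch if training reported success;
        # the training step itself is always green on the dashboard
        with dsl.Condition(status.output == "OK"):
            evaluate(config=config)
```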

marrrcin commented 2 years ago

@yarnabrina, @chensun I've created a pull request implementing this behaviour; I would really appreciate your feedback on it: https://github.com/kubeflow/pipelines/pull/7373.

thesuperzapper commented 2 years ago

I think most situations can be handled with a kfp.dsl.ExitHandler, which runs a single ContainerOp regardless of whether the ContainerOps it wraps succeed or fail.

But we might consider making functionality like ExitHandler more "implicit" by having an Airflow-style trigger_rule flag on the ContainerOp. (Proposed in issue: https://github.com/kubeflow/pipelines/issues/2372)
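
For reference, this is roughly what the Airflow pattern being alluded to looks like (an illustrative sketch assuming Airflow 2.4+, not KFP API):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG("allow_failure_example", start_date=datetime(2022, 1, 1), schedule=None) as dag:
    train_a = PythonOperator(task_id="train_a", python_callable=lambda: print("train a"))
    train_b = PythonOperator(task_id="train_b", python_callable=lambda: print("train b"))
    compare = PythonOperator(
        task_id="compare",
        python_callable=lambda: print("comparing candidates"),
        # run once all upstream tasks finish, whether they succeeded or not
        trigger_rule=TriggerRule.ALL_DONE,
    )
    [train_a, train_b] >> compare
```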

marrrcin commented 2 years ago

I don't fully understand what "most situations" covers here. Are multiple exit handlers supported in KFP now? As far as I can see, no: https://github.com/kubeflow/pipelines/blob/e2687ce5c22455bbde0ababb3ad46588d5ed7939/sdk/python/kfp/compiler/compiler.py#L236 . So in the common scenario of branching out with ParallelFor (as in my PR), it cannot be used.

thesuperzapper commented 2 years ago

@marrrcin I agree that only having a single ExitHandler is problematic; would allowing multiple also address your issue?

marrrcin commented 2 years ago

I don't think so - it's not a single job launched multiple times in parallel, it's a chain of consecutive jobs, some of which might be allowed to fail - you can take a look at the screenshot in my PR (https://github.com/kubeflow/pipelines/pull/7373). Even having multiple exit handlers would not cover that, IMHO.

jszendre commented 2 years ago

Exit handler ops cannot explicitly depend on any previous operations, so they can neither be parameterized by outputs of previous operations nor be guaranteed to run after particular steps.

My use case is running integration tests that are themselves Kubeflow pipelines, and I would like to be able to verify that a task fails without the integration test failing. Configuring that in the DSL would be a lot cleaner than handling it in application logic or directly in CI/CD.

techwithshadab commented 1 year ago

I also have a similar scenario; is there any workaround for this yet?

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 4 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

ckanaar commented 3 months ago

/reopen

google-oss-prow[bot] commented 3 months ago

@ckanaar: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubeflow/pipelines/issues/7330#issuecomment-2244738125):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

thesuperzapper commented 3 months ago

/reopen

google-oss-prow[bot] commented 3 months ago

@thesuperzapper: Reopened this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/7330#issuecomment-2249569313):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

HumairAK commented 3 weeks ago

/lifecycle frozen