yarnabrina opened this issue 2 years ago
Making it configurable sounds fine to me.
A possible workaround for now, albeit hacky: you could put your training step into an exit handler task. In that case, it would run regardless of whether the upstream tasks succeed or not.
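As a rough sketch with the v1 SDK (the images and commands below are just placeholders, not a definitive implementation):

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="exit-handler-workaround")
def pipeline():
    # This op runs whether or not the ops wrapped by the ExitHandler succeed.
    final_training = dsl.ContainerOp(
        name="final-training",
        image="python:3.9",                 # placeholder image
        command=["echo", "train on whatever models succeeded"],
    )
    with dsl.ExitHandler(final_training):
        dsl.ContainerOp(
            name="train-candidate",
            image="python:3.9",             # placeholder image
            command=["sh", "-c", "exit 1"],  # simulate a failing training run
        )


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(pipeline, "pipeline.yaml")
```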
Hi @chensun, thanks for taking this. Would be very nice to have this.
Regarding the workaround you suggested, can you please give some more details? I did come across kfp.dsl.ExitHandler before, but there doesn't seem to be much documentation and/or examples online, and in my attempts I failed to apply an exit task to only a few specific operations, as it complained that it has to be global for the whole pipeline.
This is related to https://github.com/kubeflow/pipelines/issues/6749
A general workaround is to always return status code 0 from the pipeline steps and report the result through an output instead (e.g. a string OK or FAIL), which can then be chained with dsl.Condition to decide whether to continue the pipeline or not in a specific branch of the ParallelFor.
This workaround does not cover the overall pipeline status, so no warning ⚠️ signs / red statuses - everything is green ✅
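A rough sketch of that pattern against the v1 SDK (the train/evaluate functions here are placeholders, not real components):

```python
from kfp import dsl
from kfp.components import create_component_from_func


def train(config: str) -> str:
    """Always exits with code 0; success/failure is reported via the output string."""
    try:
        # ... real training for `config` would go here ...
        return "OK"
    except Exception:
        return "FAIL"


def evaluate(config: str):
    print(f"evaluating the model trained with {config}")


train_op = create_component_from_func(train)
evaluate_op = create_component_from_func(evaluate)


@dsl.pipeline(name="allow-failure-workaround")
def pipeline():
    with dsl.ParallelFor(["config-a", "config-b", "config-c"]) as config:
        status = train_op(config)
        # Continue this branch only if the training step reported success.
        with dsl.Condition(status.output == "OK"):
            evaluate_op(config)
```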
@yarnabrina , @chensun I've created a pull request implementing this behaviour - I would really appreciate your feedback on that https://github.com/kubeflow/pipelines/pull/7373.
I think most situations can be handled with a kfp.dsl.ExitHandler, which runs a single ContainerOp regardless of whether the ContainerOps it wraps succeed or fail.
But we might consider making functionality like ExitHandler more "implicit" by having an Airflow-style trigger_rule flag on the ContainerOp. (Proposed in issue: https://github.com/kubeflow/pipelines/issues/2372)
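Purely as an illustration of the idea (no such flag exists in the KFP SDK today; this only mirrors Airflow's naming):

```python
# Hypothetical API, not part of KFP -- shown only to illustrate the proposal:
compare = dsl.ContainerOp(
    name="compare-models",
    image="python:3.9",            # placeholder
    command=["echo", "compare whichever models finished"],
    # trigger_rule="all_done",     # imagined flag: run even if upstream tasks failed
)
```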
I don't fully understand why that covers "most situations". Are multiple exit handlers supported in KFP now? As far as I can see, no: https://github.com/kubeflow/pipelines/blob/e2687ce5c22455bbde0ababb3ad46588d5ed7939/sdk/python/kfp/compiler/compiler.py#L236 , so in the common scenario of branching out with ParallelFor (as in my PR) it cannot be used.
@marrrcin I agree that only having a single ExitHandler is problematic, would allowing multiple also address your issue?
I don't think so - it's not a single job launched multiple times in parallel, it's a chain of consecutive jobs of which some might be allowed to fail - you can take a look at the screenshot in my PR (https://github.com/kubeflow/pipelines/pull/7373). Even having multiple exit handlers would not cover that imho.
Exit handler ops cannot explicitly depend on any previous operations so they cannot be parameterized by outputs of previous operations or be guaranteed to run after previous steps.
My use case is running integration tests that are themselves Kubeflow pipelines, and I would like to be able to verify that a task fails without the integration test itself failing. Configuring that in the DSL would be a lot cleaner than handling it in application logic or directly in CI/CD.
I also have a similar scenario, is there any workaround for this yet?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@ckanaar: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@thesuperzapper: Reopened this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
Feature Area
/area backend
What feature would you like to see?
Allow failure of jobs - if an operation fails, do not fail the pipeline. Allow the pipeline to continue to the next stage, which may then fail if its prerequisites are not met.
What is the use case or pain point?
In machine learning pipelines, it is fairly common to train multiple models, or different configurations of the same model, possibly on a subset of the training data. After these are trained, they are usually compared using some metric, the best model is chosen, and that one is then trained on the entire training data to produce the final model.
If someone uses kfp.dsl.ParallelFor to run the different models, a failure in one of them causes the entire pipeline to fail, and the successful training runs of the other models are lost. But if the next stage, the one that compares the models using the metric, supports comparing only the available (i.e. successful) models, the pipeline failure just costs the time spent training those models, as one has to restart. If the requested feature is supported, the failed operations will display a warning (maybe ⚠️) and the pipeline will go on to the final training step. Then, depending on whether that step supports comparing a subset of all models, it will proceed as if the failed models were not there; if not, it will fail there. Very similar functionality is available in a few CI tools: for example, GitLab CI has allow_failure, Travis CI has allow_failures, etc.
Is there a workaround currently?
It is possible to do very broad top-level exception handling to suppress failures. However, in that case the fact that something failed is hidden in the logs and not displayed in the pipeline dashboard. In scheduled pipelines where no one really goes through the logs of all "successful" runs, these failures will go unnoticed.
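For illustration, that workaround boils down to something like the following (the component and its names are placeholders, not an actual implementation):

```python
from kfp.components import create_component_from_func


def train_model(config: str) -> str:
    """Swallows any exception so the step always shows up green in the dashboard."""
    try:
        # ... real training logic would go here ...
        return "trained"
    except Exception as exc:
        # The failure is only visible in the step logs, not in the pipeline UI.
        print(f"training failed for {config}: {exc}")
        return "failed"


train_model_op = create_component_from_func(train_model)
```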
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.