kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Consider supporting SuccessPolicy and FailurePolicy #99

Open terrytangyuan opened 4 years ago

terrytangyuan commented 4 years ago

We recently added SuccessPolicy in tf-operator (https://github.com/kubeflow/tf-operator/pull/1165) and are considering adding FailurePolicy to handle failures (https://github.com/kubeflow/tf-operator/issues/1170). Once these policies mature, and if we see a common pattern across other operators, we should consider moving them into kubeflow/common.

cc @gaocegege @Jeffwan @johnugeorge @ChanYiLin @pingsutw

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.77
area/operator 0.85

kf-label-bot-dev[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.77

Jeffwan commented 4 years ago

Having success/failure policies would be great: they would make it easier for different frameworks to handle errors, and they would help make the reconciler logic extensible.

zw0610 commented 4 years ago

With fault-tolerant and elastic distributed training spreading to more frameworks, a universal definition of failure and success for a distributed training job would help developers clarify the logic for handling pods that have failed or recently joined.
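
To make the idea concrete, here is a rough sketch of what a shared definition might look like. Everything in it is an assumption for discussion only: the type names, the policy values (`Chief`, `AllWorkers`, `AnyFailed`, `AllFailed`), the `ReplicaCounts` struct, and the `JobSucceeded`/`JobFailed` helpers are hypothetical and are not the existing tf-operator or kubeflow/common API.

```go
package main

import "fmt"

// SuccessPolicy and FailurePolicy are hypothetical types for this sketch;
// they are not taken from tf-operator or kubeflow/common.
type SuccessPolicy string
type FailurePolicy string

const (
	// Hypothetical: the job succeeds once the chief replica succeeds.
	SuccessPolicyChief SuccessPolicy = "Chief"
	// Hypothetical: the job succeeds only when every replica succeeds.
	SuccessPolicyAllWorkers SuccessPolicy = "AllWorkers"

	// Hypothetical: the job fails as soon as any replica fails.
	FailurePolicyAnyFailed FailurePolicy = "AnyFailed"
	// Hypothetical: the job fails only when every replica has failed,
	// which leaves room for elastic jobs to lose some pods.
	FailurePolicyAllFailed FailurePolicy = "AllFailed"
)

// ReplicaCounts is an assumed summary of pod states that a reconciler
// could compute before evaluating the policies.
type ReplicaCounts struct {
	Total          int
	Succeeded      int
	Failed         int
	ChiefSucceeded bool
}

// JobSucceeded reports whether the job counts as succeeded under policy p.
// Unknown policies fall back to the chief-based rule.
func JobSucceeded(p SuccessPolicy, c ReplicaCounts) bool {
	switch p {
	case SuccessPolicyAllWorkers:
		return c.Total > 0 && c.Succeeded == c.Total
	default:
		return c.ChiefSucceeded
	}
}

// JobFailed reports whether the job counts as failed under policy p.
// Unknown policies fall back to the any-failure rule.
func JobFailed(p FailurePolicy, c ReplicaCounts) bool {
	switch p {
	case FailurePolicyAllFailed:
		return c.Total > 0 && c.Failed == c.Total
	default:
		return c.Failed > 0
	}
}

func main() {
	c := ReplicaCounts{Total: 4, Succeeded: 3, Failed: 1, ChiefSucceeded: true}
	fmt.Println("succeeded (Chief):", JobSucceeded(SuccessPolicyChief, c))
	fmt.Println("succeeded (AllWorkers):", JobSucceeded(SuccessPolicyAllWorkers, c))
	fmt.Println("failed (AnyFailed):", JobFailed(FailurePolicyAnyFailed, c))
	fmt.Println("failed (AllFailed):", JobFailed(FailurePolicyAllFailed, c))
}
```

Keeping the two policies as plain functions over replica counts would let each operator's reconciler call the same checks while still choosing its own defaults.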