Open terrytangyuan opened 4 years ago
Issue-Label Bot is automatically applying the labels:
| Label | Probability |
|---|---|
| kind/feature | 0.77 |
| area/operator | 0.85 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
Having success/failure policies would be great: they would make it easier for different frameworks to handle errors, and they would help make the reconciler logic extensible.
With fault-tolerant and elastic distributed training spreading to more frameworks, a universal definition of failure and success for a distributed training job would help developers clarify the logic for handling pods that have failed or recently joined.
We recently added `SuccessPolicy` in tf-operator (https://github.com/kubeflow/tf-operator/pull/1165) and are considering adding `FailurePolicy` to handle the case of failure (https://github.com/kubeflow/tf-operator/issues/1170). Once it's mature, and if we see a common pattern in other operators, we should consider moving it to kubeflow/common.

cc @gaocegege @Jeffwan @johnugeorge @ChanYiLin @pingsutw
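
To make the discussion concrete, here is a rough sketch of how a framework-agnostic success/failure policy pair might feed a reconciler's status decision. Everything in it is an assumption for illustration: the enum members (`ANY_WORKER`, `ALL_WORKERS`), the `PodPhase` values, and the `job_status` helper are hypothetical and are not the tf-operator or kubeflow/common API.

```python
from enum import Enum
from typing import Sequence


class PodPhase(Enum):
    """Simplified pod phases; a real operator would read these from pod status."""
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


class SuccessPolicy(Enum):
    """Hypothetical policy values, not the actual tf-operator constants."""
    ANY_WORKER = "AnyWorker"    # job succeeds once any worker succeeds
    ALL_WORKERS = "AllWorkers"  # job succeeds only when every worker succeeds


class FailurePolicy(Enum):
    """Hypothetical policy values mirroring SuccessPolicy."""
    ANY_WORKER = "AnyWorker"    # job fails once any worker fails
    ALL_WORKERS = "AllWorkers"  # job fails only when every worker has failed


def job_status(
    phases: Sequence[PodPhase],
    success: SuccessPolicy,
    failure: FailurePolicy,
) -> str:
    """Return "Failed", "Succeeded", or "Running" for the given worker phases.

    Failure is checked first, so a job that matches both policies is
    reported as failed. That ordering is a design assumption of this sketch.
    """
    if not phases:
        # No workers yet (for example, an elastic job still scaling up).
        return "Running"

    failed = [p is PodPhase.FAILED for p in phases]
    succeeded = [p is PodPhase.SUCCEEDED for p in phases]

    fail_check = any if failure is FailurePolicy.ANY_WORKER else all
    if fail_check(failed):
        return "Failed"

    success_check = any if success is SuccessPolicy.ANY_WORKER else all
    if success_check(succeeded):
        return "Succeeded"

    return "Running"
```

With a shape like this, an elastic job could pair `SuccessPolicy.ANY_WORKER` with `FailurePolicy.ALL_WORKERS` so that a single lost pod does not fail the job, while a strict job could use the opposite pairing. A pod that has just joined would simply show up as `RUNNING`.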