kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Handle pod failures for all policies #195

Closed yoanisgil closed 2 years ago

yoanisgil commented 2 years ago

If a pod is in the `Failed` phase, we have to create a new one. Until now, the code assumed that the pod would be restarted through its pod-level `RestartPolicy`. That assumption does not hold when the pod fails for a system reason.

google-oss-prow[bot] commented 2 years ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

To complete the pull request process, please assign terrytangyuan after the PR has been reviewed. You can assign the PR to them by writing `/assign @terrytangyuan` in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/common/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

johnugeorge commented 2 years ago

Can you add a test case?

I see a related PR for exit code - https://github.com/kubeflow/common/pull/190

/cc @gaocegege @terrytangyuan

yoanisgil commented 2 years ago

No need to merge this one as https://github.com/kubeflow/common/pull/189 was already merged.