kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Limit the number of restarts under ExitCode restartPolicy #167

Closed: goyalankit closed this issue 2 years ago

goyalankit commented 2 years ago

In the current implementation, runPolicy.BackoffLimit applies only to the OnFailure and Always restart policies. Is there a reason it is not supported for the ExitCode restart policy? In the case of an OOM, the job often exits with exit code 137, which is rightly treated as a retryable error. However, the job then keeps restarting indefinitely, because the ExitCode policy is not covered by BackoffLimit. One option would be to honor BackoffLimit when it is set and otherwise keep retrying indefinitely, as today.
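A minimal Go sketch of the proposed behavior, for illustration only. The helper names `isRetryableExitCode` and `shouldRestartOnExitCode` are hypothetical and are not taken from the kubeflow/common code base. The rule that exit codes of 128 and above count as retryable is an assumption of this sketch, chosen so that 137 (128 + SIGKILL, typical for OOM kills) qualifies.

```go
package main

import "fmt"

// isRetryableExitCode is a hypothetical helper. Assumption for this sketch:
// exit codes >= 128 (process killed by a signal, e.g. 137 for an OOM kill)
// are retryable, and lower non-zero codes are permanent failures.
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128
}

// shouldRestartOnExitCode is a hypothetical helper sketching the proposal:
// under the ExitCode restart policy, honor the backoff limit when it is set,
// and keep retrying indefinitely when it is not (nil).
// restarts is the number of restarts that have already happened.
func shouldRestartOnExitCode(exitCode, restarts int32, backoffLimit *int32) bool {
	if !isRetryableExitCode(exitCode) {
		// Success or permanent error: never restart.
		return false
	}
	if backoffLimit == nil {
		// No limit configured: retry without bound, as before.
		return true
	}
	// Limit configured: restart only while the budget is not exhausted.
	return restarts < *backoffLimit
}

func main() {
	limit := int32(3)
	fmt.Println(shouldRestartOnExitCode(137, 2, &limit)) // true
	fmt.Println(shouldRestartOnExitCode(137, 3, &limit)) // false
	fmt.Println(shouldRestartOnExitCode(137, 100, nil))  // true
	fmt.Println(shouldRestartOnExitCode(1, 0, &limit))   // false
}
```

Using a pointer for the limit keeps "not set" distinct from "set to 0", so existing jobs without a limit would keep their current behavior.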

I am happy to submit a PR if you think this is a reasonable change.

gaocegege commented 2 years ago

Hi @goyalankit. Thanks for the issue.

I think it makes sense. WDYT, @kubeflow/wg-training-leads?

Jeffwan commented 2 years ago

I think this is a reasonable improvement. @goyalankit, feel free to cut a PR and assign it to us for review.

johnugeorge commented 2 years ago

Thanks for this

Jeffwan commented 2 years ago

Let's keep it open to track cherry-pick

/reopen

google-oss-robot commented 2 years ago

@Jeffwan: Reopened this issue.

In response to [this](https://github.com/kubeflow/common/issues/167#issuecomment-945877853):

>Let's keep it open to track cherry-pick
>
>/reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.