kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0
4.23k stars 645 forks source link

Add an extra parameter ExitCode to RemoveFailedPods #1381

Closed yuanchen8911 closed 2 months ago

yuanchen8911 commented 2 months ago

The current RemoveFailedPods strategy includes a parameter reason from a terminated container's status (state). In addition to reason, the field exitCode in a container's status, which describes the exit status from the last termination of a container, can provide additional and useful information about a container's termination.

A common use case is AI/ML training jobs often inject/run pre-flight health checks in initContainers and take actions according to the exitCode value when an initContainer fails, e.g., deleting the scheduled job pod via Descheduler.

This PR adds a terminated container's exitCode as an additional parameter to the RemoveFailedPods strategy.

Fix https://github.com/kubernetes-sigs/descheduler/issues/1380

yuanchen8911 commented 2 months ago

/cc @Huang-Wei

jklaw90 commented 2 months ago

/ok-to-test

jklaw90 commented 2 months ago

@yuanchen8911 could you update the readme

yuanchen8911 commented 2 months ago

@yuanchen8911 could you update the readme

Updated README.md.

yuanchen8911 commented 2 months ago

@jklaw90, may I have it reviewed? Thanks!

jklaw90 commented 2 months ago

/lgtm

Thanks for your pr @yuanchen8911

yuanchen8911 commented 2 months ago

/lgtm

Thanks for your pr @yuanchen8911

Would you approve it as well? Thanks!

yuanchen8911 commented 2 months ago

@ingvagabund, thanks for reviewing the PR. I've made the suggested changes to address your comments. Can you take another look please?

jklaw90 commented 2 months ago

/lgtm

a7i commented 2 months ago

I believe this would be the first time that we're adding a field to v1alpha2 (but not v1alpha1). I think that's ok, given that it's being deprecated and removed in 1.31.

Just wanted to note that.

a7i commented 2 months ago

/approve as I believe all of @ingvagabund comments were addressed

/lgtm as I believe @jklaw90 already did

Thank you for your contribution @yuanchen8911 and being patient with us

k8s-ci-robot commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a7i

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubernetes-sigs/descheduler/blob/master/OWNERS)~~ [a7i] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
a7i commented 2 months ago

This will be included as part of v0.30

soon 🤞🏼 ™️

a7i commented 2 months ago

/lgtm

yuanchen8911 commented 2 months ago

/approve as I believe all of @ingvagabund comments were addressed

/lgtm as I believe @jklaw90 already did

Thank you for your contribution @yuanchen8911 and being patient with us

Thank you all for reviewing the PR!