kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0
4.23k stars, 645 forks

Add exitCode to RemoveFailedPods strategy #1380

Closed: yuanchen8911 closed this issue 2 months ago

yuanchen8911 commented 2 months ago

Is your feature request related to a problem? Please describe.

The current RemoveFailedPods strategy accepts a reason parameter taken from a terminated container's status (state). In addition to reason, the exitCode field in a container's status, which records the exit status from the container's last termination, can provide additional, important information about why the container terminated.

A common use case: AI/ML training jobs often run pre-flight health checks in initContainers and take action based on the exitCode value when an initContainer fails, e.g., deleting the scheduled job pod via Descheduler.

Describe the solution you'd like

I'd like to propose adding a terminated container's exitCode as an additional parameter to the RemoveFailedPods strategy. The implementation should be straightforward: check state.terminated.exitCode for each entry in status.containerStatuses (and in status.initContainerStatuses for init containers). If this makes sense, I will submit an implementation.
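To make the proposal concrete, here is a minimal sketch of the check in Go. It is not the descheduler's actual code. The types are trimmed-down stand-ins for the Kubernetes core/v1 status types so the example runs on its own, and the helper name matchesExitCode, the includeInit flag, and the exit code 42 are all hypothetical.

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes core/v1 status types.
// Only the fields this sketch needs are modeled.
type ContainerStateTerminated struct {
	ExitCode int32
	Reason   string
}

type ContainerState struct {
	Terminated *ContainerStateTerminated // nil if the container has not terminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodStatus struct {
	InitContainerStatuses []ContainerStatus
	ContainerStatuses     []ContainerStatus
}

// matchesExitCode is a hypothetical helper: it reports whether any terminated
// container in the pod exited with one of the given codes. Init containers
// are considered only when includeInit is true.
func matchesExitCode(status PodStatus, exitCodes []int32, includeInit bool) bool {
	wanted := make(map[int32]struct{}, len(exitCodes))
	for _, code := range exitCodes {
		wanted[code] = struct{}{}
	}

	// Copy into a fresh slice so the caller's slices are never modified.
	statuses := append([]ContainerStatus{}, status.ContainerStatuses...)
	if includeInit {
		statuses = append(statuses, status.InitContainerStatuses...)
	}

	for _, cs := range statuses {
		if t := cs.State.Terminated; t != nil {
			if _, ok := wanted[t.ExitCode]; ok {
				return true
			}
		}
	}
	return false
}

func main() {
	// A pod whose pre-flight init container failed with the (made-up) code 42,
	// so the main container never started.
	status := PodStatus{
		InitContainerStatuses: []ContainerStatus{
			{Name: "preflight", State: ContainerState{
				Terminated: &ContainerStateTerminated{ExitCode: 42, Reason: "Error"},
			}},
		},
		ContainerStatuses: []ContainerStatus{
			{Name: "trainer", State: ContainerState{}},
		},
	}

	fmt.Println(matchesExitCode(status, []int32{42}, true))  // true
	fmt.Println(matchesExitCode(status, []int32{42}, false)) // false
}
```

In the initContainer use case above, the pod matches only when init container statuses are included in the check, so an actual implementation would likely need an explicit way to opt in to inspecting them.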

Describe alternatives you've considered

What version of descheduler are you using?

descheduler version: the development version in the main branch

Additional context

yuanchen8911 commented 2 months ago

Submitted a PR with the implementation: https://github.com/kubernetes-sigs/descheduler/pull/1381.