kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

Graceful Descheduling instead of Eviction #1558

Open B1F030 opened 1 week ago

B1F030 commented 1 week ago

Is your feature request related to a problem? Please describe.

Is there a graceful alternative to eviction, such as a rolling update? For now, descheduler evicts the pods which will cause a service interruption. Can I customize the config so that it restarts the pods instead of evicting them? For example:

kubectl rollout restart deployment/abc
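For reference, kubectl rollout restart works by changing the kubectl.kubernetes.io/restartedAt annotation on the pod template, which makes the Deployment controller start an ordinary rolling update. Below is a minimal sketch of the equivalent patch body. The function name build_restart_patch is hypothetical, and the timestamp format is an approximation: any changed value triggers a rollout.

```python
from datetime import datetime, timezone


def build_restart_patch(now=None):
    """Build a patch body that changes the pod template annotation.

    Changing any pod template field makes the Deployment controller roll
    out new pods; kubectl rollout restart uses this annotation for that.
    """
    now = now or datetime.now(timezone.utc)
    # Timestamp format is an approximation of what kubectl writes.
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": stamp,
                    }
                }
            }
        }
    }
```

Serialized with json.dumps, the result could be passed to kubectl patch deployment/abc -p '...'. Whether descheduler could apply such a patch itself is exactly what this request asks; nothing in this thread says it can today.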

Describe the solution you'd like

Provide an optional config setting to restart pods instead of evicting them, so that the service is not interrupted.

Describe alternatives you've considered

Alternatively, create new pods before evicting the old ones; once the new pods are ready, the old pods can be deleted.
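This create-before-delete ordering is what a Deployment rolling update gives when maxSurge is 1 and maxUnavailable is 0. The sketch below only illustrates the arithmetic behind those two fields for absolute (non-percentage) values; rollout_bounds is a hypothetical helper, not part of descheduler or Kubernetes.

```python
def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return the pod-count limits a rolling update must stay within.

    Only absolute integer values are handled; percentages are out of scope.
    """
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        # Fewest pods that must stay available during the rollout.
        "min_available": replicas - max_unavailable,
        # Most pods (old + new) that may exist at the same time.
        "max_total": replicas + max_surge,
    }


# Single replica, surge first: the old pod stays until the new one is ready.
surge_first = rollout_bounds(replicas=1, max_surge=1, max_unavailable=0)

# Single replica, delete first: availability may drop to zero.
delete_first = rollout_bounds(replicas=1, max_surge=0, max_unavailable=1)
```

With one replica, the surge-first setting keeps one pod available at all times and temporarily allows two pods, which is why a rolling update does not interrupt the service while a plain eviction does.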

What version of descheduler are you using?

descheduler version: v0.31.0

Additional context

a7i commented 1 week ago

For now, descheduler evicts the pods which will cause a service interruption.

Would you please elaborate on why that is, given that it uses the eviction API? Do you define a PodDisruptionBudget?
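As background to the PodDisruptionBudget question: the eviction API only lets an eviction through while the budget still allows a disruption, and otherwise rejects it with 429 Too Many Requests. A minimal sketch of that check for a minAvailable budget follows; the helper names are hypothetical and the real controller tracks more state than this.

```python
def disruptions_allowed(current_healthy, min_available):
    """Number of pods that may be disrupted right now under minAvailable."""
    return max(0, current_healthy - min_available)


def eviction_allowed(current_healthy, min_available):
    """True if an eviction request would be let through by the budget.

    When this is False the eviction API is expected to answer with
    429 Too Many Requests and the pod keeps running.
    """
    return disruptions_allowed(current_healthy, min_available) > 0
```

For example, eviction_allowed(3, 2) is True, while eviction_allowed(1, 1) is False: with a single healthy replica and minAvailable of 1, the budget blocks every eviction rather than making it graceful.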

B1F030 commented 1 week ago

Would you please elaborate on why that is, given that it uses the eviction API? Do you define a PodDisruptionBudget?

Sure, I'll provide more details about our scenario:

We have two Kubernetes GPU node pools: A, billed monthly (one node), and B, elastic (zero nodes, but with an autoscaler). A deployment with one replica that uses a GPU runs on A and takes up all the resources of that node. When we perform a rolling update, the new pod does not fit on A, so it triggers the autoscaler and is scheduled to B, which leaves A with low usage.

Since the monthly node is cheaper, we want the pod to be rescheduled back to A, so that the elastic pool can scale back down to zero.

In conclusion, this workload takes up almost all of the resources on one node, and there is only one replica, so we can't use a PDB (using multiple replicas would increase cost). We expect that during a rolling update the pod will be scheduled to the elastic node. After the rolling update is done, we want to trigger a reschedule that evicts the workload back to the monthly node (this depends on preferredDuringSchedulingIgnoredDuringExecution).
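The soft preference for the monthly pool could look roughly like the affinity fragment built below. This is only a sketch: the label key example.com/pool and the value monthly are hypothetical, since the thread does not say how the node pools are labeled.

```python
def preferred_pool_affinity(label_key, label_value, weight=100):
    """Build a pod spec 'affinity' fragment that prefers matching nodes.

    The preference is soft: the scheduler still places the pod elsewhere
    (for example on the elastic pool) when no preferred node has room.
    """
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "In",
                                "values": [label_value],
                            }
                        ]
                    },
                }
            ]
        }
    }


# Hypothetical label; replace with whatever identifies the monthly pool.
affinity = preferred_pool_affinity("example.com/pool", "monthly")
```

Because the rule is "IgnoredDuringExecution", a pod that landed on the elastic node stays there until something removes it, which is why a separate reschedule step is needed.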

We also don't want the service to be interrupted, so I'm looking for a graceful way to reschedule: create the new pod before evicting the old one, just like a rolling update or kubectl rollout restart deployment.