kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

Consider implementing pod (anti)affinity checks in NodeFit #1279

Closed: logyball closed this issue 3 months ago

logyball commented 8 months ago

Is your feature request related to a problem? Please describe.
Hello, we have been running into a situation that was somewhat surprising until I dug into the code and learned that NodeFit does not take all scheduler predicates into account, notably pod (anti)affinity. At our organization, we've been bitten on multiple occasions by combining the descheduler HighNodeUtilization profile with GKE's optimize-utilization autoscaling profile, because we assumed that NodeFit and the kube-scheduler would agree.

The basic situation is:

Describe the solution you'd like
I suggest that we look into implementing pod (anti)affinity checks in the NodeFit calculation, and I would be happy to look into contributing it. It probably adds a decent amount of complexity, but it would likely make the descheduler's decisions more closely resemble those of the kube-scheduler. What do you think?

Describe alternatives you've considered

What version of descheduler are you using?

descheduler version: v0.27.x
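
To make the problem described above concrete, here is a hypothetical required anti-affinity term (illustrative only, not taken from our clusters). The kube-scheduler will refuse to place a pod carrying this term on any node that already runs a pod matching the selector, but NodeFit today does not flag that conflict:

// Assumes:
//   v1     "k8s.io/api/core/v1"
//   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Hypothetical example: this pod must not share a node with any pod labeled app=web.
antiAffinity := &v1.Affinity{
    PodAntiAffinity: &v1.PodAntiAffinity{
        RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{
            {
                LabelSelector: &metav1.LabelSelector{
                    MatchLabels: map[string]string{"app": "web"},
                },
                TopologyKey: "kubernetes.io/hostname",
            },
        },
    },
}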

logyball commented 8 months ago

Also, as a bit of follow-up: after digging into how the scheduler implements these checks, the code is not easy for me to parse. I'd love some guidance here if this is an avenue we'd like to pursue. It looks like in the past (last seen in v1.9.11 of Kubernetes) there was a predicate checker that we could probably have used to good effect; however, the ecosystem now looks a bit more complex.

gyuvaraj10 commented 7 months ago

We have also run into this issue; in our case the PodAntiAffinity is a preferred (soft) rule. We noticed that after the pod eviction the node still remained, and the pod was scheduled back onto the same node because the cluster autoscaler did not scale the node down immediately.

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

logyball commented 4 months ago

/remove-lifecycle stale

nikimanoledaki commented 4 months ago

Hi! 👋 I am interested in picking up this issue and contributing a PR to add this feature.

In the scenario outlined above by @logyball, NodeFit() runs and determines that Pod X can be scheduled on Node B when it shouldn't be. The happy path would be for NodeFit() to determine that Pod X cannot be scheduled on Node B.

@a7i mentioned here that this could be implemented as a patch.

The solution would be to add a predicate to NodeFit() so that it considers pod anti-affinity. Something like this:

func NodeFit(nodeIndexer podutil.GetPodsAssignedToNodeFunc, pod *v1.Pod, node *v1.Node) []error {

...
    // Check whether the pod matches a pod anti-affinity rule of another pod on the node.
    // utils.PodMatchesAntiAffinityRule is a proposed helper (to be added to pkg/utils/predicates.go);
    // it returns true when the pod matches such a rule, i.e. when scheduling it there would violate it.
    if match, err := utils.PodMatchesAntiAffinityRule(pod, node); err != nil {
        errors = append(errors, err)
    } else if match {
        errors = append(errors, fmt.Errorf("pod violates pod anti-affinity rule of another pod on the node"))
    }

    return errors
}

Implementation detail: Refactor the plugin RemovePodsViolatingInterPodAntiAffinity, which already contains logic for fetching and comparing the pod anti-affinity rules of pods in checkPodsWithAntiAffinityExist. This logic could be moved to pkg/utils/predicates.go and reused by NodeFit(); a rough sketch of what the extracted helper could look like follows below.
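
For concreteness, a minimal sketch of such a helper. The name podMatchesAntiAffinityRule, its signature, and the small contains helper are all hypothetical; it only evaluates required (hard) anti-affinity terms and assumes the topology domain is the node itself (kubernetes.io/hostname), whereas the real refactor of checkPodsWithAntiAffinityExist would need to handle topology keys properly:

// Assumes:
//   v1     "k8s.io/api/core/v1"
//   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
//   labels "k8s.io/apimachinery/pkg/labels"

// podMatchesAntiAffinityRule reports whether candidate matches a required
// pod anti-affinity term declared by any pod already running on the node.
func podMatchesAntiAffinityRule(candidate *v1.Pod, podsOnNode []*v1.Pod) (bool, error) {
    for _, existing := range podsOnNode {
        if existing.Spec.Affinity == nil || existing.Spec.Affinity.PodAntiAffinity == nil {
            continue
        }
        for _, term := range existing.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
            // An empty namespace list means "the namespace of the pod declaring the term".
            namespaces := term.Namespaces
            if len(namespaces) == 0 {
                namespaces = []string{existing.Namespace}
            }
            if !contains(namespaces, candidate.Namespace) {
                continue
            }
            selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
            if err != nil {
                return false, err
            }
            if selector.Matches(labels.Set(candidate.Labels)) {
                // The candidate matches an anti-affinity rule of a pod on this node,
                // so scheduling it there would violate that rule.
                return true, nil
            }
        }
    }
    return false, nil
}

func contains(list []string, s string) bool {
    for _, item := range list {
        if item == s {
            return true
        }
    }
    return false
}

NodeFit() would then fetch the pods assigned to the node via the nodeIndexer and pass them to this helper (or the helper could take the indexer directly, as in the snippet above).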

Happy to assign myself to this issue and get started when I get the green light from SIG Scheduling/descheduler maintainers :)

logyball commented 4 months ago

/kind feature

nikimanoledaki commented 4 months ago

/assign