kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

Consider implementing pod (anti)affinity checks in NodeFit #1279

Closed: logyball closed this issue 3 months ago

logyball commented 8 months ago

Is your feature request related to a problem? Please describe.
Hello, we have been running into a situation that was somewhat surprising until I dug into the code and learned that NodeFit does not take all scheduler predicates into account, notably pod (anti)affinity. At our organization, we've been bitten on multiple occasions by combining the descheduler HighNodeUtilization profile with GKE's optimize-utilization autoscaling profile, because we assumed that NodeFit and the kube-scheduler would agree.

The basic situation is:

Describe the solution you'd like
I suggest that we look into implementing pod (anti)affinity checks in the NodeFit calculation, and I would be happy to look into contributing it. It probably adds a decent amount of complexity, but it would likely make the descheduler's decisions more closely resemble those of the kube-scheduler. What do you think?

Describe alternatives you've considered

What version of descheduler are you using?

descheduler version: v0.27.x
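
To make the problem described above concrete, here is a hypothetical required anti-affinity term (illustrative only, not taken from our clusters). The kube-scheduler will refuse to place a pod carrying this term on any node that already runs a pod matching the selector, but NodeFit today does not flag that conflict:

// Assumes:
//   v1     "k8s.io/api/core/v1"
//   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Hypothetical example: this pod must not share a node with any pod labeled app=web.
antiAffinity := &v1.Affinity{
    PodAntiAffinity: &v1.PodAntiAffinity{
        RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{
            {
                LabelSelector: &metav1.LabelSelector{
                    MatchLabels: map[string]string{"app": "web"},
                },
                TopologyKey: "kubernetes.io/hostname",
            },
        },
    },
}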

logyball commented 8 months ago

Also, as a bit of follow-up: after digging into how the scheduler implements these checks, the code is not easy for me to parse. I'd love some guidance here if this is an avenue we'd like to pursue. It looks like in the past (last seen in v1.9.11 of Kubernetes) there was a predicate checker that we could probably have used to good effect; however, the ecosystem now looks a bit more complex.

gyuvaraj10 commented 7 months ago

We have also run into this issue; in our case the PodAntiAffinity is a preferred (soft) rule. We noticed that after the pod eviction the node still remained, and the pod was scheduled back onto the same node because the cluster autoscaler did not scale the node down immediately.

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

logyball commented 4 months ago

/remove-lifecycle stale

nikimanoledaki commented 4 months ago

Hi! 👋 I am interested in picking up this issue and contributing a PR to add this feature.

In the scenario outlined above by @logyball, NodeFit() runs and determines that Pod X can be scheduled on Node B when it shouldn't be. The happy path would be for NodeFit() to determine that Pod X cannot be scheduled on Node B.

@a7i mentioned here that this could be implemented as a patch.

The solution would be to add a predicate to NodeFit() so that it considers pod anti-affinity. Something like this:

func NodeFit(nodeIndexer podutil.GetPodsAssignedToNodeFunc, pod *v1.Pod, node *v1.Node) []error {

...
    // Check whether the pod matches a pod anti-affinity rule of another pod on the node.
    // utils.PodMatchesAntiAffinityRule is a proposed helper (to be added to pkg/utils/predicates.go);
    // it returns true when the pod matches such a rule, i.e. when scheduling it there would violate it.
    if match, err := utils.PodMatchesAntiAffinityRule(pod, node); err != nil {
        errors = append(errors, err)
    } else if match {
        errors = append(errors, fmt.Errorf("pod violates pod anti-affinity rule of another pod on the node"))
    }

    return errors
}

Implementation detail: Refactor the plugin RemovePodsViolatingInterPodAntiAffinity, which already contains logic for fetching and comparing the pod anti-affinity rules of pods in checkPodsWithAntiAffinityExist. This logic could be moved to pkg/utils/predicates.go and reused by NodeFit(); a rough sketch of what the extracted helper could look like follows below.
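
For concreteness, a minimal sketch of such a helper. The name podMatchesAntiAffinityRule, its signature, and the small contains helper are all hypothetical; it only evaluates required (hard) anti-affinity terms and assumes the topology domain is the node itself (kubernetes.io/hostname), whereas the real refactor of checkPodsWithAntiAffinityExist would need to handle topology keys properly:

// Assumes:
//   v1     "k8s.io/api/core/v1"
//   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
//   labels "k8s.io/apimachinery/pkg/labels"

// podMatchesAntiAffinityRule reports whether candidate matches a required
// pod anti-affinity term declared by any pod already running on the node.
func podMatchesAntiAffinityRule(candidate *v1.Pod, podsOnNode []*v1.Pod) (bool, error) {
    for _, existing := range podsOnNode {
        if existing.Spec.Affinity == nil || existing.Spec.Affinity.PodAntiAffinity == nil {
            continue
        }
        for _, term := range existing.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
            // An empty namespace list means "the namespace of the pod declaring the term".
            namespaces := term.Namespaces
            if len(namespaces) == 0 {
                namespaces = []string{existing.Namespace}
            }
            if !contains(namespaces, candidate.Namespace) {
                continue
            }
            selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
            if err != nil {
                return false, err
            }
            if selector.Matches(labels.Set(candidate.Labels)) {
                // The candidate matches an anti-affinity rule of a pod on this node,
                // so scheduling it there would violate that rule.
                return true, nil
            }
        }
    }
    return false, nil
}

func contains(list []string, s string) bool {
    for _, item := range list {
        if item == s {
            return true
        }
    }
    return false
}

NodeFit() would then fetch the pods assigned to the node via the nodeIndexer and pass them to this helper (or the helper could take the indexer directly, as in the snippet above).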

Happy to assign myself to this issue and get started when I get the green light from SIG Scheduling/descheduler maintainers :)

logyball commented 4 months ago

/kind feature

nikimanoledaki commented 4 months ago

/assign