kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

fix the issue that the pod anti-affinity filtering rules are not taking effect #1395

Closed fanhaouu closed 4 weeks ago

fanhaouu commented 1 month ago

Currently, the "podMatchesInterPodAntiAffinity" method has an issue. It is unable to check if the pod matches the anti-affinity rule of another pod that is already on the given node.

The root cause is that the "CheckPodsWithAntiAffinityExist" method in "predicates.go" can only check if pods with AntiAffinity issues exist on the same node.

This PR resolves that.

k8s-ci-robot commented 1 month ago

Hi @fanhaouu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
ingvagabund commented 1 month ago

@fanhaouu would you please share an example of this case? With nodeFit=true the pod anti-affinity check is completely skipped.

> Currently, the "podMatchesInterPodAntiAffinity" method has an issue. It is unable to check if the pod matches the anti-affinity rule of another pod that is already on the given node. The root cause is that the "CheckPodsWithAntiAffinityExist" method in "predicates.go" can only check if pods with AntiAffinity issues exist on the same node.

Can you rephrase the description in terms of "a pod P1 on a node N1", "a pod P1 wrt. P2", etc.? The description is still ambiguous. It would help to start the description with "There are two nodes N1 and N2. Both pods P1 and P2 are scheduled. P1 to node N1, P2 to node N2. There's another pod P3 scheduled to node N? that violates anti-affinity of P?. The current algorithm does THIS for pod P? instead of THIS. etc.".

fanhaouu commented 1 month ago

> @fanhaouu would you please share an example of this case? With nodeFit=true the pod anti-affinity check is completely skipped.
>
> > Currently, the "podMatchesInterPodAntiAffinity" method has an issue. It is unable to check if the pod matches the anti-affinity rule of another pod that is already on the given node. The root cause is that the "CheckPodsWithAntiAffinityExist" method in "predicates.go" can only check if pods with AntiAffinity issues exist on the same node.
>
> Can you rephrase the description in terms of "a pod P1 on a node N1", "a pod P1 wrt. P2", etc.? The description is still ambiguous. It would help to start the description with "There are two nodes N1 and N2. Both pods P1 and P2 are scheduled. P1 to node N1, P2 to node N2. There's another pod P3 scheduled to node N? that violates anti-affinity of P?. The current algorithm does THIS for pod P? instead of THIS. etc.".

Sure, let me give you an example.

With the current logic of podMatchesInterPodAntiAffinity, when determining whether pod1 can be scheduled to node2, the method returns match as false and error as nil. In reality, the match value should be true, meaning that pod1 cannot be scheduled to node2.

Pod1:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  nodeName: node1
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
```

Node1:

```yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    topology.kubernetes.io/zone: xx-1
  name: node1
```

Pod2:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    security: S2
  name: pod2
spec:
  nodeName: node2
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
```

Node2:

```yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    topology.kubernetes.io/zone: xx-1
  name: node2
```
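To make the expected result concrete, here is a small standalone sketch (this is not the descheduler's code; the variable layout is illustrative only) that re-evaluates pod1's anti-affinity term against pod2 on node2. It prints true: pod2 matches the security=S2 selector, and the candidate node carries the same topology.kubernetes.io/zone value as the node pod2 runs on, so the term matches on node2.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// pod1's podAffinityTerm from the YAML above.
	selectorSpec := &metav1.LabelSelector{
		MatchExpressions: []metav1.LabelSelectorRequirement{
			{Key: "security", Operator: metav1.LabelSelectorOpIn, Values: []string{"S2"}},
		},
	}
	topologyKey := "topology.kubernetes.io/zone"

	pod2Labels := labels.Set{"security": "S2"}
	candidateNodeLabels := labels.Set{topologyKey: "xx-1"} // node2, the node pod1 is evaluated against
	pod2NodeLabels := candidateNodeLabels                  // pod2 already runs on node2

	selector, err := metav1.LabelSelectorAsSelector(selectorSpec)
	if err != nil {
		panic(err)
	}

	// The term matches on node2 when an existing pod matches the selector and
	// its node shares the topology-key value with the candidate node.
	match := selector.Matches(pod2Labels) &&
		candidateNodeLabels.Has(topologyKey) &&
		candidateNodeLabels.Get(topologyKey) == pod2NodeLabels.Get(topologyKey)

	fmt.Println("pod1's anti-affinity term matches on node2:", match) // true
}
```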
fanhaouu commented 1 month ago

@ingvagabund is the example I provided clear enough?

ingvagabund commented 1 month ago

> With the current logic of podMatchesInterPodAntiAffinity, when determining whether pod1 can be scheduled to node2, the method returns match as false and error as nil. In reality, the match value should be true, meaning that pod1 cannot be scheduled to node2.

So it's the nodefit part here that's in question? I.e. when pod1 is to be evicted you'd expect the descheduler to check whether there's a node where pod1 (resp. its clone) can be scheduled to?

> In reality, the match value should be true, meaning that pod1 cannot be scheduled to node2.

If the match is true, the strategy is expected to evict the pod, independent of whether the pod can be scheduled to another node or not. If you want to implement the nodefit part for the strategy, you can utilize sigs.k8s.io/descheduler/pkg/descheduler/node.PodFitsAnyOtherNode as e.g. in https://github.com/kubernetes-sigs/descheduler/blob/bb5930eb21605879104e02d9da62bd96180d9492/pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go#L364.
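For reference, a minimal sketch of how a strategy could add that nodefit gate in the style of the linked topologyspreadconstraint code; the helper name below is hypothetical and the exact PodFitsAnyOtherNode signature is an assumption, not verified against the linked file:

```go
package example

import (
	v1 "k8s.io/api/core/v1"

	nodeutil "sigs.k8s.io/descheduler/pkg/descheduler/node"
	podutil "sigs.k8s.io/descheduler/pkg/descheduler/pod"
)

// shouldEvict is a hypothetical helper illustrating the pattern: only consider
// the pod for eviction when a clone of it would fit on at least one other node.
// Assumed signature: PodFitsAnyOtherNode(nodeIndexer, pod, nodes) bool.
func shouldEvict(nodeIndexer podutil.GetPodsAssignedToNodeFunc, pod *v1.Pod, nodes []*v1.Node) bool {
	return nodeutil.PodFitsAnyOtherNode(nodeIndexer, pod, nodes)
}
```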

ingvagabund commented 1 month ago

Is this PR expected to resolve https://github.com/kubernetes-sigs/descheduler/issues/1370?

ingvagabund commented 1 month ago

I see now what's going on:

> Here, we evaluate whether both Pods are present on the Node. It only makes sense to check this for the plugin where this function originated from, RemovePodsViolatingInterPodAntiAffinity. That plugin checks if both Pods are on the same Node and whether to evict one of them, if they have matching antiaffinities. It does not make sense to do this when we’re evaluating if a Pod could fit on another Node as part of NodeFit.

You are addressing this part: "It does not make sense to do this when we’re evaluating if a Pod could fit on another Node as part of NodeFit."

fanhaouu commented 1 month ago

> Is this PR expected to resolve #1370?

Yes, it can fix #1370.

ingvagabund commented 1 month ago

> It should be assessing whether a Pod fits on another Node based on whether another Pod on this other Node has matching antiaffinities, but it returns early, due to the following issue.

Let's have P1 scheduled to N1 and P2 scheduled to N2. Pods P1 and P2 have matching anti-affinities. The check here is NodeFit(P1,N2) and the claim here is the check "returns early". Because of https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/utils/predicates.go#L358:

```go
func hasSameLabelValue(node1, node2 *v1.Node, key string) bool {
    if node1.Name == node2.Name { // returns early
        return true
    }
...
}
```

So the goal here is to ignore node N1 in the check. N1 is retrieved due to accessing pod.Spec.NodeName in https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/utils/predicates.go#L326. Instead, just check whether P1 does/does not violate the anti-affinity rule (produced by P2) on node N2 by checking node N2's labels (through TopologyKey key).
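To make that direction concrete, a rough sketch of such a check (hypothetical helper, not the actual patch; namespace handling and the symmetric direction, i.e. P1's own anti-affinity terms evaluated against the pods already on N2, are omitted for brevity):

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// violatesAntiAffinityOnNode reports whether candidatePod (P1) violates a
// required anti-affinity term of any pod (P2) already running on candidateNode
// (N2). Only N2's labels are consulted for the TopologyKey; P1's current node
// (N1) is deliberately ignored.
func violatesAntiAffinityOnNode(candidatePod *v1.Pod, podsOnNode []*v1.Pod, candidateNode *v1.Node) (bool, error) {
	for _, existing := range podsOnNode {
		if existing.Spec.Affinity == nil || existing.Spec.Affinity.PodAntiAffinity == nil {
			continue
		}
		for _, term := range existing.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
			selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
			if err != nil {
				return false, err
			}
			// The term is violated on N2 when N2 carries the term's TopologyKey
			// (P2 already occupies that topology domain) and P1 matches the selector.
			if _, ok := candidateNode.Labels[term.TopologyKey]; ok && selector.Matches(labels.Set(candidatePod.Labels)) {
				return true, nil
			}
		}
	}
	return false, nil
}
```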

@fanhaouu is this the current issue in hand? Just rephrasing so it is clear what this issue and fix is all about.

@nikimanoledaki for awareness.

fanhaouu commented 1 month ago

> > It should be assessing whether a Pod fits on another Node based on whether another Pod on this other Node has matching antiaffinities, but it returns early, due to the following issue.
>
> Let's have P1 scheduled to N1 and P2 scheduled to N2. Pods P1 and P2 have matching anti-affinities. The check here is NodeFit(P1,N2) and the claim here is the check "returns early". Because of https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/utils/predicates.go#L358:
>
> ```go
> func hasSameLabelValue(node1, node2 *v1.Node, key string) bool {
>     if node1.Name == node2.Name { // returns early
>         return true
>     }
> ...
> }
> ```
>
> So the goal here is to ignore node N1 in the check. N1 is retrieved due to accessing pod.Spec.NodeName in https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/utils/predicates.go#L326. Instead, just check whether P1 does/does not violate the anti-affinity rule (produced by P2) on node N2 by checking node N2's labels (through TopologyKey key).
>
> @fanhaouu is this the current issue in hand? Just rephrasing so it is clear what this issue and fix is all about.
>
> @nikimanoledaki for awareness.

Yes, you are right.

ingvagabund commented 1 month ago

> It should be assessing whether a Pod fits on another Node based on whether another Pod on this other Node has matching antiaffinities, but it returns early, due to the following issue.

More on this. Again with let's have P1 scheduled to N1 and P2 scheduled to N2. Pods P1 and P2 have matching anti-affinities. The check here is NodeFit(P1,N2) and the claim here is the check "returns early".

Given N1 != N2, hasSameLabelValue is true only when both N1 and N2 have the same Labels[TopologyKey] value, meaning hasSameLabelValue returns at the end. @nikimanoledaki would you please elaborate more on what "returns early" means in context of https://github.com/kubernetes-sigs/descheduler/issues/1370? I have spent so much time looking at the code that I am probably missing the obvious.

ingvagabund commented 1 month ago

Wrt. "returns early": In case CheckPodsWithAntiAffinityExist wants to detect that P1 is in violation with P2 on N2 while P1 and P2 are in "pod anti-affinity" relation and N2 has Labels[TopologyKey] (the actual value is irrelevant), CheckPodsWithAntiAffinityExist is expected to return true. Which can happen only when hasSameLabelValue returns true. Which is not the case when N1 (which is different from N2) leads to hasSameLabelValue returning false. Causing CheckPodsWithAntiAffinityExist returns false, which turns into NodeFit not erroring on podMatchesInterPodAntiAffinity.

So the following case leads to NodeFit(P1, N2) behaving incorrectly (a short illustration follows the list):

  1. P1 on N1, P2 on N2, P1 in pod anti-affinity with P2, N1 != N2,
  2. N2 has Labels[TopologyKey], N1 does not have Labels[TopologyKey]
  3. NodeFit(P1, N2) returns no error instead of "pod matches inter-pod anti-affinity rule of other pod on node" since N1 is missing Labels[TopologyKey] causing hasSameLabelValue to return false
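A tiny plain-Go illustration of point 3 (no descheduler imports; the reading of hasSameLabelValue's elided body is an assumption):

```go
package main

import "fmt"

func main() {
	topologyKey := "topology.kubernetes.io/zone"
	n1Labels := map[string]string{}                      // N1: no topology label
	n2Labels := map[string]string{topologyKey: "zone-a"} // N2: the candidate node, where P2 runs

	// What a hasSameLabelValue(N1, N2, key)-style comparison asks (assumed
	// reading of the elided body): do both nodes carry the key with the same
	// value? With N1 missing the label this is false, so the violation on N2
	// is missed.
	val1, ok1 := n1Labels[topologyKey]
	val2, ok2 := n2Labels[topologyKey]
	fmt.Println("compare N1 vs N2:", ok1 && ok2 && val1 == val2) // false

	// What NodeFit(P1, N2) actually needs to know: does the candidate node
	// itself carry the topology key? P2 already runs there, so sharing the
	// topology domain is guaranteed.
	_, candidateHasKey := n2Labels[topologyKey]
	fmt.Println("check N2 only:", candidateHasKey) // true
}
```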
ingvagabund commented 1 month ago

/ok-to-test

likakuli commented 1 month ago

/cc

k8s-ci-robot commented 1 month ago

@likakuli: GitHub didn't allow me to request PR reviews from the following users: likakuli.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/kubernetes-sigs/descheduler/pull/1395#issuecomment-2126617440):

> /cc

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
nikimanoledaki commented 1 month ago

> Given N1 != N2, hasSameLabelValue is true only when both N1 and N2 have the same Labels[TopologyKey] value.

This is correct - I came to the same conclusion: when there are multiple pods and multiple nodes (rather than P1 and P2 both on N1), given that both pods have an anti-affinity, a valid check is whether the nodes have the same TopologyKey value.

I wrote a draft PR a while ago that I didn't have time to finish; it has more logging that I was using to debug in our environment, and it approaches this logic in a slightly different way: https://github.com/kubernetes-sigs/descheduler/pull/1415/files

nikimanoledaki commented 1 month ago

> would you please elaborate more on what "returns early" means in context of https://github.com/kubernetes-sigs/descheduler/issues/1370?

It returned early here:

```go
// assuming that we have two pods, P1 and P2,
// and assuming that we have two nodes, N1 and N2,
// where P1 is on N1 and P2 is on N2,
// the following will return early because it evaluates if P1 is on N2, but it isn't.
node, ok := nodeMap[pod.Spec.NodeName]
```

As you can see in the draft PR I used to debug, I added the following log:

```go
node, ok := nodeMap[pod.Spec.NodeName]
if !ok {
    klog.V(4).InfoS("antiaffinity check: pod being evaluated not in node", "pod", klog.KObj(pod), "node", klog.KObjs(node))
    continue
}
```
ingvagabund commented 1 month ago

> `node, ok := nodeMap[pod.Spec.NodeName]`

Normally, the descheduler is expected to check only a pod that has been scheduled, which implies nodeMap[pod.Spec.NodeName] will always exist. Though, I can see how CheckPodsWithAntiAffinityExist could be functionally tested with a pod that has not been scheduled yet.

nikimanoledaki commented 1 month ago

> the descheduler is expected to check only a pod that has been scheduled, which implies nodeMap[pod.Spec.NodeName] will always exist.

Hm not sure we are seeing the same thing. Taking a step back, we know that NodeFit should be able to check whether P1, which is on N1, can be scheduled on another node, N2. With this anti-affinity check, it should check if there is an existing Pod on N2, P2, which has anti-affinity with P1.

Currently, the code will check whether P1 and P2 are on N2. Of course, we know that P1 is not yet on N2, so node, ok := nodeMap[pod.Spec.NodeName] will always return !ok and then return early (continue). Therefore, we don't want to check this if we are coming from NodeFit where P1 is not on N2 (yet).

I think the issue is that this was a leftover/oversight on my part when I was refactoring this in https://github.com/kubernetes-sigs/descheduler/pull/1356, where I tried to refactor this anti-affinity check from RemovePodsViolatingInterPodAntiAffinity. That plugin was comparing P1 and P2 when they are already both on the same Node. On the other hand, when coming from NodeFit, it should be able to check when P1 is on N1 and P2 is on N2. I hope this makes some sense 🙈

ingvagabund commented 1 month ago

> Currently, the code will check whether P1 and P2 are on N2. Of course, we know that P1 is not yet on N2, so node, ok := nodeMap[pod.Spec.NodeName] will always return !ok and then return early (continue). Therefore, we don't want to check this if we are coming from NodeFit where P1 is not on N2 (yet).

P1 is scheduled on N1 so I wonder why you get !ok when accessing nodeMap[P1.Spec.NodeName]

nikimanoledaki commented 1 month ago

> P1 is scheduled on N1 so I wonder why you get !ok when accessing nodeMap[P1.Spec.NodeName]

Because the Node that we are checking for in nodeMap is in fact N2! It's the destination Node that we are evaluating against, not the origin Node...

It's checking if P1 is on N2 (which is not expected to be there yet, therefore it returns early) and then if P2 is on N2 (which we expect it to be).

At least, this is my understanding - I could be wrong.

nikimanoledaki commented 1 month ago

This is my understanding, let me know if this is different from yours

[screenshot: 2024-05-23 at 15:55:50]
fanhaouu commented 1 month ago

/retest

ingvagabund commented 1 month ago

@fanhaouu apologies for pushing the other commit. I mis-clicked.

fanhaouu commented 1 month ago

> @fanhaouu apologies for pushing the other commit. I mis-clicked.

haha, let me resubmit it.

fanhaouu commented 1 month ago

/retest

fanhaouu commented 1 month ago

I made adjustments: instead of trying to reuse the CheckPodsWithAntiAffinityExist method, podMatchesInterPodAntiAffinity is now implemented on its own, and some parameter names and internal logic were adjusted along the way. @ingvagabund @nikimanoledaki looking forward to a re-review and merging as soon as possible to avoid impacting users.

ingvagabund commented 4 weeks ago

Very nice!!! @fanhaouu thank you for your patience and improvements.

/approve

Leaving the final lgtm for @nikimanoledaki

k8s-ci-robot commented 4 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ingvagabund

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubernetes-sigs/descheduler/blob/master/OWNERS)~~ [ingvagabund]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
ingvagabund commented 4 weeks ago

/lgtm