Closed fanhaouu closed 4 weeks ago
Hi @fanhaouu. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
I understand the commands that are listed here.
@fanhaouu would you please share an example of this case? With nodeFit=true
the pod anti-affinity check is completely skipped.
Currently, the "podMatchesInterPodAntiAffinity" method has an issue: it is unable to check whether the pod matches the anti-affinity rule of another pod that is already on the given node. The root cause is that the "CheckPodsWithAntiAffinityExist" method in "predicate.go" can only check whether pods with anti-affinity issues exist on the same node.
Can you rephrase the description in terms of "a pod P1 on a node N1", "a pod P1 wrt. P2", etc.? The description is still ambiguous. It would help to start the description with "There are two nodes N1 and N2. Both pods P1 and P2 are scheduled. P1 to node N1, P2 to node N2. There's another pod P3 scheduled to node N? that violates anti-affinity of P?. The current algorithm does THIS for pod P? instead of THIS. etc.".
Sure, let me give you an example.
According to the current logic of podMatchesInterPodAntiAffinity, when determining whether pod1 can be scheduled to node2, the method returns match as false and error as nil. In reality, match should be true, meaning that pod1 cannot be scheduled to node2.
Pod1:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  nodeName: node1
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
```
Node1:
```yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    topology.kubernetes.io/zone: xx-1
  name: node1
```
Pod2:
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    security: S2
  name: pod2
spec:
  nodeName: node2
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
```
Node2:
```yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    topology.kubernetes.io/zone: xx-1
  name: node2
```
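To see why the expected answer is match=true: node1 and node2 carry the same zone label, and pod2 (labeled security=S2) already sits on node2, so placing pod1 there would put it in the same topology domain as a pod its anti-affinity term selects. The following is a minimal, self-contained Go sketch of that reasoning only; the `Pod`/`Node` structs, the `AntiAffinityMatchLabel` field, and `violatesAntiAffinity` are simplified illustrative stand-ins, not the descheduler's or Kubernetes' actual types:

```go
package main

import "fmt"

// Node is a simplified stand-in for v1.Node.
type Node struct {
	Name   string
	Labels map[string]string
}

// Pod is a simplified stand-in; its anti-affinity is reduced to a single
// term: "no pod whose labels contain AntiAffinityMatchLabel may share the
// TopologyKey domain with me".
type Pod struct {
	Name                   string
	Labels                 map[string]string
	NodeName               string
	AntiAffinityMatchLabel map[string]string
	TopologyKey            string
}

// violatesAntiAffinity reports whether placing p on candidate would put it in
// the same topology domain as existing (scheduled on existingNode), given that
// existing's labels match p's anti-affinity selector. Note it only ever looks
// at the candidate node's labels, never at p's current node.
func violatesAntiAffinity(p, existing *Pod, candidate, existingNode *Node) bool {
	for k, v := range p.AntiAffinityMatchLabel {
		if existing.Labels[k] != v {
			return false // selector does not match the existing pod
		}
	}
	cv, ok1 := candidate.Labels[p.TopologyKey]
	ev, ok2 := existingNode.Labels[p.TopologyKey]
	return ok1 && ok2 && cv == ev // same topology domain -> violation
}

func main() {
	node2 := &Node{Name: "node2", Labels: map[string]string{"topology.kubernetes.io/zone": "xx-1"}}
	pod1 := &Pod{Name: "pod1", NodeName: "node1",
		AntiAffinityMatchLabel: map[string]string{"security": "S2"},
		TopologyKey:            "topology.kubernetes.io/zone"}
	pod2 := &Pod{Name: "pod2", NodeName: "node2", Labels: map[string]string{"security": "S2"}}

	// pod2 matches pod1's selector and is on node2 in zone xx-1, so pod1
	// placed on node2 would share the zone with it.
	fmt.Println(violatesAntiAffinity(pod1, pod2, node2, node2)) // true
}
```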
@ingvagabund I don't know if the example I provided is clear enough?
According to the current logic of podMatchesInterPodAntiAffinity, when determining whether pod1 can be scheduled to node2, the method returns match as false and error as nil. In reality, match should be true, meaning that pod1 cannot be scheduled to node2.
So it's the NodeFit part here that's in question? I.e., when pod1 is to be evicted, you'd expect the descheduler to check whether there's a node where pod1 (resp. its clone) can be scheduled to?
but in reality, the match value should be true, meaning that pod1 cannot be scheduled to node2
If the match is true, the strategy is expected to evict the pod. Independent of whether the pod can be scheduled to another node or not. If you want to implement the nodefit part for the strategy you can utilize sigs.k8s.io/descheduler/pkg/descheduler/node.PodFitsAnyOtherNode
as e.g. in https://github.com/kubernetes-sigs/descheduler/blob/bb5930eb21605879104e02d9da62bd96180d9492/pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go#L364.
Is this PR expected to resolve https://github.com/kubernetes-sigs/descheduler/issues/1370?
I see now what's going on:
Here, we evaluate whether both Pods are present on the Node. It only makes sense to check this for the plugin where this function originated from, RemovePodsViolatingInterPodAntiAffinity. That plugin checks if both Pods are on the same Node and whether to evict one of them, if they have matching antiaffinities. It does not make sense to do this when we’re evaluating if a Pod could fit on another Node as part of NodeFit.
You are addressing this part: "It does not make sense to do this when we’re evaluating if a Pod could fit on another Node as part of NodeFit."
Is this PR expected to resolve #1370?
Yes, it can fix #1370.
It should be assessing whether a Pod fits on another Node based on whether another Pod on this other Node has matching antiaffinities, but it returns early, due to the following issue.
Let's have P1 scheduled to N1 and P2 scheduled to N2. Pods P1 and P2 have matching anti-affinities. The check here is NodeFit(P1,N2) and the claim here is that the check "returns early", because of https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/utils/predicates.go#L358:
```go
func hasSameLabelValue(node1, node2 *v1.Node, key string) bool {
	if node1.Name == node2.Name { // returns early
		return true
	}
	...
}
```
So the goal here is to ignore node N1 in the check. N1 is retrieved due to accessing pod.Spec.NodeName in https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/utils/predicates.go#L326. Instead, just check whether P1 does/does not violate the anti-affinity rule (produced by P2) on node N2 by checking node N2's labels (through the TopologyKey key).
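The check proposed here can be sketched as follows. This is a minimal illustration under the thread's assumptions, not the descheduler's actual code; `fixedCheck` and its parameters are hypothetical names:

```go
package main

import "fmt"

// fixedCheck sketches the proposed logic: when evaluating NodeFit(P1, N2),
// P1's current node N1 is ignored entirely. P1 conflicts with P2 on N2 when
// P2 matches the anti-affinity selector and N2 carries the term's TopologyKey
// label (P2 already occupies that topology domain on N2).
func fixedCheck(candidateLabels map[string]string, topologyKey string, selectorMatchesP2 bool) bool {
	_, hasKey := candidateLabels[topologyKey]
	return selectorMatchesP2 && hasKey
}

func main() {
	zoneKey := "topology.kubernetes.io/zone"
	n2Labels := map[string]string{zoneKey: "xx-1"}

	fmt.Println(fixedCheck(n2Labels, zoneKey, true))              // true: violation found
	fmt.Println(fixedCheck(map[string]string{}, zoneKey, true))   // false: N2 lacks the topology label
}
```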
@fanhaouu is this the current issue in hand? Just rephrasing so it is clear what this issue and fix is all about.
@nikimanoledaki for awareness.
Yes, you are right.
It should be assessing whether a Pod fits on another Node based on whether another Pod on this other Node has matching antiaffinities, but it returns early, due to the following issue.
More on this, again with P1 scheduled to N1 and P2 scheduled to N2, where pods P1 and P2 have matching anti-affinities. The check here is NodeFit(P1,N2) and the claim here is that the check "returns early".
Given N1 != N2, hasSameLabelValue is true only when both N1 and N2 have the same Labels[TopologyKey] value, meaning hasSameLabelValue returns at the end. @nikimanoledaki would you please elaborate more on what "returns early" means in the context of https://github.com/kubernetes-sigs/descheduler/issues/1370? I have spent too much time looking at the code; I am probably missing the obvious.
Wrt. "returns early": In case CheckPodsWithAntiAffinityExist wants to detect that P1 is in violation with P2 on N2, while P1 and P2 are in a "pod anti-affinity" relation and N2 has Labels[TopologyKey] (the actual value is irrelevant), CheckPodsWithAntiAffinityExist is expected to return true. That can happen only when hasSameLabelValue returns true, which is not the case when N1 (which is different from N2) leads to hasSameLabelValue returning false. This causes CheckPodsWithAntiAffinityExist to return false, which turns into NodeFit not erroring on podMatchesInterPodAntiAffinity.
So the following case leads to NodeFit(P1, N2) behaving incorrectly:
- N2 has Labels[TopologyKey], N1 does not have Labels[TopologyKey]
- NodeFit(P1, N2) returns no error instead of "pod matches inter-pod anti-affinity rule of other pod on node", since N1 missing Labels[TopologyKey] causes hasSameLabelValue to return false
/ok-to-test
/cc
@likakuli: GitHub didn't allow me to request PR reviews from the following users: likakuli.
Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.
Given N1 != N2 then hasSameLabelValue is true only when both N1 and N2 have the same Labels[TopologyKey] value.
This is correct - I came to the same conclusion that when there are multiple pods and multiple nodes (rather than P1 and P2 both on N1), given that both pods have an anti-affinity, a valid check to validate this is if they have the same TopologyKey.
I wrote a draft PR a while ago that I didn't have time to finish, here, which has more logging that I was using to debug in our environment. It approached this logic in a slightly different way: https://github.com/kubernetes-sigs/descheduler/pull/1415/files
would you please elaborate more on what "returns early" means in context of https://github.com/kubernetes-sigs/descheduler/issues/1370?
It returned early here:
```go
// assuming that we have two pods, P1 and P2,
// and assuming that we have two nodes, N1 and N2,
// where P1 is on N1 and P2 is on N2,
// the following will return early because it evaluates if P1 is on N2, but it isn't.
node, ok := nodeMap[pod.Spec.NodeName]
```
As you can see in the PR I used to debug, here, I added the following log:
```go
node, ok := nodeMap[pod.Spec.NodeName]
if !ok {
	klog.V(4).InfoS("antiaffinity check: pod being evaluated not in node", "pod", klog.KObj(pod), "node", klog.KObjs(node))
	continue
}
```
`node, ok := nodeMap[pod.Spec.NodeName]`
Normally, the descheduler is expected to check only a pod that has been scheduled, which implies nodeMap[pod.Spec.NodeName] will always exist. Though, I can see how CheckPodsWithAntiAffinityExist can be functionally tested over a pod that has not been scheduled yet.
the descheduler is expected to check only a pod that has been scheduled. Which implies nodeMap[pod.Spec.NodeName] will always exist.
Hm, not sure we are seeing the same thing. Taking a step back, we know that NodeFit should be able to check whether P1, which is on N1, can be scheduled on another node, N2. With this anti-affinity check, it should check if there is an existing Pod on N2, P2, which has anti-affinity with P1.
Currently, the code will check whether P1 and P2 are on N2. Of course, we know that P1 is not yet on N2, so node, ok := nodeMap[pod.Spec.NodeName] will always return !ok and then return early (continue). Therefore, we don't want to check this if we are coming from NodeFit, where P1 is not on N2 (yet).
I think the issue is that this was a leftover/oversight on my part when I was refactoring this in https://github.com/kubernetes-sigs/descheduler/pull/1356, where I tried to refactor this anti-affinity check from RemovePodsViolatingInterPodAntiAffinity. That plugin was comparing P1 and P2 when they are already both on the same Node. On the other hand, when coming from NodeFit, it should be able to check when P1 is on N1 and P2 is on N2. I hope this makes some sense 🙈
Currently, the code will check whether P1 and P2 are on N2. Of course, we know that P1 is not yet on N2, so node, ok := nodeMap[pod.Spec.NodeName] will always return !ok and then return early (continue). Therefore, we don't want to check this if we are coming from NodeFit where P1 is not on N2 (yet).
P1 is scheduled on N1 so I wonder why you get !ok when accessing nodeMap[P1.Spec.NodeName]
P1 is scheduled on N1 so I wonder why you get !ok when accessing nodeMap[P1.Spec.NodeName]
Because the Node that we are checking for in nodeMap is in fact N2! It's the destination Node that we are evaluating against, not the origin Node...
It's checking if P1 is on N2 (which is not expected to be there yet, therefore it returns early) and then if P2 is on N2 (which we expect it to be).
At least, this is my understanding - I could be wrong.
This is my understanding, let me know if this is different from yours
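The disagreement above boils down to what `nodeMap` is keyed by. Under the reading that it holds only the candidate (destination) nodes, which is an assumption of this sketch rather than something verified against the code, the lookup behaves like this:

```go
package main

import "fmt"

func main() {
	// Assume nodeMap holds only the candidate (destination) nodes being
	// evaluated by NodeFit, keyed by node name.
	nodeMap := map[string]struct{}{"node2": {}}

	p1NodeName := "node1" // P1 is still scheduled on N1
	if _, ok := nodeMap[p1NodeName]; !ok {
		// Mirrors the `continue` in CheckPodsWithAntiAffinityExist:
		// the anti-affinity comparison for P1 is skipped entirely.
		fmt.Println("lookup missed; check skipped")
	}
}
```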
/retest
@fanhaouu apologies for pushing the other commit. I mis-clicked.
haha, let me resubmit it.
/retest
I made adjustments: I no longer reuse the CheckPodsWithAntiAffinityExist method, but instead implemented the podMatchesInterPodAntiAffinity method on its own, while also adjusting some parameter names and internal logic. @ingvagabund @nikimanoledaki looking forward to a re-review and merging as soon as possible to avoid impacting users.
Very nice!!! @fanhaouu thank you for your patience and improvements.
/approve
Leaving the final lgtm for @nikimanoledaki
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ingvagabund
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/lgtm
Currently, the "podMatchesInterPodAntiAffinity" method has an issue: it is unable to check whether the pod matches the anti-affinity rule of another pod that is already on the given node.
The root cause is that the "CheckPodsWithAntiAffinityExist" method in "predicate.go" can only check whether pods with anti-affinity issues exist on the same node.
This PR resolves that.