kubernetes/autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Pod Anti-Affinity prevents scale up, requires manual pod deletion #5741

Open alvaroaleman opened 1 year ago

alvaroaleman commented 1 year ago

Which component are you using?:

Cluster-Autoscaler

What version of the component are you using?:

Component version:

registry.k8s.io/autoscaling/cluster-autoscaler:v1.23.1

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.16-eks-48e63af", GitCommit:"e6332a8a3feb9e0fe3db851878f88cb73d49dd7a", GitTreeState:"clean", BuildDate:"2023-01-24T19:18:15Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

EKS

What did you expect to happen?:

Scale-Up

What happened instead?:

Pod a has a node-level anti-affinity to pod b. Pod b is running and pod a is pending. Cluster-Autoscaler fails to scale up:

pod didn't trigger scale-up: 2 node(s) didn't find available persistent volumes to bind, 6 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod anti-affinity rules

Manually deleting pod b triggered a scale-up, after which both pod a and pod b could be scheduled.
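
To illustrate, here is a minimal sketch of the kind of manifest pair that produces this situation (pod names, labels, and image are illustrative, not taken from the original report). Pod b is already running and carries the label that pod a's required anti-affinity selects on; topologyKey: kubernetes.io/hostname makes the constraint node-level:

# pod-b: already running; carries the label the anti-affinity keys on
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
  labels:
    app: b
spec:
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
---
# pod-a: stays Pending; must not share a node with any pod labeled app=b
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: b
        # node-level scope: one node may not host both pods
        topologyKey: kubernetes.io/hostname
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9

If, as the analysis further down suggests, the autoscaler simulates the new node from a sample that includes pods like pod-b, the pending pod looks unschedulable on that simulated node too, and no scale-up happens.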

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

arkbriar commented 8 months ago

Hi, I also ran into the same problem recently. I checked the code, and my guess is that it happens because the autoscaler checks the scheduling predicates against all pods already running on a sampled node of the node group. That would explain why deleting the existing pod helps.

BTW, I got the following error in my case:

Pod default/test can't be scheduled on eks-xxxxxx-node-group, predicate checking error: node(s) didn't match pod anti-affinity rules; predicateName=InterPodAffinity; reasons: node(s) didn't match pod anti-affinity rules; debugInfo=

I wonder if testing against DaemonSet pods only would help.

arkbriar commented 8 months ago

Never mind. I misunderstood the effect of the NotIn operator, and the pod conflicted with DaemonSet pods. After adding another expression with Exists to ensure that the label is present, the pod can trigger scale-up again.
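
To illustrate the fix described above (the label key and value here are placeholders): a NotIn requirement alone also matches pods that lack the label entirely, such as DaemonSet pods, so the simulated node appears to violate the rule; pairing it with an Exists requirement on the same key limits the match to pods that actually carry the label.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        # NotIn by itself also selects pods that have no "app" label
        # at all, e.g. DaemonSet pods on the sampled node
        - key: app
          operator: NotIn
          values:
          - b
        # Exists narrows the selector to pods that do carry the label,
        # so unlabeled DaemonSet pods no longer conflict
        - key: app
          operator: Exists
      topologyKey: kubernetes.io/hostname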

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

rluan commented 2 months ago

Same issue here; waiting for answers...

AWS EKS v1.29

$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6-eks-db838b0

autoscaler version: v1.29.2
autoscaler mode: cluster-autoscaler-autodiscover.yaml
autoscaler error logs:

I0718 06:07:02.432263 1 orchestrator.go:542] Pod ingress-nginx/ingress-nginx-controller-65f45474f6-nqgjt can't be scheduled on eksctl-eng-us1p-eks-nodegroup-proxy-v1-NodeGroup-cviXU8g2vYQq, predicate checking error: node(s) didn't match pod anti-affinity rules; predicateName=InterPodAffinity; reasons: node(s) didn't match pod anti-affinity rules; debugInfo=

It only works for the node group/deployment that does not have the anti-affinity snippet. Weird...

$ diff ingress-nginx-controller-v1.10.1.yaml ingress-nginx-controller-v1.10.1.yaml-testnoantiaffinity
443,452d442
<       affinity:
<         podAntiAffinity:
<           requiredDuringSchedulingIgnoredDuringExecution:
<           - labelSelector:
<               matchExpressions:
<               - key: app.kubernetes.io/component
<                 operator: In
<                 values:
<                 - controller
<             topologyKey: topology.kubernetes.io/zone

So as you can see, that is the only difference between the manifests; when I run kubectl apply -f ingress-nginx-controller-v1.10.1.yaml-testnoantiaffinity, the autoscaler works as expected.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

almson commented 3 weeks ago

Basically, you need to run one ASG per AZ; this solves a number of problems, including this one (see the sketch after the patch below). Otherwise, you can fork cluster-autoscaler and apply this patch:

diff --git a/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go b/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go
index 7fb533570..18e8adca9 100644
--- a/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go
+++ b/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go
@@ -604,7 +604,10 @@ func (o *ScaleUpOrchestrator) SchedulablePodGroups(
        var schedulablePodGroups []estimator.PodEquivalenceGroup
        for _, eg := range podEquivalenceGroups {
                samplePod := eg.Pods[0]
-               if err := o.autoscalingContext.PredicateChecker.CheckPredicates(o.autoscalingContext.ClusterSnapshot, samplePod, nodeInfo.Node().Name); err == nil {
+               err := o.autoscalingContext.PredicateChecker.CheckPredicates(o.autoscalingContext.ClusterSnapshot, samplePod, nodeInfo.Node().Name)
+               // Ignore inter-pod affinity since this is not usually a reason to fail the ASG
+               // (unless the anti-affinity conflict is with a DaemonSet pod, but there's no way to tell)
+               if err == nil || err.PredicateName() == "InterPodAffinity" {
                        // Add pods to option.
                        schedulablePodGroups = append(schedulablePodGroups, estimator.PodEquivalenceGroup{
                                Pods: eg.Pods,
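
For the first suggestion, here is a minimal sketch of a one-node-group-per-AZ layout using an eksctl ClusterConfig (cluster name, region, zones, and sizes are illustrative, not from this thread). Pinning each group to a single AZ makes the zone of a templated node unambiguous, so zone-scoped affinity and anti-affinity can be simulated correctly:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
managedNodeGroups:
# one node group (and therefore one ASG) per availability zone
- name: workers-1a
  availabilityZones: ["us-east-1a"]
  minSize: 0
  maxSize: 10
- name: workers-1b
  availabilityZones: ["us-east-1b"]
  minSize: 0
  maxSize: 10

A setup like this is typically paired with the cluster-autoscaler --balance-similar-node-groups flag so that capacity is spread roughly evenly across the zones.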