kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

Descheduler Pod stuck in CrashLoopBackOff #1392

Open devops-inthe-east opened 1 month ago

devops-inthe-east commented 1 month ago

Hey Folks,

Descheduler version: 0.29.0

My environment is an EKS 1.27 cluster that uses this component.

It was working as expected; however, once I reduced my worker nodes from 4 to 2, the descheduler pod got stuck in a CrashLoopBackOff state and threw this error:

**"The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting.."**

I am unsure what this error message means, as I still have 2 active worker nodes.
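For context, the abort is consistent with a startup guard: judging from the `descheduler.go:430] the cluster size is 0 or 1` error in the logs below, the descheduler counts the nodes it considers usable and refuses to run when fewer than two remain, since evicting pods with nowhere else to reschedule them would only cause disruption. Below is a minimal sketch of that kind of check, assuming the count covers only nodes reporting a `Ready` condition; the `isReady` and `checkClusterSize` helpers here are illustrative, not the project's actual functions:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// isReady reports whether the node's Ready condition is True.
// Hypothetical helper for illustration.
func isReady(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
			return true
		}
	}
	return false
}

// checkClusterSize refuses to proceed when fewer than two ready nodes
// exist, mirroring the "cluster size is 0 or 1" abort seen in the logs.
func checkClusterSize(nodes []*v1.Node) error {
	ready := 0
	for _, node := range nodes {
		if isReady(node) {
			ready++
		}
	}
	if ready <= 1 {
		return fmt.Errorf("the cluster size is 0 or 1")
	}
	return nil
}

func main() {
	// A node that reports Ready=True.
	readyNode := &v1.Node{
		Status: v1.NodeStatus{
			Conditions: []v1.NodeCondition{
				{Type: v1.NodeReady, Status: v1.ConditionTrue},
			},
		},
	}
	// A node that reports Ready=False, e.g. one mid-removal.
	notReadyNode := &v1.Node{
		Status: v1.NodeStatus{
			Conditions: []v1.NodeCondition{
				{Type: v1.NodeReady, Status: v1.ConditionFalse},
			},
		},
	}
	// Two registered nodes, but only one ready: the guard trips.
	fmt.Println(checkClusterSize([]*v1.Node{readyNode, notReadyNode}))
}
```

Under that assumption, a cluster can show 2 registered nodes yet still count as size 1 if one of them is NotReady or in the middle of being removed, which matches what surfaces later in this thread.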

Complete logs

```
I0508 11:47:31.856171 1 secure_serving.go:57] Forcing use of http/1.1 only
I0508 11:47:31.856645 1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1715168851\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1715168851\" (2024-05-08 10:47:31 +0000 UTC to 2025-05-08 10:47:31 +0000 UTC (now=2024-05-08 11:47:31.856607744 +0000 UTC))"
I0508 11:47:31.856681 1 secure_serving.go:213] Serving securely on [::]:10258
I0508 11:47:31.856695 1 tracing.go:87] Did not find a trace collector endpoint defined. Switching to NoopTraceProvider
I0508 11:47:31.856771 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0508 11:47:31.857828 1 conversion.go:257] converting Deschedule plugin: RemovePodsViolatingNodeAffinity
I0508 11:47:31.857848 1 conversion.go:257] converting Deschedule plugin: RemovePodsViolatingNodeTaints
I0508 11:47:31.857859 1 conversion.go:248] converting Balance plugin: RemovePodsViolatingTopologySpreadConstraint
I0508 11:47:31.857885 1 conversion.go:248] converting Balance plugin: LowNodeUtilization
I0508 11:47:31.857902 1 conversion.go:248] converting Balance plugin: RemoveDuplicates
I0508 11:47:31.857925 1 conversion.go:257] converting Deschedule plugin: RemovePodsHavingTooManyRestarts
I0508 11:47:31.857939 1 conversion.go:257] converting Deschedule plugin: RemovePodsViolatingInterPodAntiAffinity
W0508 11:47:31.866856 1 descheduler.go:246] failed to convert Descheduler minor version to float
I0508 11:47:31.895468 1 reflector.go:289] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.895475 1 reflector.go:289] Starting reflector *v1.PriorityClass (0s) from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.895504 1 reflector.go:325] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.895506 1 reflector.go:325] Listing and watching *v1.PriorityClass from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.895520 1 reflector.go:289] Starting reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.895535 1 reflector.go:325] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.895576 1 reflector.go:289] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.895596 1 reflector.go:325] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.898936 1 reflector.go:351] Caches populated for *v1.PriorityClass from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.899083 1 reflector.go:351] Caches populated for *v1.Namespace from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.900057 1 reflector.go:351] Caches populated for *v1.Node from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.924478 1 reflector.go:351] Caches populated for *v1.Pod from k8s.io/client-go/informers/factory.go:159
I0508 11:47:31.996485 1 descheduler.go:120] "The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting.."
E0508 11:47:31.996554 1 descheduler.go:430] the cluster size is 0 or 1
I0508 11:47:31.996718 1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0508 11:47:31.996759 1 secure_serving.go:258] Stopped listening on [::]:10258
I0508 11:47:31.996805 1 reflector.go:295] Stopping reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:159
```

JaneLiuL commented 1 month ago

Could you execute these commands and show me the result? It seems there is only 1 worker node in your cluster: `kubectl get nodes -o wide` and `kubectl get nodes -o yaml`
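As a readiness-focused variant of the commands above (a sketch using kubectl's JSONPath output; the exact criteria the descheduler applies may differ), the following prints each node's name alongside its Ready condition status, so you can see which of the 2 nodes actually count as ready:

```
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```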

devops-inthe-east commented 1 month ago

Hi Jane, thanks for your response.

My cluster's node count had suddenly been reduced by the Cluster Autoscaler (CA) without my being aware of it.

That gave me the incorrect impression that I still had at least 2 nodes.

While we are at it, may I know what can cause such behaviour by the descheduler?