kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

Cluster Size check shouldn't exit process when running as Deployment #1298

Closed · markdingram closed this issue 2 months ago

markdingram commented 7 months ago

Is your feature request related to a problem? Please describe.

When there are 0 or 1 nodes, the descheduler loop returns the error `the cluster size is 0 or 1` and the process exits.

This doesn't play nicely when the descheduler is running as a Deployment - the pod goes into CrashLoopBackOff due to the repeated early exits.

Describe the solution you'd like

When the descheduler is running as a Deployment, the "cluster size is 0 or 1" check shouldn't exit the process. The process should keep running and simply wait for the next iteration.

Something like this (in `runDeschedulerLoop`):

    // if len is still <= 1 error out
    if len(nodes) <= 1 {
        klog.V(1).InfoS("The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting..")
        if d.rs.DeschedulingInterval.Seconds() == 0 {
            // one-shot run (no descheduling interval): keep the current behaviour and exit with an error
            return fmt.Errorf("the cluster size is 0 or 1")
        }
        // interval run (e.g. as a Deployment): skip this cycle and wait for the
        // next iteration instead of crashing the pod
        return nil
    }
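
For context, here is a minimal, self-contained sketch of the kind of loop that wraps `runDeschedulerLoop` (this is not the actual descheduler source; `runLoop`, the hard-coded node count and the interval are illustrative assumptions). It shows why returning `nil` when an interval is configured keeps the Deployment alive, while a one-shot run still exits on the error:

    // Hypothetical sketch: an interval-driven loop built on wait.NonSlidingUntil,
    // roughly mirroring how the descheduler re-runs its loop every DeschedulingInterval.
    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // runLoop stands in for runDeschedulerLoop with the proposed change applied.
    func runLoop(nodeCount int, interval time.Duration) error {
        if nodeCount <= 1 {
            if interval.Seconds() == 0 {
                // one-shot run: fail hard so the process exits non-zero
                return errors.New("the cluster size is 0 or 1")
            }
            // interval run (Deployment): skip this cycle and stay alive
            fmt.Println("cluster size is 0 or 1, skipping this cycle")
            return nil
        }
        fmt.Println("descheduling across", nodeCount, "nodes")
        return nil
    }

    func main() {
        interval := 10 * time.Second
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        // An error cancels the context and ends the loop (today's crash path);
        // a nil return lets NonSlidingUntil tick again after the interval.
        wait.NonSlidingUntil(func() {
            if err := runLoop(1, interval); err != nil {
                cancel()
                return
            }
            if interval.Seconds() == 0 {
                // no interval configured: run exactly once
                cancel()
            }
        }, interval, ctx.Done())
    }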

Describe alternatives you've considered

What version of descheduler are you using?

descheduler version:

0.28.0

Additional context

Example logs


I1124 16:00:06.155189       1 node.go:50] "Node lister returned empty list, now fetch directly"
I1124 16:00:06.160342       1 descheduler.go:121] "The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting.."
E1124 16:00:06.160453       1 descheduler.go:431] the cluster size is 0 or 1
I1124 16:00:06.160876       1 reflector.go:295] Stopping reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:150
I1124 16:00:06.160913       1 reflector.go:295] Stopping reflector *v1.PriorityClass (0s) from k8s.io/client-go/informers/factory.go:150
I1124 16:00:06.161268       1 secure_serving.go:255] Stopped listening on [::]:10258
I1124 16:00:06.161288       1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I1124 16:00:06.161487       1 reflector.go:295] Stopping reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:150
I1124 16:00:06.161529       1 reflector.go:295] Stopping reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:150
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
...
  Normal   Pulled     50m (x4 over 51m)     kubelet            Container image "registry.k8s.io/descheduler/descheduler:v0.28.0" already present on machine
  Warning  BackOff    100s (x238 over 51m)  kubelet            Back-off restarting failed container
grzesuav commented 7 months ago

I have a similar case. To be honest, I wanted to limit the descheduler's scope to just the nodes with a certain label, but with this behaviour that is not possible. I am planning to use the descheduler with the taint-based eviction policy (evicting pods not matching node taints), and additionally I have a companion operator which sets a certain label on the node.

I have worked around that by adding a node selector in the policy directly; however, this is suboptimal - I have over 1000 nodes in my cluster and I actually want the descheduler to watch just a few of them.
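
For reference, the workaround looks roughly like this in the policy file. This is a minimal sketch assuming the `descheduler/v1alpha2` policy API shipped with v0.28; the label key/value (applied by the companion operator) and the profile name are placeholders:

    # Restrict the descheduler to the few labelled nodes instead of all ~1000.
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    # Placeholder label set by the companion operator.
    nodeSelector: "example.com/descheduler-enabled=true"
    profiles:
      - name: taint-eviction
        pluginConfig:
          - name: "RemovePodsViolatingNodeTaints"
        plugins:
          deschedule:
            enabled:
              - "RemovePodsViolatingNodeTaints"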

grzesuav commented 7 months ago

I also found https://github.com/kubernetes-sigs/descheduler/issues/469, however I'm not sure what was fixed in that issue.

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/descheduler/issues/1298#issuecomment-2072196697):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.