kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

RemovePodsViolatingTopologySpreadConstraint evicting pods without topologySpreadConstraints #1167

Closed · sapslaj · closed 1 year ago

sapslaj commented 1 year ago

What version of descheduler are you using?

descheduler version: 0.27.1

Installed via Helm chart with values:

values.yaml
```yaml
kind: Deployment
cmdOptions:
  v: 4
deschedulerPolicy:
  evictSystemCriticalPods: true
  evictLocalStoragePods: true
  strategies:
    RemoveDuplicates:
      params:
        # Avoid accidentally evicting running Job pods
        excludeOwnerKinds: [Job]
    LowNodeUtilization:
      # Disabled since Karpenter takes care of this
      enabled: false
    RemoveFailedPods:
      enabled: true
      params:
        minPodLifetimeSeconds: 604800 # 1 week
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Avoid running on Karpenter-provisioned nodes
            - key: karpenter.sh/provisioner-name
              operator: DoesNotExist
service:
  enabled: true
serviceMonitor:
  enabled: true
```

Does this issue reproduce with the latest release?

Yes.

Which descheduler CLI options are you using?

```yaml
command:
  - /bin/descheduler
args:
  - --policy-config-file
  - /policy-dir/policy.yaml
  - --descheduling-interval
  - 5m
  - --v
  - "4"
```

Please provide a copy of your descheduler policy config file

policy.yaml
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
evictLocalStoragePods: true
evictSystemCriticalPods: true
strategies:
  LowNodeUtilization:
    enabled: false
    params:
      nodeResourceUtilizationThresholds:
        targetThresholds:
          cpu: 50
          memory: 50
          pods: 50
        thresholds:
          cpu: 20
          memory: 20
          pods: 20
  RemoveDuplicates:
    enabled: true
    params:
      excludeOwnerKinds:
        - Job
  RemoveFailedPods:
    enabled: true
    params:
      minPodLifetimeSeconds: 604800
  RemovePodsHavingTooManyRestarts:
    enabled: true
    params:
      podsHavingTooManyRestarts:
        includingInitContainers: true
        podRestartThreshold: 100
  RemovePodsViolatingInterPodAntiAffinity:
    enabled: true
  RemovePodsViolatingNodeAffinity:
    enabled: true
    params:
      nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
  RemovePodsViolatingNodeTaints:
    enabled: true
  RemovePodsViolatingTopologySpreadConstraint:
    enabled: true
    params:
      includeSoftConstraints: false
```

What k8s version are you using (kubectl version)?

kubectl version output:

```
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"4c9411232e10168d7b050c49a1b59f6df9d7ea4b", GitTreeState:"clean", BuildDate:"2023-04-14T13:14:41Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.9-eks-0a21954", GitCommit:"eb82cd845d007ae98d215744675dcf7ff024a5a3", GitTreeState:"clean", BuildDate:"2023-04-15T00:37:59Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.25) exceeds the supported minor version skew of +/-
```

What did you do?

Our cluster has a number of Jobs, some triggered via CronJobs, others created via the Kubernetes API directly (such as Argo CD hooks).

An example job looks somewhat like this:

job.yaml
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
    argocd.argoproj.io/tracking-id: psg-facts-production:batch/Job:apps/eso-syncer-psg-facts-production
    batch.kubernetes.io/job-tracking: ""
  labels:
    controller-uid: 0cd165de-9ec8-4b5f-9c6e-a61a6d656a9e
    job-name: eso-syncer-psg-facts-production
  name: eso-syncer-psg-facts-production
  namespace: apps
spec:
  backoffLimit: 2
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 0cd165de-9ec8-4b5f-9c6e-a61a6d656a9e
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 0cd165de-9ec8-4b5f-9c6e-a61a6d656a9e
        job-name: eso-syncer-psg-facts-production
    spec:
      automountServiceAccountToken: true
      containers:
        - env:
            - name: NAMESPACE
              valueFrom:
                configMapKeyRef:
                  key: NAMESPACE
                  name: eso-syncer
            - name: SLEEP_REFRESH_TIME_SECONDS
              valueFrom:
                configMapKeyRef:
                  key: SLEEP_REFRESH_TIME_SECONDS
                  name: eso-syncer
          image: 000000000000.dkr.ecr.us-west-2.amazonaws.com/eso-syncer:2.0.3
          imagePullPolicy: IfNotPresent
          name: eso-syncer
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            readOnlyRootFilesystem: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: eso-syncer
      serviceAccountName: eso-syncer
      terminationGracePeriodSeconds: 30
  ttlSecondsAfterFinished: 100
```

And the resulting Pod looks something like this (note the lack of topologySpreadConstraints or anything affinity-related):

pod.yaml
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-06-08T19:31:18Z"
  finalizers:
    - batch.kubernetes.io/job-tracking
  generateName: eso-syncer-psg-facts-production-
  labels:
    controller-uid: adc750aa-43a3-415b-9ece-e3d8356366c8
    job-name: eso-syncer-psg-facts-production
  name: eso-syncer-psg-facts-production-rm6j7
  namespace: apps
  ownerReferences:
    - apiVersion: batch/v1
      blockOwnerDeletion: true
      controller: true
      kind: Job
      name: eso-syncer-psg-facts-production
      uid: adc750aa-43a3-415b-9ece-e3d8356366c8
  resourceVersion: "122262821"
  uid: 99081296-f471-44cf-b827-49821c30bb49
spec:
  automountServiceAccountToken: true
  containers:
    - env:
        # note: `DD_` environment variables are injected from a mutating webhook.
        - name: DD_ENTITY_ID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.uid
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: NAMESPACE
          valueFrom:
            configMapKeyRef:
              key: NAMESPACE
              name: eso-syncer
        - name: SLEEP_REFRESH_TIME_SECONDS
          valueFrom:
            configMapKeyRef:
              key: SLEEP_REFRESH_TIME_SECONDS
              name: eso-syncer
      image: 000000000000.dkr.ecr.us-west-2.amazonaws.com/eso-syncer:2.0.3
      imagePullPolicy: IfNotPresent
      name: eso-syncer
      resources:
        limits:
          cpu: 100m
          memory: 100Mi
        requests:
          cpu: 100m
          memory: 100Mi
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - all
        readOnlyRootFilesystem: true
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-dvd2l
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-160-36-145.us-west-2.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: eso-syncer
  serviceAccountName: eso-syncer
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - name: kube-api-access-dvd2l
      projected:
        defaultMode: 420
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              items:
                - key: ca.crt
                  path: ca.crt
              name: kube-root-ca.crt
          - downwardAPI:
              items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
```

What did you expect to see?

Since the resulting Pod does not have any topologySpreadConstraints, I would expect Descheduler to leave it alone.

What did you see instead?

Pods are getting evicted by Descheduler:

I0608 19:03:49.190705       1 evictions.go:162] "Evicted pod" pod="apps/eso-syncer-psg-facts-production-rm6j7" reason="" strategy="RemovePodsViolatingTopologySpreadConstraint" node="ip-10-160-36-145.us-west-2.compute.internal"

This causes the Job controller to create a replacement pod. If that Pod doesn't complete before Descheduler's next cycle, it gets evicted again.

Interestingly, this seems to be the opposite of the problem described in #1138.

damemi commented 1 year ago

@sapslaj thanks for reporting this. Are there any other pods in the namespace that do have topology spread constraints?

sapslaj commented 1 year ago

Hey @damemi. Yes, there are other pods in that namespace with topology spread constraints. All of them are some variation of this:

```yaml
topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
        - key: app
          operator: In
          values:
            - psg-facts # or other app name
```
Since the jobs don't have an `app` label, I would expect them not to match.
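Concretely, the only labels on those Job pods are the ones the Job controller adds (see the Pod spec above), so an `app In (psg-facts)` selector shouldn't select them:

```yaml
# Labels on the evicted pod, copied from the Pod spec above.
# There is no `app` key for a constraint's labelSelector to match.
labels:
  controller-uid: adc750aa-43a3-415b-9ece-e3d8356366c8
  job-name: eso-syncer-psg-facts-production
```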

That's a good catch, though. I didn't realize that Descheduler takes all of the namespace's topology spread constraints into account until I checked the source code.

damemi commented 1 year ago

Yeah, for reference that's intended to be in line with how the scheduler treats labels in topology spread constraints:

> labelSelector is used to find matching Pods. Pods that match this label selector are counted to determine the number of Pods in their corresponding topology domain.

Though maybe it's ambiguous what "their corresponding topology domain" means (the same node? does it need a topology constraint?). We should check with the scheduling SIG on this.
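For concreteness, the constraints in question are generally shaped like this (values illustrative, not taken from the cluster above); only pods matching the labelSelector should be counted when computing the skew for each topology domain:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone # spread across zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: psg-facts # only pods carrying this label are counted per zone
```

So a pod without the `app` label shouldn't factor into that skew, or be evicted to correct it.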

It doesn't seem right that the pods are being evicted without a matching label, though. I'll try adding a test to reproduce this. Thanks for providing all of this info, @sapslaj.
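In the meantime, if you need to stop these Job pods from being evicted while we dig in, one thing to try (assuming this strategy honors the common `labelSelector` strategy parameter in 0.27; please double-check against the docs for your version) is scoping the strategy to pods that actually carry the `app` label your constraints select on:

```yaml
RemovePodsViolatingTopologySpreadConstraint:
  enabled: true
  params:
    includeSoftConstraints: false
    # Hypothetical scoping: only consider pods that have an `app` label at all,
    # which would exclude the Job pods shown above.
    labelSelector:
      matchExpressions:
        - key: app
          operator: Exists
```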

sapslaj commented 1 year ago

Awesome, thanks for confirming my suspicion that this is strange behavior. I'll keep digging on my end as well and see if I can uncover anything notable about our environment.

sapslaj commented 1 year ago

I've been doing quite a bit of testing and I can no longer reproduce the exact issue. However, we did find quite a few Jobs that had labels matching topology spread constraints and were generally misconfigured. We've been working with the teams who own those applications to fix them, and we're no longer seeing the issue. I'm still not totally sure why, but maybe it was some kind of edge case or side effect?
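For reference, the misconfigured Jobs looked roughly like this (simplified and anonymized, names hypothetical): their pod templates set the same `app` label that other workloads' topology spread constraints select on, so those pods were counted toward the skew in the corresponding domains and evicted to correct it:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-misconfigured-job # hypothetical name
  namespace: apps
spec:
  template:
    metadata:
      labels:
        # This matches another workload's constraint labelSelector,
        # so the Job pod gets pulled into that constraint's accounting.
        app: psg-facts
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: example/image:latest # placeholder
```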

Anyway, @damemi, if you can't reproduce the issue then maybe we should just close this, since I can't reproduce it anymore.

damemi commented 1 year ago

Thanks for the update @sapslaj. Unfortunately I haven't had a chance to look more into this, but we should make sure our tests do at least cover cases like this if they don't already.

I'm going to close this since it sounds like the misconfigured jobs were more likely your issue. If you hit it again, please feel free to reopen with more info. Thanks!

/close

k8s-ci-robot commented 1 year ago

@damemi: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/descheduler/issues/1167#issuecomment-1591712101):

> Thanks for the update @sapslaj. Unfortunately I haven't had a chance to look more into this, but we should make sure our tests do at least cover cases like this if they don't already.
>
> I'm going to close this since it sounds like the misconfigured jobs were more likely your issue. If you hit it again, please feel free to reopen with more info. Thanks!
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.