kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

RemovePodsViolatingTopologySpreadConstraint does not work for cluster-level default constraints #1502

Open · odsevs opened 2 months ago

odsevs commented 2 months ago

What version of descheduler are you using?

descheduler version: 0.30.1

Does this issue reproduce with the latest release?

yes

Which descheduler CLI options are you using?

I installed the helm chart as-is with the values described below.

Please provide a copy of your descheduler policy config file

```yaml
# Default values for descheduler.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# CronJob or Deployment
kind: Deployment

image:
  repository: registry.k8s.io/descheduler/descheduler
  # Overrides the image tag whose default is the chart version
  tag: ""
  pullPolicy: IfNotPresent

imagePullSecrets:
#   - name: container-registry-secret

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  # limits:
  #   cpu: 100m
  #   memory: 128Mi

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  privileged: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

# podSecurityContext -- [Security context for pod](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)
podSecurityContext: {}
  # fsGroup: 1000

nameOverride: ""
fullnameOverride: ""

# labels that'll be applied to all resources
commonLabels: {}

cronJobApiVersion: "batch/v1"
schedule: "*/2 * * * *"
suspend: false
# startingDeadlineSeconds: 200
# successfulJobsHistoryLimit: 3
# failedJobsHistoryLimit: 1
# ttlSecondsAfterFinished: 600
# timeZone: Etc/UTC

# Required when running as a Deployment
deschedulingInterval: 1m

# Specifies the replica count for Deployment
# Set leaderElection if you want to use more than 1 replica
# Set affinity.podAntiAffinity rule if you want to schedule onto a node
# only if that node is in the same zone as at least one already-running descheduler
replicas: 1

# Specifies whether Leader Election resources should be created
# Required when running as a Deployment
# NOTE: Leader election can't be activated if DryRun enabled
leaderElection: {}
#  enabled: true
#  leaseDuration: 15s
#  renewDeadline: 10s
#  retryPeriod: 2s
#  resourceLock: "leases"
#  resourceName: "descheduler"
#  resourceNamespace: "kube-system"

command:
- "/bin/descheduler"

cmdOptions:
  v: 3

# Recommended to use the latest Policy API version supported by the Descheduler app version
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"

# deschedulerPolicy contains the policies the descheduler will execute.
# To use policies stored in an existing configMap use:
# NOTE: The name of the cm should comply to {{ template "descheduler.fullname" . }}
# deschedulerPolicy: {}
deschedulerPolicy:
  # nodeSelector: "key1=value1,key2=value2"
  # maxNoOfPodsToEvictPerNode: 10
  # maxNoOfPodsToEvictPerNamespace: 10
  # ignorePvcPods: true
  # evictLocalStoragePods: true
  # evictDaemonSetPods: true
  # tracing:
  #   collectorEndpoint: otel-collector.observability.svc.cluster.local:4317
  #   transportCert: ""
  #   serviceName: ""
  #   serviceNamespace: ""
  #   sampleRate: 1.0
  #   fallbackToNoOpProviderOnError: true
  profiles:
    - name: default
      pluginConfig:
        - name: DefaultEvictor
          args:
            ignorePvcPods: true
            evictLocalStoragePods: false
        - name: RemoveDuplicates
        - name: RemovePodsHavingTooManyRestarts
          args:
            podRestartThreshold: 100
            includingInitContainers: true
        - name: RemovePodsViolatingNodeAffinity
          args:
            nodeAffinityType:
            - requiredDuringSchedulingIgnoredDuringExecution
        - name: RemovePodsViolatingNodeTaints
        - name: RemovePodsViolatingInterPodAntiAffinity
        - name: RemovePodsViolatingTopologySpreadConstraint
        - name: LowNodeUtilization
          args:
            thresholds:
              cpu: 65
              memory: 60
              pods: 30
            targetThresholds:
              cpu: 85
              memory: 80
              pods: 40
      plugins:
        balance:
          enabled:
#            - RemoveDuplicates
            - RemovePodsViolatingTopologySpreadConstraint
#            - LowNodeUtilization
        deschedule:
          enabled:
#            - RemovePodsHavingTooManyRestarts
#            - RemovePodsViolatingNodeTaints
#            - RemovePodsViolatingNodeAffinity
#            - RemovePodsViolatingInterPodAntiAffinity

priorityClassName: system-cluster-critical

nodeSelector: {}
#  foo: bar

affinity: {}
# nodeAffinity:
#   requiredDuringSchedulingIgnoredDuringExecution:
#     nodeSelectorTerms:
#     - matchExpressions:
#       - key: kubernetes.io/e2e-az-name
#         operator: In
#         values:
#         - e2e-az1
#         - e2e-az2
#  podAntiAffinity:
#    requiredDuringSchedulingIgnoredDuringExecution:
#      - labelSelector:
#          matchExpressions:
#            - key: app.kubernetes.io/name
#              operator: In
#              values:
#                - descheduler
#        topologyKey: "kubernetes.io/hostname"
topologySpreadConstraints: []
# - maxSkew: 1
#   topologyKey: kubernetes.io/hostname
#   whenUnsatisfiable: DoNotSchedule
#   labelSelector:
#     matchLabels:
#       app.kubernetes.io/name: descheduler
tolerations: []
# - key: 'management'
#   operator: 'Equal'
#   value: 'tool'
#   effect: 'NoSchedule'

rbac:
  # Specifies whether RBAC resources should be created
  create: true

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # Specifies custom annotations for the serviceAccount
  annotations: {}

podAnnotations: {}

podLabels: {}

dnsConfig: {}

livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 10258
    scheme: HTTPS
  initialDelaySeconds: 3
  periodSeconds: 10

service:
  enabled: false
  # @param service.ipFamilyPolicy [string], support SingleStack, PreferDualStack and RequireDualStack
  #
  ipFamilyPolicy: ""
  # @param service.ipFamilies [array] List of IP families (e.g. IPv4, IPv6) assigned to the service.
  # Ref: https://kubernetes.io/docs/concepts/services-networking/dual-stack/
  # E.g.
  # ipFamilies:
  #   - IPv6
  #   - IPv4
  ipFamilies: []

serviceMonitor:
  enabled: false
  # The namespace where Prometheus expects to find service monitors.
  # namespace: ""
  # Add custom labels to the ServiceMonitor resource
  additionalLabels: {}
    # prometheus: kube-prometheus-stack
  interval: ""
  # honorLabels: true
  insecureSkipVerify: true
  serverName: null
  metricRelabelings: []
    # - action: keep
    #   regex: 'descheduler_(build_info|pods_evicted)'
    #   sourceLabels: [__name__]
  relabelings: []
    # - sourceLabels: [__meta_kubernetes_pod_node_name]
    #   separator: ;
    #   regex: ^(.*)$
    #   targetLabel: nodename
    #   replacement: $1
    #   action: replace
```

What k8s version are you using (kubectl version)?

$ kubectl version
```
Client Version: v1.28.11+rke2r1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.11+rke2r1
```

What did you do?

I defined a default topology spread constraint for the cluster, as described here: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#cluster-level-default-constraints

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /var/lib/rancher/rke2/server/cred/scheduler.kubeconfig
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              nodeTaintsPolicy: Honor
          defaultingType: List
```

I then:

  1. Cordoned one of the three nodes in the cluster
  2. Created a Deployment of 12 pods (a sketch of such a Deployment follows this list). Thanks to the KubeSchedulerConfiguration, I got 6 pods on each of the other two nodes.
  3. Uncordoned the node
  4. Installed descheduler with the config provided above
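
A minimal sketch of the kind of Deployment used in step 2, with illustrative name, labels, and image; note that the pod template carries no topologySpreadConstraints of its own, so spreading relies entirely on the cluster-level default:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-test              # illustrative name
spec:
  replicas: 12
  selector:
    matchLabels:
      app: spread-test
  template:
    metadata:
      labels:
        app: spread-test
    spec:
      # No topologySpreadConstraints here: spreading relies on the
      # cluster-level default in the KubeSchedulerConfiguration above.
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # illustrative image
```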

What did you expect to see?

Descheduler kicks in to enforce the default pod topology spread constraint.

What did you see instead?

Descheduler does nothing. In the logs:

```
I0827 15:15:31.256918       1 descheduler.go:156] Building a pod evictor
I0827 15:15:31.257002       1 topologyspreadconstraint.go:122] Processing namespaces for topology spread constraints
I0827 15:15:31.257732       1 profile.go:349] "Total number of pods evicted" extension point="Balance" evictedPods=0
I0827 15:15:31.257776       1 descheduler.go:170] "Number of evicted pods" totalEvicted=0
```

damemi commented 2 months ago

Hi, this is working as intended. The descheduler does not know about cluster-level default constraints because it does not have access to the cluster's KubeSchedulerConfiguration; it only evaluates the topologySpreadConstraints declared on the pods themselves.
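
A minimal sketch of a workaround, assuming the illustrative Deployment above: declare the same constraint directly in the Deployment's pod template, where both the kube-scheduler and the descheduler can see it.

```yaml
# Fragment of the Deployment's pod template (spec.template.spec);
# the rest of the Deployment is unchanged.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: spread-test         # illustrative label, matching the sketch above
```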

damemi commented 2 months ago

/remove-kind bug
/kind feature

odsevs commented 2 months ago

Thanks for the reply. It's a strange design decision that the KubeSchedulerConfiguration is not accessible from within the cluster, but in that case there is indeed not much you can do. I assume that if this were to become a feature, the configuration would unfortunately have to be duplicated between the kube-scheduler and the descheduler.

damemi commented 2 months ago

Yes, exactly. This has been discussed before if you'd like some more info: https://github.com/kubernetes-sigs/descheduler/issues/609

Essentially, the KubeSchedulerConfiguration isn't an actual API object; it's just a file mounted into the scheduler pod. Users can also deploy custom or multiple schedulers, so there isn't a good way to automatically know which defaults to go with.
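
To illustrate the point, a rough sketch assuming a kubeadm-style control plane (RKE2 wires this differently, and the paths are assumptions): the scheduler configuration is just a file referenced by the kube-scheduler's --config flag and mounted from the host, so nothing in the cluster API exposes it.

```yaml
# Excerpt from a kube-scheduler static pod manifest
# (e.g. /etc/kubernetes/manifests/kube-scheduler.yaml on kubeadm).
spec:
  containers:
    - name: kube-scheduler
      image: registry.k8s.io/kube-scheduler:v1.28.11
      command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-config.yaml   # the KubeSchedulerConfiguration file
      volumeMounts:
        - name: scheduler-config
          mountPath: /etc/kubernetes/scheduler-config.yaml
          readOnly: true
  volumes:
    - name: scheduler-config
      hostPath:
        path: /etc/kubernetes/scheduler-config.yaml        # lives on the control-plane node, not in the API
        type: File
```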