kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

Swap pods to reduce fragmentation #1519


jinglinliang commented 2 weeks ago

Some of our clusters run several small anti-affinity deployments, which cause a lot of fragmentation. Here's a snapshot of one cluster (see attached image): the blue deployment has anti-affinity, and the cluster ended up in this state after the blue deployment restarted.

I'm poking around for solutions to alleviate this situation.

First, the cluster autoscaler (CAS) is not scaling down those low-utilization nodes because none of the blue pods can fit onto the rest of the nodes, which are pretty much fully packed.

I came across the HighNodeUtilization & LowNodeUtilization plugins in the descheduler, but it looks like their eviction logic is similar to CAS's.

I'm wondering if it's possible to implement a new descheduler plugin, or use existing ones, to achieve some kind of swap function that swaps the blue pods with the non-anti-affinity pods on other nodes, so that each of the fully packed nodes can host one blue pod and the swapped-out non-anti-affinity pods can be packed onto far fewer nodes.

Any ideas are appreciated!

ingvagabund commented 1 week ago

Hi @jinglinliang. The issue description looks quite awesome. I love the snapshot picture. I wish there were more such reports :).

Wrt. either of the node utilization strategies, it's ultimately up to the scheduler to make the switch. The descheduler plugins might evict some of the non-anti-affinity pods, yet these pods first need to get scheduled to where the blue pods are. Running LowNodeUtilization might help with that, depending on the pods' priorities and pre-eviction filters. Once enough space is freed, running HighNodeUtilization might evict some of the blue pods. It's a kind of "shaking the nodes", hoping the scheduler will re-distribute the pods towards what's requested here. Yet, it's an iterative process that does not guarantee success. To provide a real swap, kubelets would need the ability to allocate slots.

I presume preemption and priorities do not help since both blue and green pods have the same or very similar priority? I.e., the HighNodeUtilization plugin (with nodeFit disabled) evicting blue pods and having kube-scheduler preempt green pods to free space for the blue ones.
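For reference, nodeFit is toggled through the DefaultEvictor args, so disabling the node-fit check would look roughly like this minimal sketch:

- name: "DefaultEvictor"
  args:
    nodeFit: false # evict pods even when no other node currently has room for them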

With the profiles you can configure something like:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: Round1Low
    # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        thresholds:
          "memory": 20
        targetThresholds:
          "memory": 70
    - name: "DefaultEvictor"
      args:
        ... # evict only green pods
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
  - name: Round1High
    # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
    pluginConfig:
    - name: "HighNodeUtilization"
      args:
        thresholds:
          "memory": 20
    - name: "DefaultEvictor"
      args:
        ... # evict only blue pods
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"
  - name: Round2Low
    # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        thresholds:
          "memory": 20
        targetThresholds:
          "memory": 70
    - name: "DefaultEvictor"
      args:
        ... # evict only green pods
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
  - name: Round2High
    pluginConfig:
    - name: "HighNodeUtilization"
      args:
        thresholds:
          "memory": 20
    - name: "DefaultEvictor"
      args:
        ... # evict only blue pods
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"
...

The shaking would have to be performed multiple times. Yet, the current descheduler is quite quick at evicting pods, so we'd have to implement a timeout between profiles that waits, e.g., for 1 minute (user-configured) before "shaking the nodes" again.
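Until such a per-profile timeout exists, a rough approximation (just a sketch, not a real wait between profiles) is to let the descheduler re-run the whole policy periodically via its descheduling interval, e.g. in the Deployment spec:

containers:
- name: descheduler
  image: registry.k8s.io/descheduler/descheduler:v0.31.0 # version tag is illustrative
  command:
  - /bin/descheduler
  args:
  - --policy-config-file=/policy-dir/policy.yaml
  - --descheduling-interval=5m # re-run the whole policy every 5 minutes

Each run executes all profiles in order, so this only repeats the whole cycle rather than pausing between Round1 and Round2.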

jinglinliang commented 1 week ago

Hi @ingvagabund. Thank you very much for the reply :)

Some clarifications:

  1. Priority does not help here. All deployments have the same priority.
  2. Our goal is to pack each of our clusters as tightly as possible, preferably with ~0 fragmented cores.
  3. The "blue" and "green" pods here are just examples; we have hundreds of deployments spread across thousands of clusters, so we need the descheduler configuration to be generic.

"Shaking the nodes" is an interesting idea but seems very unpredictable.

- name: "DefaultEvictor"
      args:
        ... # evict only blue pods

It would be difficult to define the "blue" or "green" pods here. Also, the clusters may just enter a stable state based on the profile when all nodes are, for example, 50% allocated, and the total number of nodes stays the same as in the snapshot. (Please correct me if I'm wrong.)
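For a single known deployment we could presumably scope the evictor with a label selector, something like the sketch below (the app: green label is hypothetical), but that doesn't generalize across hundreds of deployments:

- name: "DefaultEvictor"
  args:
    labelSelector:
      matchLabels:
        app: green # hypothetical label; we'd need a selector like this per deployment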

Another idea we had is to set the HighNodeUtilization threshold, or similarly, Cluster Autoscaler's scale-down threshold, to 100%, so that the non-blue nodes on the right side will be torn down and their pods will stack on top of the blue ones. However, this could cause lots of unnecessary pod disruptions.
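For concreteness, that would look roughly like the sketch below; with thresholds at 100%, every node that isn't fully requested counts as underutilized, so all of its pods become eviction candidates, hence the disruption concern:

- name: "HighNodeUtilization"
  args:
    thresholds:
      "cpu": 100 # any node below 100% requested CPU is considered underutilized
      "memory": 100 # same for memory; pods on such nodes are candidates for eviction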