Open jinglinliang opened 2 weeks ago
Hi @jinglinliang. The issue description looks quite awesome. I love the snapshot picture. I wish there were more such reports :).
Wrt. either of the node utilization strategies, it's ultimately up to the scheduler to make the switch. The descheduler plugins might evict some of the non-anti-affinity pods, yet those pods first need to get scheduled to where the blue pods are. Running LowNodeUtilization might help with that, depending on the pods' priorities and pre-eviction filters. Once enough space is freed, running HighNodeUtilization might evict some of the blue pods. A kind of "shaking the nodes", hoping the scheduler will redistribute the pods towards what's requested here. Yet, it's an iterative process that does not guarantee success. To provide a real swap, kubelets would need the ability to allocate slots.
I presume preemption and priorities do not help since both blue and green pods have the same or very similar priority? I.e. the HighNodeUtilization plugin (with nodeFit disabled) evicting blue pods and having the kube-scheduler preempt green pods to free space for blue ones.
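The nodeFit toggle mentioned here lives on the DefaultEvictor plugin; a minimal sketch of that variant (thresholds are illustrative only):

```yaml
pluginConfig:
- name: "HighNodeUtilization"
  args:
    thresholds:
      "memory": 20
- name: "DefaultEvictor"
  args:
    nodeFit: false  # evict even when no node currently has room for the pod
```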
With the profiles you can configure something like:
```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: Round1Low
  # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
  pluginConfig:
  - name: "LowNodeUtilization"
    args:
      thresholds:
        "memory": 20
      targetThresholds:
        "memory": 70
  - name: "DefaultEvictor"
    args:
      ... # evict only green pods
  plugins:
    balance:
      enabled:
      - "LowNodeUtilization"
- name: Round1High
  # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
  pluginConfig:
  - name: "HighNodeUtilization"
    args:
      thresholds:
        "memory": 20
  - name: "DefaultEvictor"
    args:
      ... # evict only blue pods
  plugins:
    balance:
      enabled:
      - "HighNodeUtilization"
- name: Round2Low
  # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
  pluginConfig:
  - name: "LowNodeUtilization"
    args:
      thresholds:
        "memory": 20
      targetThresholds:
        "memory": 70
  - name: "DefaultEvictor"
    args:
      ... # evict only green pods
  plugins:
    balance:
      enabled:
      - "LowNodeUtilization"
- name: Round2High
  pluginConfig:
  - name: "HighNodeUtilization"
    args:
      thresholds:
        "memory": 20
  - name: "DefaultEvictor"
    args:
      ... # evict only blue pods
  plugins:
    balance:
      enabled:
      - "HighNodeUtilization"
...
```
Perform the shaking multiple times. Yet, the current descheduler is quite quick in evicting pods, so we'd have to implement a timeout between profiles that waits, e.g., for 1 minute (user-configured) before "shaking the nodes" again.
Hi @ingvagabund. Thank you very much for the reply :)
Some clarifications:
"Shaking the nodes" is an interesting idea but seems very unpredictable.
```yaml
- name: "DefaultEvictor"
  args:
    ... # evict only blue pods
```
It would be difficult to define the "blue" or "green" pods here. Also, the clusters may just enter a stable state based on the profile when all nodes are, for example, 50% allocated, and the total number of nodes stays the same as in the snapshot. (Please correct me if I'm wrong.)
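One note on filling in the elided evictor args: if the pods do carry distinguishing labels, DefaultEvictor's labelSelector could serve as the filter. A sketch, assuming a hypothetical app label on the non-anti-affinity pods:

```yaml
- name: "DefaultEvictor"
  args:
    labelSelector:    # only pods matching this selector are evictable
      matchLabels:
        app: green    # hypothetical label; not defined anywhere in this thread
```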
Another idea we had is to set the HighNodeUtilization threshold, or similarly Cluster Autoscaler's scale-down threshold, to 100%, so that the non-blue nodes on the right side will be torn down and pods will stack on top of the blue ones. However, this could cause lots of unnecessary pod disruptions.
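For reference, pushing HighNodeUtilization toward maximum compaction would look roughly like this (a sketch; with the threshold at 100%, every node that is not fully utilized is treated as underutilized and its pods become eviction candidates):

```yaml
pluginConfig:
- name: "HighNodeUtilization"
  args:
    thresholds:
      "memory": 100  # any node below 100% memory utilization is considered underutilized
```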
Some of our clusters have small anti-affinity deployments that cause a lot of fragmentation. Here's a snapshot of one cluster: the blue deployment has anti-affinity, and the cluster ended up in this state after the blue deployment restarted.
I'm poking around for solutions to alleviate this situation.
First, cluster autoscaler (CAS) is not scaling down those low utilization nodes because none of the blue pods can fit into the rest of the nodes, which are pretty much fully packed.
I came across the HighNodeUtilization & LowNodeUtilization plugins in the descheduler, but it looks like their eviction logic is similar to CAS's. I'm wondering if it's possible to implement, or use existing, descheduler plugins to achieve some kind of swap function: swap the blue pods with the non-anti-affinity pods on other nodes, so that each of the fully packed nodes can host one blue pod, and the swapped-out non-anti-affinity pods can be packed onto far fewer nodes.
Any ideas are appreciated!