Thank you for your great post, especially the part on how to convert between cpu share and cpu weight; I couldn't find the formula anywhere before.
There are 2 things I'm confused about.
For the Guaranteed Pod example, why does it say "This specific configuration allows our processes to run every 0.2 seconds of every 1 second (1/5th)"? It allows up to 200000 microseconds of CPU time for every 100000-microsecond period, so shouldn't it be 2 seconds of CPU time for every second?
For the example about different burstable pods competing for CPU resources, burstable1 & burstable2, shouldn't they consume all CPU resources, assuming there are no other processes competing for CPU? Specifically, burstable1 should get ~67% (=79/(79+39)) while burstable2 should get ~33% (=39/(79+39)), as the cpu weight of other processes is irrelevant if they are not CPU hungry.
On top of that, I wonder whether we should sum kubepods.slice and kubepods-burstable.slice together at all, as they are at different hierarchy levels; specifically, kubepods-burstable.slice is under kubepods.slice. In other words, shouldn't the minimum CPU share of kubepods-burstable.slice be kubepods-burstable.slice/kubepods.slice when kubepods-besteffort.slice is also competing for CPU resources?
Reference: hierarchy I mentioned
Hey @JoeHO888:
Let me try to explain the guaranteed pod confusion you mentioned a bit better:
The first value (200000) sets a quota of 200000 microseconds (or 0.2 seconds) for how long the process can run during a single period. The second value (100000) defines the period's length as 100000 microseconds (or 0.1 seconds).
With these values, the processes can run for up to 0.2 seconds during a 0.1-second period (1/5th of the period). Once they consume their time quota, they will be throttled and won't be allowed to run until the next period starts. This configuration effectively allows processes to run every 0.2 seconds of every 1 second (1/5th).
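If you want to check the raw values on a node, they live in the pod cgroup's cpu.max file. A minimal sketch (the pod slice name below is a placeholder, take the real one from systemd-cgls; the standard /sys/fs/cgroup mount is assumed):

# cgroupsv2: the first value is the quota (microseconds of CPU time allowed per period),
# the second is the period length in microseconds
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<POD_UID>.slice/cpu.max
# 200000 100000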
For the burstable pod example:
Keep in mind that pods running inside the same parent slice can compete for resources. In this situation, when they're competing for resources, the `total cpu.weight` comes from summing all their parent cgroup cpu weights.
Burstable1 and Burstable2 pods slices are created under the kubepods-burstable.slice.
The parent of kubepods-burstable.slice is kubepods.slice, which means that in order to get the total cpu.weight we need to sum the weights of the parent cgroups for the burstable pods.
kubepods-burstable.slice -> cpu.weight 86
kubepods.slice -> cpu.weight 137
137+86 = 223
The 223 you see on the calculations comes from there.
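Those weights can be read directly from the cgroup filesystem on the node (assuming the default cgroupsv2 mount at /sys/fs/cgroup):

# cpu.weight of kubepods.slice and of its burstable child slice
cat /sys/fs/cgroup/kubepods.slice/cpu.weight
# 137
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/cpu.weight
# 86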
And answering the last part of your question:
The link you shared refers to cgroupsv1. In cgroupsv2 you can see that kubepods-burstable.slice is a child of kubepods.slice
# systemd-cgls -u kubepods.slice | grep slice
Unit kubepods.slice (/kubepods.slice):
├─kubepods-pod2667544d_8e6c_4182_9460_5873bf631057.slice (#8120)
├─kubepods-burstable.slice (#6073)
│ ├─kubepods-burstable-pod2523e459_aefd_48ed_a1d2_389dd2ffc093.slice (#7042)
│ └─kubepods-burstable-pod3629b884_7f27_405b_a89f_5871767c585a.slice (#6219)
└─kubepods-besteffort.slice (#6146)
├─kubepods-besteffort-pod51ef24aa_7fcf_469e_a896_56e09066991c.slice (#6375)
├─kubepods-besteffort-podf63a96be_c56f_4880_8afb_9b321db3df61.slice (#6302)
├─kubepods-besteffort-podd4e7731b_7cfe_4be7_b9e6_1870510f7ad7.slice (#8982)
└─kubepods-besteffort-pod88343508_49e7_4c58_a1b5_99a1aabc1661.slice (#7261)
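If it helps, a small loop like the one below (again assuming the default /sys/fs/cgroup mount) prints the cpu.weight of kubepods.slice and each of its child slices, so you can see the hierarchy and the weights side by side:

# dump cpu.weight for kubepods.slice and every slice directly below it
for f in /sys/fs/cgroup/kubepods.slice/cpu.weight \
         /sys/fs/cgroup/kubepods.slice/*.slice/cpu.weight; do
  echo "$f -> $(cat "$f")"
done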
Sorry for the delay, hope that helps.
Hi @mvazquezc
No worries, thanks for your reply.
Regarding your response on the cpu limit, I still cannot reproduce it.
I think the maximum CPU limit is actually 2 (=200000/100000), i.e. the process can use at most 2 CPUs' worth of time every second. Of course, if the process is single-threaded, it can only run on 1 CPU. Do you have any idea about that?
The first value (200000) sets a quota of 200000 microseconds (or 0.2 seconds) for how long the process can run during a single period. The second value (100000) defines the period's length as 100000 microseconds (or 0.1 seconds).
With these values, the processes can run for up to 0.2 seconds during a 0.1-second period (1/5th of the period). Once they consume their time quota, they will be throttled and won't be allowed to run until the next period starts. This configuration effectively allows processes to run every 0.2 seconds of every 1 second (1/5th).
200000 microseconds cpu time for every 100000 microseconds
Apply 90% cpu load on 2 CPUs, in total 180% load
No throttling
180% CPU load
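For reference, a test along those lines could be reproduced roughly like this (stress-ng is just one possible load generator, and the pod slice path is a placeholder):

# inside the container: ~90% load on 2 CPU workers for 2 minutes
stress-ng --cpu 2 --cpu-load 90 --timeout 120s

# on the node: check the pod cgroup for throttling; nr_throttled and
# throttled_usec stay at 0 as long as the quota is never exhausted
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<POD_UID>.slice/cpu.stat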
@mvazquezc
As for your explanation of cpu weight, my experiment results don't align with it.
For the process with 79 cpu weight, I think the formula should be 79/(79+39) * (kubepods-burstable.slice weight / sum of weights of all CPU-hungry groups under kubepods.slice) * (kubepods.slice weight / sum of weights of all CPU-hungry groups under the root cgroup). Assuming there are no CPU-hungry groups besides our testing containers, the formula becomes 79/(79+39) * (kubepods-burstable.slice / kubepods-burstable.slice) * (kubepods.slice / kubepods.slice), i.e. just 79/(79+39), as all CPU-hungry containers are in the kubepods-burstable.slice cgroup, which in turn is in kubepods.slice.
Similarly for container with 39 as cpu.weight
In other words, the cpu share of a process is its ratio relative to the rest of the root cgroup, excluding those processes which don't require CPU at that moment.
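Putting numbers on that, under the assumption that our two test containers are the only CPU-hungry processes anywhere in the tree:

79/(79+39) = 79/118 ≈ 0.67 -> the container with weight 79 should get ~67% of the node's CPU
39/(79+39) = 39/118 ≈ 0.33 -> the container with weight 39 should get ~33% of the node's CPU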
Keep in mind that pods running inside the same parent slice can compete for resources. In this situation, when they're competing for resources, the `total cpu.weight` comes from summing all their parent cgroup cpu weights.
Burstable1 and Burstable2 pods slices are created under the kubepods-burstable.slice.
The parent of kubepods-burstable.slice is kubepods.slice, which means that in order to get the total cpu.weight we need to sum the weights of the parent cgroups for the burstable pods.
kubepods-burstable.slice -> cpu.weight 86
kubepods.slice -> cpu.weight 137
137+86 = 223
The 223 you see on the calculations comes from there.
CPU Weight & Group Hierarchy:
My machine has 4 CPUs & I apply the same CPU load testing on the pods with 39 CPU weight & 79 CPU weight, and I can see roughly 4 * 67% (=79/(39+79)) load & 4 * 33% (=39/(39+79)) load.
CPU Numbers:
Load Testing:
CPU Usage:
Pod config in my experiment:
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu1
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: ubuntu1
    resources:
      requests:
        cpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu2
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: ubuntu2
    resources:
      requests:
        cpu: "2"
Hey @JoeHO888 I've been running some tests and I believe my formula is not correct (at least not the part of the parent weights). I got close numbers to the formula when not summing kubepods.slice + kubepods-burstable.slice weights.
For example, in my 8-core worker I have two burstable pods, one requesting 2 CPUs and the other requesting 1 CPU. The weights are:
1 cpu: 39
2 cpus: 79
burstable.slice: 125
kubepods.slice: 313
If I use the kubepods.slice + burstable.slice weights in the formula, the numbers do not match. If I use only the burstable.slice weight, the numbers are close. Example:
Burstable 1 - (39/125) * 8 = 2.496 CPUs or ~250%
Burstable 2 - (79/125) * 8 = 5.056 CPUs or ~506%
In each pod I run the following command: cpuload -p 100 -c 8
On the node:
430426 9999 20 0 5160 2304 1792 S 545.8 0.0 41:44.59 cpuload
430407 9999 20 0 5416 2304 1792 S 247.2 0.0 23:44.51 cpuload
Could you verify in your env? If that's the case, I need to update the formula description and examples.
Apologies again for the late reply, it took a while for me to find time to test this. And thanks again for your detailed explanation, it helped a lot to reproduce the "issue".
Hi @mvazquezc,
I just tested again, the result matches yours.
My environment is slightly different from yours, but I think it doesn't matter.
My environment: my worker has only 4 CPUs in total.
1 cpu: 39
2 cpus: 79
burstable.slice: 125
kubepods.slice: 157
Despite getting the same result as yours, I believe your formula is incorrect; the denominator should be 39+79.
Here's a counter-example. I create 3 pods in total: podA requests 25m CPU, podB requests 50m CPU, but podC requests 1000m CPU. If I load test only podA & podB, their CPU usage will be 4 * 33% (=1/(1+2)) load & 4 * 67% (=2/(1+2)) respectively. In other words, only the cpu.weight of the CPU-demanding pods matters.
CPU weight:
CPU Usage:
Counter-example config:
apiVersion: v1
kind: Pod
metadata:
  name: poda
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: poda
    resources:
      requests:
        cpu: 25m
---
apiVersion: v1
kind: Pod
metadata:
  name: podb
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: podb
    resources:
      requests:
        cpu: 50m
---
apiVersion: v1
kind: Pod
metadata:
  name: podc
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: podc
    resources:
      requests:
        cpu: 1000m
Hey @JoeHO888, you're right. Just reviewed the docs and your formula is the correct one. I'll update the docs.
Basically, from my previous example:
1 cpu: 39
2 cpus: 79
burstable.slice: 125
kubepods.slice: 313
CPU Allocation = (cpu.weight / (sum of cpu.weight of the CPU-hungry processes inside the parent slice)) * 100 * NUMCPUs.
Assuming inside kubepods-burstable.slice we only have these two pods competing for the CPU:
CPU for Burstable with 1 CPU: (39/(39+79)) * 100 * 8 = 264.40%
CPU for Burstable with 2 CPUs: (79/(39+79)) * 100 * 8 = 535.59%
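A quick way to sanity-check this on a node is to read the two pod weights and plug them into the corrected formula; a rough sketch (the pod slice names are placeholders):

BURSTABLE=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice
W1=$(cat "$BURSTABLE/kubepods-burstable-pod<UID_1CPU>.slice/cpu.weight")   # 39
W2=$(cat "$BURSTABLE/kubepods-burstable-pod<UID_2CPU>.slice/cpu.weight")   # 79
NUMCPUS=8
# expected CPU% for each pod when both are CPU hungry
awk -v w1="$W1" -v w2="$W2" -v n="$NUMCPUS" \
  'BEGIN { printf "1-CPU pod ~%.2f%%  2-CPU pod ~%.2f%%\n", w1/(w1+w2)*100*n, w2/(w1+w2)*100*n }'
# 1-CPU pod ~264.41%  2-CPU pod ~535.59%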
@JoeHO888, just updated the post, feel free to review this section: https://linuxera.org/cpu-memory-management-kubernetes-cgroupsv2/#how-kubepods-cgroups-compete-for-resources
Thanks again!
Hey @mvazquezc, my pleasure to contribute. I learnt a lot from your post, especially how to convert cpu.weight to cpu.share and how to do load testing :)
Thanks for writing such a good article!