mvazquezc / mvazquezc.github.io

My personal blog

cpu-memory-management-kubernetes-cgroupsv2/ #3

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

CPU and Memory Management on Kubernetes with Cgroupsv2 | Linuxera

CPU and Memory Management on Kubernetes with Cgroupsv2 In this post I’ll try to explain how CPU and Memory management works under the hood on Kubernetes. If you ever wondered what happens when you set requests and limits for your pods, keep reading! Attention This is the result of my exploratory work around cgroupsv2 and their application to Kubernetes. Even though I tried really hard to make sure the information in this post is accurate, I’m far from being an expert on the topic and some information may not be 100% accurate.

https://linuxera.org/cpu-memory-management-kubernetes-cgroupsv2/

JoeHO888 commented 1 year ago

Thank you for your great post, especially the part on how to convert between cpu share and cpu weight; I couldn't find the formula anywhere before.

There are 2 things I'm confused about.

For the Guaranteed Pod example, why is it "This specific configuration allows our processes to run every 0.2 seconds of every 1 second (1/5th)"? It allows up to 200000 microseconds for every 100000, so shouldn't it be 2 seconds for every 1 second?

For the example about different burstable pods competing for CPU resources, burstable1 & burstable2, shouldn't they consume all CPU resources, assuming there are no other processes competing for CPU? Specifically, burstable1 should get ~67% (=79/(79+39)) while burstable2 should get ~33% (=39/(79+39)), as the cpu weights of other processes are irrelevant if they are not CPU hungry.

On top of that, I wonder whether we should sum kubepods.slice and kubepods-burstable.slice together, as they are at different hierarchy levels; specifically, kubepods-burstable.slice is under kubepods.slice. In other words, wouldn't the minimum CPU of kubepods-burstable.slice be kubepods-burstable.slice/kubepods.slice when kubepods-besteffort.slice is also competing for CPU resources?

Reference: hierarchy I mentioned

mvazquezc commented 1 year ago

Hey @JoeHO888:

Let me try to explain the guaranteed pod confusion you mentioned a bit better:

The first value (200000) sets a quota of 200000 microseconds (or 0.2 seconds) for how long the process can run during a single period. The second value (100000) defines the period's length as 100000 microseconds (or 0.1 seconds).

With these values, the processes can run for up to 0.2 seconds during a 0.1-second period (1/5th of the period). Once they consume their time quota, they will be throttled and won't be allowed to run until the next period starts. This configuration effectively allows processes to run every 0.2 seconds of every 1 second (1/5th).
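(To inspect the raw values on a node, you can read the pod cgroup's cpu.max file; the path below is only illustrative, the pod slice name will differ on your cluster:)

# cpu.max holds "<quota> <period>" in microseconds ("max" means no quota);
# for the guaranteed pod example above you would expect something like:
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod2667544d_8e6c_4182_9460_5873bf631057.slice/cpu.max
# 200000 100000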

For the burstable pod example:

Keep in mind that pods running inside the same parent slice can compete for resources. In this situation, when they're competing for resources, the `total cpu.weight` will be the result of summing all their parent cgroup cpu weights.

Burstable1 and Burstable2 pod slices are created under kubepods-burstable.slice.

The parent of kubepods-burstable.slice is kubepods.slice, which means that in order to get the total cpu.weight we need to sum the weights of the parent cgroups for the burstable pods.

kubepods-burstable.slice -> cpu.weight 86
kubepods.slice -> cpu.weight 137

137+86 = 223

The 223 you see in the calculations comes from there.
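(For reference, these weights can be read straight from the cgroup v2 filesystem on the node; the values shown are the ones from this example:)

cat /sys/fs/cgroup/kubepods.slice/cpu.weight                            # 137
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/cpu.weight   # 86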

And answering the last part of your question:

The link you shared refers to cgroupsv1. In cgroupsv2 you can see that kubepods-burstable.slice is a child of kubepods.slice:

# systemd-cgls -u kubepods.slice | grep slice

Unit kubepods.slice (/kubepods.slice):
├─kubepods-pod2667544d_8e6c_4182_9460_5873bf631057.slice (#8120)
├─kubepods-burstable.slice (#6073)
│ ├─kubepods-burstable-pod2523e459_aefd_48ed_a1d2_389dd2ffc093.slice (#7042)
│ └─kubepods-burstable-pod3629b884_7f27_405b_a89f_5871767c585a.slice (#6219)
└─kubepods-besteffort.slice (#6146)
  ├─kubepods-besteffort-pod51ef24aa_7fcf_469e_a896_56e09066991c.slice (#6375)
  ├─kubepods-besteffort-podf63a96be_c56f_4880_8afb_9b321db3df61.slice (#6302)
  ├─kubepods-besteffort-podd4e7731b_7cfe_4be7_b9e6_1870510f7ad7.slice (#8982)
  └─kubepods-besteffort-pod88343508_49e7_4c58_a1b5_99a1aabc1661.slice (#7261)

Sorry for the delay, hope that helps.

JoeHO888 commented 1 year ago

Hi @mvazquezc

No worries, thanks for your reply.

For your response on the cpu limit, I still cannot reproduce it.

I think the maximum CPU limit is actually 2 (=200000/100000), i.e. the process can run on at most 2 CPUs every 1 second. Of course, if the process is single-threaded, it can only use 1 CPU. Do you have any idea on that?

> The first value (200000) sets a quota of 200000 microseconds (or 0.2 seconds) for how long the process can run during a single period. The second value (100000) defines the period's length as 100000 microseconds (or 0.1 seconds).
>
> With these values, the processes can run for up to 0.2 seconds during a 0.1-second period (1/5th of the period). Once they consume their time quota, they will be throttled and won't be allowed to run until the next period starts. This configuration effectively allows processes to run every 0.2 seconds of every 1 second (1/5th).

200000 microseconds cpu time for every 100000 microseconds: [image]

Apply 90% cpu load on 2 CPUs, in total 180% load: [image]

No throttling: [image]

180% CPU load: [image]
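(A rough sketch of how to reproduce this check; the cpuload flags are assumed from their use later in this thread, and <pod-slice> is a placeholder:)

# inside the container: ~90% load on 2 CPUs
cpuload -p 90 -c 2
# on the node: cgroup v2 keeps throttling counters in cpu.stat
grep -E 'nr_throttled|throttled_usec' /sys/fs/cgroup/kubepods.slice/<pod-slice>/cpu.stat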

JoeHO888 commented 1 year ago

@mvazquezc

For your explanation on cpu weight, my experiment result doesn't align with that.

For the process with 79 cpu weight, I think the formula should be: 79/(79+39) × (kubepods-burstable.slice weight / sum of weights of all cpu-hungry groups under kubepods.slice) × (kubepods.slice weight / sum of weights of all cpu-hungry groups under the root cgroup). Assuming there are no cpu-hungry groups besides our testing containers, the formula becomes 79/(79+39) × (kubepods-burstable.slice/kubepods-burstable.slice) × (kubepods.slice/kubepods.slice), i.e. 79/(79+39), as all cpu-hungry containers are in the kubepods-burstable.slice cgroup, which is in kubepods.slice.

Similarly for the container with 39 as its cpu.weight.

In other words, the CPU share of a process is its relative ratio within the root cgroup, excluding those processes which don't require CPU at that moment.

> Keep in mind that pods running inside the same parent slice can compete for resources. In this situation, when they're competing for resources, the total cpu.weight will be the result of summing all their parent cgroup cpu weights.
>
> Burstable1 and Burstable2 pod slices are created under kubepods-burstable.slice.
>
> The parent of kubepods-burstable.slice is kubepods.slice, which means that in order to get the total cpu.weight we need to sum the weights of the parent cgroups for the burstable pods.
>
> kubepods-burstable.slice -> cpu.weight 86
> kubepods.slice -> cpu.weight 137
>
> 137+86 = 223
>
> The 223 you see in the calculations comes from there.

CPU Weight & Group Hierarchy: [images]

My machine has 4 CPUs & I applied the same CPU load test to the pods with 39 CPU weight & 79 CPU weight, and I can see roughly 4 × 67% (=79/(39+79)) load & 4 × 33% (=39/(39+79)) load.

CPU Numbers: [image]

Load Testing: [images]

CPU Usage: [image]

Pod config in my experiment:

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu1
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: ubuntu1
    resources:
      requests:
        cpu: "1" 
---        
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu2
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: ubuntu2
    resources:
      requests:
        cpu: "2"
mvazquezc commented 1 year ago

Hey @JoeHO888, I've been running some tests and I believe my formula is not correct (at least not the part about the parent weights). I got numbers close to the formula when not summing the kubepods.slice + kubepods-burstable.slice weights.

For example, in my 8-core worker I have two burstable pods, one requesting 2 CPUs and the other requesting 1 CPU. The weights are:

1 cpu: 39
2 cpus: 79
burstable.slice: 125
kubepods.slice: 313

If I use the kubepods.slice + burstable.slice weights in the formula, the numbers do not match. If I use only the burstable.slice weight, the numbers are close. Example:

Burstable 1 - (39/125) × 8 = 2,496 CPUs or ~250%
Burstable 2 - (79/125) × 8 = 5,056 CPUs or ~505%

In each pod I run the following command: cpuload -p 100 -c 8

In the node:

430426 9999 20 0 5160 2304 1792 S 545.8 0.0 41:44.59 cpuload
430407 9999 20 0 5416 2304 1792 S 247.2 0.0 23:44.51 cpuload

Could you verify in your env? If that's the case, I need to update the formula description and examples.

Apologies again for the late reply, it took a while for me to find some time to test this. And thanks again for your detailed explanation, it helped a lot to reproduce the "issue".

JoeHO888 commented 1 year ago

Hi @mvazquezc,

I just tested again, and the result matches yours.

My environment is slightly different from yours, but I think it doesn't matter.

My environment: my worker has only 4 CPUs in total.

1 cpu: 39
2 cpus: 79
burstable.slice: 125
kubepods.slice: 157

Despite getting the same result as yours, I believe your formula is incorrect; the denominator should be 39+79.

Here's a counter-example. I create 3 pods in total: podA requests 25m CPU, podB requests 50m CPU, and podC requests 1000m CPU. If I load test podA & podB, their CPU usage will be 4 × 33% (=1/(1+2)) load & 4 × 67% (=2/(1+2)). In other words, only the cpu.weight of CPU-demanding pods matters (see the sketch after the config below for where the 1 and 2 come from).

CPU weight: [image]

CPU Usage: [image]

Counter-example config:

apiVersion: v1
kind: Pod
metadata:
  name: poda
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: poda
    resources:
      requests:
        cpu: 25m
---        
apiVersion: v1
kind: Pod
metadata:
  name: podb
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: podb
    resources:
      requests:
        cpu: 50m
---        
apiVersion: v1
kind: Pod
metadata:
  name: podc
spec:
  containers:
  - image: ubuntu
    command: ["tail", "-f", "/dev/null"]
    name: podc
    resources:
      requests:
        cpu: 1000m    
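(The 1 and 2 in the 1/(1+2) ratio above are the cpu.weight values these small requests map to; a rough check with the same shares → weight conversion, assuming Kubernetes converts N millicores to N * 1024 / 1000 shares:)

# shares = millicores * 1024 / 1000 ; weight = 1 + (shares - 2) * 9999 / 262142
echo $(( 1 + (25 * 1024 / 1000 - 2) * 9999 / 262142 ))     # podA (25m)   -> 1
echo $(( 1 + (50 * 1024 / 1000 - 2) * 9999 / 262142 ))     # podB (50m)   -> 2
echo $(( 1 + (1000 * 1024 / 1000 - 2) * 9999 / 262142 ))   # podC (1000m) -> 39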
mvazquezc commented 1 year ago

Hey @JoeHO888, you're right. Just reviewed the docs and your formula is the correct one. I'll update the docs.

Basically, from my previous example:

1 cpu: 39
2 cpus: 79
burstable.slice: 125
kubepods.slice: 313

CPU Allocation = (cpu.weight / (sum of the cpu.weight of the CPU-hungry processes inside the parent slice)) × 100 × NUM_CPUs.

Assuming inside kubepods-burstable.slice we only have these two pods competing for the CPU:

CPU for Burstable with 1 CPU: (39/(39+79)) × 100 × 8 = 264,40%
CPU for Burstable with 2 CPUs: (79/(39+79)) × 100 × 8 = 535,59%
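(Purely as a sketch, the same arithmetic evaluated on the command line:)

awk 'BEGIN { printf "%.2f%%\n", 39/(39+79)*100*8 }'   # ≈ 264.41%
awk 'BEGIN { printf "%.2f%%\n", 79/(39+79)*100*8 }'   # ≈ 535.59%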

mvazquezc commented 1 year ago

@JoeHO888, just updated the post, feel free to review this section: https://linuxera.org/cpu-memory-management-kubernetes-cgroupsv2/#how-kubepods-cgroups-compete-for-resources

Thanks again!

JoeHO888 commented 1 year ago

Hey @mvazquezc, my pleasure to contribute. I learnt a lot from your post, especially how to convert cpu.weight to cpu.share and how to do load testing :)

Thanks for writing such a good article!