opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

Kubelet fails to invoke runc to delete residual cgroup resources in pods. #4181

Closed bugsorry closed 9 months ago

bugsorry commented 9 months ago

Description

This problem surfaced as high CPU usage in the kubelet process. So I did a performance check with go pprof and found that kubelet spent most of its time calling Destroy.

Checking the kubelet log, I found that "Orphaned pod found, moving pod cgroups" is printed repeatedly, with the same pod IDs to be deleted each time. So I searched the cgroup hierarchy for the directories corresponding to those pod IDs, and found that the pods' resources remained under the files subsystem of the cgroup and failed to be deleted on every attempt.

To find out why the deletion was not successful, I went into the kubelet code. Kubelet invokes libcontainercgroups.GetCgroupMounts to obtain the cgroup subsystems, traverses them to obtain all pod IDs, and compares them with the working pods to find orphaned pods. It then calls manager.Destroy to delete the resources of the orphaned pods. GetCgroupMounts discovers the subsystems present in the environment through /proc/self/cgroup. However, when deleting, manager.Destroy joins the pod path with each of the subsystems defined in a global subsystems variable, which is fixed at build time:

```go
var subsystems = []subsystem{
	&CpusetGroup{},
	&DevicesGroup{},
	&MemoryGroup{},
	&CpuGroup{},
	&CpuacctGroup{},
	&PidsGroup{},
	&BlkioGroup{},
	&HugetlbGroup{},
	&NetClsGroup{},
	&NetPrioGroup{},
	&PerfEventGroup{},
	&FreezerGroup{},
	&RdmaGroup{},
	&NameGroup{GroupName: "name=systemd", Join: true},
}
```
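For context, here is a minimal sketch (not runc's actual implementation) of the kind of discovery that getCgroupMountsV1 ultimately relies on: parsing the controller lists out of /proc/self/cgroup. Any controller the running kernel exposes, including a custom one such as files, shows up here.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each cgroup v1 line is "hierarchy-ID:controller-list:cgroup-path",
		// e.g. "4:memory:/kubepods/..." or "3:cpu,cpuacct:/...".
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) != 3 || parts[1] == "" {
			continue // skip malformed lines and the v2 unified entry ("0::/...")
		}
		for _, ctrl := range strings.Split(parts[1], ",") {
			fmt.Println(ctrl) // on the reporter's OS this list would include "files"
		}
	}
}
```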

Therefore, when runc deletes cgroup resources, the list of subsystems is fixed and does not include the files subsystem. Kubelet, however, obtains subsystems through getCgroupMountsV1, which, as mentioned above, queries /proc/self/cgroup. Kubelet therefore sees the files subsystem and creates the pod's resource directory in it when the pod is created. After the pod is deleted, kubelet detects the orphaned pod and invokes runc to delete it, but runc never deletes the resources in the files subsystem. As a result, pod resources remain, kubelet keeps detecting them, and when many residual pods accumulate, the number of kubelet system calls grows and eats into CPU resources.

The files subsystem is a customized subsystem in my environment and is not among the subsystems defined by runc. So on a specialized OS, this problem can in theory occur with k8s whenever the kernel exposes a cgroup subsystem that is not in runc's subsystem list. Should there be compatibility handling for this scenario?

Steps to reproduce the issue

1. Use an OS whose kernel exposes a cgroup v1 subsystem beyond those runc defines.
2. Deploy a Kubernetes cluster, run some pods, and then delete those pods.
3. With the kubelet log level set to 3 or higher, the logs will contain messages about the orphaned pods (the residue can also be checked directly, as in the sketch below).
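One way to confirm the residue from step 3 directly is to look for leftover pod directories under the custom controller's hierarchy. A sketch, assuming the cgroupfs driver layout and the reporter's /sys/fs/cgroup/files mount point:

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

func main() {
	// Assumed layout: <controller mount>/kubepods/... (cgroupfs driver).
	root := "/sys/fs/cgroup/files/kubepods"
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // layouts vary; skip anything unreadable
		}
		if d.IsDir() && strings.HasPrefix(d.Name(), "pod") {
			fmt.Println("residual pod cgroup:", path)
		}
		return nil
	})
}
```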

Describe the results you received and expected

I think runc should be compatible with this particular cgroup subsystem scenario.

What version of runc are you using?

runc v1.1.3
kubernetes v1.25.3

Host OS information

NAME="EulerOS"
VERSION="2.0 (SP12x86_64)"
ID="euleros"
VERSION_ID="2.0"
PRETTY_NAME="EulerOS 2.0 (SP12x86_64)"
ANSI_COLOR="0;31"

Host kernel information

Linux kube2 5.10.0-136.12.0.86.h1032.eulerosv2r12.x86_64 #1 SMP Wed Jun 28 18:34:50 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

lifubang commented 9 months ago

I looked into the code you provided, and I think this may be a bug, because for cgroup v1 there is no comparison of /proc/self/cgroup against the subsystems in GetCgroupMounts. I think GetCgroupMounts should only return the cgroup subsystem mount points that runc cares about. @kolyshkin WDYT?
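A minimal sketch of that filtering idea, with Mount and knownSubsystems as hypothetical stand-ins rather than runc's real types:

```go
package main

import "fmt"

// Mount is a stand-in for a cgroup v1 mount entry.
type Mount struct {
	Mountpoint string
	Subsystems []string
}

// knownSubsystems mirrors the controllers runc manages.
var knownSubsystems = map[string]bool{
	"cpuset": true, "devices": true, "memory": true, "cpu": true,
	"cpuacct": true, "pids": true, "blkio": true, "hugetlb": true,
	"net_cls": true, "net_prio": true, "perf_event": true,
	"freezer": true, "rdma": true,
}

// filterMounts drops controllers runc does not manage, so a caller like
// kubelet never sees (and never populates) a custom subsystem such as "files".
func filterMounts(all []Mount) []Mount {
	var out []Mount
	for _, m := range all {
		var kept []string
		for _, s := range m.Subsystems {
			if knownSubsystems[s] {
				kept = append(kept, s)
			}
		}
		if len(kept) > 0 {
			m.Subsystems = kept
			out = append(out, m)
		}
	}
	return out
}

func main() {
	mounts := []Mount{
		{Mountpoint: "/sys/fs/cgroup/memory", Subsystems: []string{"memory"}},
		{Mountpoint: "/sys/fs/cgroup/files", Subsystems: []string{"files"}}, // custom: dropped
	}
	fmt.Printf("%+v\n", filterMounts(mounts))
}
```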

> The files subsystem is a customized subsystem in my environment and is not among the subsystems defined by runc.

@bugsorry Could you provide the detailed steps to customize a cgroup subsystem?

bugsorry commented 9 months ago

> @bugsorry Could you provide the detailed steps to customize a cgroup subsystem?

@lifubang Sorry, customizing a cgroup subsystem requires modifying the OS kernel code, and I don't know how that was done; I just used someone else's customized OS to build a k8s cluster. From a code-design point of view this problem does occur, and it does occur in my environment. I'm sorry I can't provide the custom OS image, because it is a product of my company. Could you ask a kernel module expert how to implement a simple cgroup subsystem?

kolyshkin commented 9 months ago

@bugsorry What is the files subsystem? I've never heard of it.

kolyshkin commented 9 months ago

Well, this is kind of a chicken-and-egg problem. You added a custom cgroup subsystem that runc knows nothing about and expect things to work. Alas, this is not possible in cgroup v1.

My suggestion: if you want a custom cgroup subsystem, use misc, which is already there and is well known.

Alternatively, use cgroup v2, which solves the issue of cgroup forests (it's a single unified tree now, so there is no need to, e.g., delete multiple directories).
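A sketch of the difference being pointed out, with illustrative paths ("pod&lt;uid&gt;" is a placeholder, and the exact layout depends on the cgroup driver): under v1 every controller hierarchy holds its own copy of the pod directory, and each must be removed; under v2 the pod has exactly one directory in the unified tree.

```go
package main

import "fmt"

func main() {
	// cgroup v1: one directory per controller hierarchy, each needing rmdir.
	v1 := []string{
		"/sys/fs/cgroup/cpu/kubepods/pod<uid>",
		"/sys/fs/cgroup/memory/kubepods/pod<uid>",
		"/sys/fs/cgroup/files/kubepods/pod<uid>", // leaks if the tool doesn't know "files"
	}
	for _, p := range v1 {
		fmt.Println("v1 must rmdir:", p)
	}

	// cgroup v2: a single directory in the unified hierarchy.
	fmt.Println("v2 must rmdir:", "/sys/fs/cgroup/kubepods.slice/pod<uid>")
}
```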

With this, I think we can close it as WONTFIX (we do not support custom cgroup controllers in v1).