Description
A customer reported an issue when attempting to join a running container inside Kubernetes (kubectl exec ...). The container runs a real-time application that takes advantage of the cores allocated to it: the application uses the first CPU core of the allocated range for a slow thread (SCHED_OTHER policy) responsible for spawning RT threads (SCHED_FIFO policy), each running on its own core.
They have configured Kubernetes to allocate CPU cores only within a specific range (all marked as isolated CPUs). They use the Kubernetes CPU manager with the static policy and have excluded all housekeeping CPUs from being allocated to a pod/container. Their machine is configured like this:
For the kernel command line:
Relevant sysctl:
The customer used this configuration successfully until RHEL 8.4. With the introduction of this patch in 8.4, a process entering a cgroup cpuset (runc in this context) is assigned to a random CPU of that cpuset. Before the patch, runc was always scheduled on the first CPU core of the cgroup cpuset, which worked fine because that core was used by the slow thread running under the SCHED_OTHER policy. Since the introduction of the kernel patch, runc is randomly scheduled on a core that can be fully occupied by an RT thread running under the SCHED_FIFO policy, and with kernel.sched_rt_runtime_us=-1 there is no room for runc to execute, so the process gets stuck. When this occurs, some other processes were observed to become unresponsive; so far, systemd (pid 1) was also seen stuck in a kernel call to proc_cgroup_show.
This is a corner-case issue, but serious enough to lock up a system.
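One conceivable mitigation for the behaviour described above is to pin the joining process to the first CPU of the target cpuset before it enters the cgroup, so that it lands on the core running the SCHED_OTHER thread. The following is only a minimal Python sketch of that idea, not runc's code and not necessarily what the pending patch does. The helper names parse_cpu_list and pin_to_first_cpu are hypothetical, and the sketch assumes Linux, where os.sched_setaffinity is available.

```python
import os


def parse_cpu_list(spec):
    """Expand a cpuset list string such as "2-3,5" into a sorted list of CPU ids."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_first_cpu(spec, pid=0):
    """Restrict pid (0 = calling process) to the first CPU of the cpuset list.

    Assumption: the first CPU is the one running the SCHED_OTHER thread,
    as in the setup described in this report.
    """
    cpus = parse_cpu_list(spec)
    if not cpus:
        raise ValueError("empty CPU list")
    first = cpus[0]
    os.sched_setaffinity(pid, {first})  # Linux-only call
    return first


if __name__ == "__main__":
    # "2-3,5" is the range used by the reproducer. Only the parsing is run
    # here, because pinning requires those CPUs to exist on the machine.
    print(parse_cpu_list("2-3,5"))
```

Whether pinning before the cpuset transition is enough depends on how the kernel picks the CPU when the task is moved, so this should be read as an illustration of the idea rather than a verified fix.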
Steps to reproduce the issue
Please find attached an archive with a reproducer based on vagrant/libvirt: cpuset-issue-runc-repro.tar.gz
Decompress the archive and run:
vagrant up && vagrant halt && vagrant up
Then open a terminal in the vagrant VM with vagrant ssh and execute:
./reproducer.sh install
./reproducer.sh run 2-3,5
In another vagrant VM terminal, run ./reproducer.sh exec sh. The command should get stuck, and the system with it: you shouldn't be able to open another vagrant terminal with vagrant ssh until the command in the first terminal is interrupted.
If you retry by running ./reproducer.sh run 2-3,5 in the first terminal, but this time ./reproducer.sh exec-patch sh in the second terminal, the system operates correctly (a PR with the patch is in progress).
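While the reproducer is hung, it can help to check which CPU a given process last ran on and under which scheduling policy. The sketch below is a hedged illustration, not part of the attached reproducer. It parses /proc/<pid>/stat, relying on the field numbering documented in proc(5), where field 39 is the last CPU and field 41 is the scheduling policy. The names parse_stat_line and POLICY_NAMES are hypothetical.

```python
# Scheduling policy numbers as defined by the Linux kernel.
POLICY_NAMES = {0: "SCHED_OTHER", 1: "SCHED_FIFO", 2: "SCHED_RR"}


def parse_stat_line(line):
    """Extract pid, comm, state, last CPU and policy from a /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    # fields[0] is field 3 (state) in proc(5), so field N is fields[N - 3].
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": fields[0],
        "cpu": int(fields[39 - 3]),
        "policy": int(fields[41 - 3]),
    }


if __name__ == "__main__":
    # Inspect the current process; replace "self" with the pid of interest.
    with open("/proc/self/stat") as f:
        info = parse_stat_line(f.read())
    policy = POLICY_NAMES.get(info["policy"], str(info["policy"]))
    print(f'{info["comm"]} (pid {info["pid"]}): cpu {info["cpu"]}, {policy}')
```

Pointing it at the stuck runc process, and at the application's threads through /proc/<pid>/task/<tid>/stat, would show whether runc landed on a core held by a SCHED_FIFO thread.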
Describe the results you received and expected
The system gets stuck instead of operating correctly.
What version of runc are you using?
runc 1.0.2 (the exact version doesn't really matter here)
Host OS information
RHEL 8.X
Host kernel information
RHEL 8.X kernels