Issue joining cgroups cpuset with kernel scheduler task "random" distribution

Description

A customer reported an issue to us when attempting to join a running container in Kubernetes (kubectl exec ...). The container runs a real-time application that takes advantage of the cores allocated to it. The application uses the first CPU core of the allocated range for a slow thread (SCHED_OTHER policy) that is responsible for spawning RT threads (SCHED_FIFO policy), each running on a core.

They have configured Kubernetes to allocate CPU cores only within a specific range (all marked as isolated CPUs): they use the Kubernetes CPU manager with the static policy and have excluded all housekeeping CPUs from being allocated to a pod/container. Their machine is configured as follows:

For the kernel command line:

isolcpus=managed_irq,domain,2-23,26-47 nmi_watchdog=0 nohz=on nohz_full=2-23,26-47
rcu_nocb_poll=1 rcu_nocbs=2-23,26-47 irqaffinity=0,1,24,25
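As a sanity check on the command line above, the following minimal sketch expands the isolated CPU list and derives the housekeeping CPUs. It assumes the machine has 48 logical CPUs (0-47), which is inferred from the ranges and is not stated explicitly in this report; `parse_cpu_list` and `TOTAL_CPUS` are illustrative names, not part of any tool.

```python
def parse_cpu_list(spec: str) -> set:
    """Parse a kernel-style CPU list such as '2-23,26-47' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Assumption: CPUs are numbered 0-47 (inferred from the ranges above).
TOTAL_CPUS = 48

isolated = parse_cpu_list("2-23,26-47")      # isolcpus / nohz_full / rcu_nocbs
irq_affinity = parse_cpu_list("0,1,24,25")   # irqaffinity
housekeeping = set(range(TOTAL_CPUS)) - isolated

print(sorted(housekeeping))          # [0, 1, 24, 25]
print(housekeeping == irq_affinity)  # True
```

Under that assumption, the CPUs left for housekeeping are exactly the ones listed in irqaffinity.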

Relevant sysctl:

kernel.hung_task_timeout_secs = 600
kernel.nmi_watchdog = 0
kernel.sched_rt_runtime_us = -1
vm.stat_interval = 10
kernel.timer_migration = 0
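Of the sysctls above, kernel.sched_rt_runtime_us is the one that matters most for this issue. The sketch below illustrates the arithmetic behind it: with the kernel defaults (sched_rt_runtime_us=950000 out of sched_rt_period_us=1000000), about 5% of each period remains guaranteed for non-RT tasks, while -1 disables that limit entirely. The function name `non_rt_share` is hypothetical and only serves the illustration.

```python
def non_rt_share(rt_runtime_us: int, rt_period_us: int = 1_000_000) -> float:
    """Fraction of each period guaranteed to non-real-time tasks."""
    if rt_runtime_us < 0:
        # -1 disables RT throttling: RT tasks may consume the whole period.
        return 0.0
    return max(0.0, 1.0 - rt_runtime_us / rt_period_us)


print(non_rt_share(950_000))  # kernel default: roughly 0.05
print(non_rt_share(-1))       # this report's setting: nothing is reserved
```

With the setting used here, a SCHED_OTHER task placed on a core saturated by a SCHED_FIFO thread has no guaranteed CPU time.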

The customer used this configuration successfully until RHEL 8.4. With the introduction of this patch in 8.4, CPU assignment/scheduling becomes random when a process (runc in this context) enters a cgroup cpuset. Before the patch, runc was always scheduled on the first CPU core of the cgroup cpuset, which worked fine because that core was used by the slow thread running under the SCHED_OTHER policy. Since the kernel patch was introduced, runc is randomly scheduled on a core that may be fully occupied by an RT thread running under the SCHED_FIFO policy. With kernel.sched_rt_runtime_us=-1, that leaves no room for runc to execute and the process gets stuck. When this happens, some other processes were also observed to become unresponsive; so far, systemd (PID 1) was seen stuck in a kernel call to proc_cgroup_show.
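One way to observe where a process lands after joining the cpuset is to read the CPU it last ran on from /proc/<pid>/stat (field 39, `processor`, per proc(5)). The sketch below is only a diagnostic aid and is not taken from the reproducer; `last_cpu_from_stat` and `read_last_cpu` are hypothetical helper names, and the sample line is synthetic.

```python
def last_cpu_from_stat(stat_line: str) -> int:
    """Return field 39 (processor) from the content of /proc/<pid>/stat."""
    # Field 2 (comm) may contain spaces or parentheses, so split after
    # the last ')' instead of splitting the whole line on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 is rest[36].
    return int(rest[36])


def read_last_cpu(pid="self") -> int:
    """Read the CPU that the given PID last ran on (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return last_cpu_from_stat(f.read())


# Synthetic stat line: fields 4..52 simply hold their own field number.
sample = "1234 (runc:[2:INIT]) S " + " ".join(str(n) for n in range(4, 53))
print(last_cpu_from_stat(sample))  # 39
```

Calling read_last_cpu() with the PID of the joining process and comparing the result with the first CPU of the cpuset would show whether the process ended up on a core owned by an RT thread.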

This is a corner case, but it is serious enough to lock up a system.

Steps to reproduce the issue

Please find attached an archive with a reproducer based on vagrant/libvirt.

Decompress the archive and run vagrant up && vagrant halt && vagrant up

Then run a vagrant VM terminal with vagrant ssh and execute:

./reproducer.sh install
./reproducer.sh run 2-3,5

In another vagrant VM terminal, run ./reproducer.sh exec sh. The command should get stuck, and the system with it: you should not be able to open another vagrant terminal with vagrant ssh until the command in the first terminal is interrupted.

If you retry by running ./reproducer.sh run 2-3,5 in the first terminal, but this time run ./reproducer.sh exec-patch sh in the second terminal, the system operates correctly (a PR with the patch is in progress).

cpuset-issue-runc-repro.tar.gz

Describe the results you received and expected

The system gets stuck instead of operating correctly.

What version of runc are you using?

runc 1.0.2 (but doesn't really matter here)

Host OS information

RHEL 8.X

Host kernel information

RHEL 8.X kernels
