rajivramanathan opened this issue 1 year ago
Please provide more information.
A process stuck in D state probably means it is stuck in the kernel, and the kernel usually prints a backtrace to dmesg in such cases. If not, you can force the kernel to do that with echo w > /proc/sysrq-trigger and then look into dmesg.
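Concretely, the sequence looks like this (a sketch; the sysrq write needs root, and kernel.sysrq must allow the 'w' function):

```shell
# 1 means all sysrq functions are enabled; other values are a bitmask
cat /proc/sys/kernel/sysrq

# Ask the kernel to dump a backtrace for every blocked (D state) task
sh -c 'echo w > /proc/sysrq-trigger' 2>/dev/null || echo "sysrq trigger needs root"

# The backtraces land in the kernel ring buffer
dmesg 2>/dev/null | tail -n 50
```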
We rebooted the system already. Will there be anything of value in dmesg after the reboot?
Next time this happens, we can gather and add additional information to this thread.
We believe this is an issue with cgroups v1 and its interaction with runc (we are using cgroups v1 because that's the only version available in our kernel). We are considering upgrading so we can get a more recent kernel. If it happens before that, I can gather this information.
@kolyshkin we hit this issue again over the weekend. By the time we caught it, the dmesg output didn't go back far enough to capture when the process became stuck (the last log lines from the stuck process are from 2023-03-12 23:00:38 UTC). So, as you suggested, I triggered the output with echo w > /proc/sysrq-trigger. I've attached the full dmesg output for completeness: dmesg.txt
FYI, from crictl the stuck process has Container ID a5f9586d743d6 and Pod ID 91568ee08db7b. Grepping for the container ID:
[Mon Mar 13 18:39:35 2023] task PID tree-key switches prio wait-time sum-exec sum-sleep
...
[Mon Mar 13 18:39:38 2023] python3 18944 545873.796616 3380064 120 0.000000 2841473.417271 0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3c19_a21a_4c84_a774_0fbd5a094f77.slice/cri-containerd-a5f9586d743d670db38de6cc055b0a86ff460a11819acce1476e5cec0299d02b.scope
[Mon Mar 13 18:39:39 2023] runc:[2:INIT] 5509 545884.463242 0 120 0.000000 0.000000 0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3c19_a21a_4c84_a774_0fbd5a094f77.slice/cri-containerd-a5f9586d743d670db38de6cc055b0a86ff460a11819acce1476e5cec0299d02b.scope
[Mon Mar 13 18:39:39 2023] runc:[2:INIT] 5510 545884.463242 0 120 0.000000 0.000000 0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3c19_a21a_4c84_a774_0fbd5a094f77.slice/cri-containerd-a5f9586d743d670db38de6cc055b0a86ff460a11819acce1476e5cec0299d02b.scope
[Mon Mar 13 18:39:39 2023] runc:[2:INIT] 5511 545884.463242 0 120 0.000000 0.000000 0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3
And grepping for the pod ID:
[Mon Mar 13 18:39:05 2023] pause 18493 40.771883 24 120 0.000000 34.348688 0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3c19_a21a_4c84_a774_0fbd5a094f77.slice/cri-containerd-91568ee08db7b5aa70c4c63f87e576211708f806a738651e9edb20a3f3a08794.scope
This smells like it might be unrelated to runc. If you run without swap and the node is under memory pressure, Linux starts evicting executables from memory, and every time you execute something it has to be read again from disk, causing a lot of IO.
There are of course lots of other things that can cause this (is this a cloud VM? The disk you use might have IO credits, and if you run out of credits all I/O operations are stalled quite a lot). Is runc the only process in D state? Can other processes do IO on the same partitions that runc does IO on?
I mean, more information showing it is related to runc would help.
I understand your hypothesis but it doesn't seem to match other facts. This is a cloud VM that we have monitoring on. We see that there aren't any anomalous changes in CPU, memory utilization, network traffic, disk latency or disk throughput during this time. These metrics all look consistent with the time that the system/process was healthy. The only thing that we see starting to rise is load. And this particular process NEVER recovers (without rebooting the machine) even after days.
Our hypothesis is that there is some interaction between runc and cgroups that is causing the kernel to freeze up. After the freeze-up happens we cannot even list processes: "ps -ef" will eventually get stuck on this one process (probably then adding to the load as well).
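For what it's worth, ps -ef typically hangs here because it reads /proc/&lt;pid&gt;/cmdline, which can block on the stuck task's mm lock; /proc/&lt;pid&gt;/stat usually avoids that, so a crude loop like this (a diagnostic sketch, not from the thread) can often still enumerate tasks:

```shell
# Print pid, (comm) and state for every task via /proc/<pid>/stat,
# which generally does not touch the stuck task's address space
for d in /proc/[0-9]*; do
    awk '{print $1, $2, $3}' "$d/stat" 2>/dev/null
done
```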
Hi @rajivramanathan, what's your cgroup driver, systemd or cgroupfs? If a process is in D state, maybe it has been frozen; you can check the state in the freezer cgroup.
It would be quite helpful to know where in the kernel it is sleeping by looking at /proc/<runc pid>/stack. You didn't mention what you were doing to trigger the issue -- were you trying to create new containers, or join an existing container?
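To survey the whole box at once, something like this works (a sketch; 8363 is just an example pid from later in this thread, and reading /proc/&lt;pid&gt;/stack needs root):

```shell
# Show every task in uninterruptible sleep plus the kernel symbol it waits in
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'

# Dump the in-kernel stack of a suspect pid (root required)
cat /proc/8363/stack 2>/dev/null || echo "need root, or pid is gone"
```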
Our hypothesis is that there is some interaction between runc and cgroups that is causing the kernel to freeze up. After the freeze-up happens we cannot even list processes: "ps -ef" will eventually get stuck on this one process (probably then adding to the load as well).
Seems like a kernel bug (what you're describing sounds like a deadlock in the kernel). If it is, you might be able to figure out where runc is causing the deadlock by using /proc/<runc pid>/stack, which would be helpful when reporting the issue to Red Hat.
As for the dmesg log you posted, it is missing a bunch of information. You need to run dmesg -w | tee logfile.txt first and then trigger echo w > /proc/sysrq-trigger.
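Put together, a capture session that cannot lose the dump looks roughly like this (a sketch; root is needed for the sysrq write, and possibly for dmesg):

```shell
# Stream the kernel log into a file so the sysrq dump cannot scroll
# out of the ring buffer before it is read
dmesg -w 2>/dev/null | tee logfile.txt &
CAPTURE=$!

# Trigger the blocked-task dump; it lands in logfile.txt
sh -c 'echo w > /proc/sysrq-trigger' 2>/dev/null || echo "sysrq trigger needs root"

sleep 1
kill "$CAPTURE" 2>/dev/null
```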
Hi @jiusanzhou
what's your cgroup driver, systemd or cgroupfs?
systemd
If a process is in D state, maybe it has been frozen; you can check the state in the freezer cgroup.
This is something I'm not familiar with, but here is what I found. First of all, this is the pod that is in this state:
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
b7f82c3a63ecf 5b9af4ce3f6d1 13 days ago Running workend 0 5686c92dd6bb6
I believe this is the associated state in the freezer cgroup:
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/cri-containerd-b7f82c3a63ecf2ea819e2856545c97a285e97d4e1beac5398c7e1708dfc28ec0.scope/freezer.state
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/cri-containerd-5686c92dd6bb6171c6f41f750674606d87966f697e0e9a7ee45fd98c589718b3.scope/freezer.state
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/freezer.state
These all show THAWED:
[ec2-user@ip-10-2-250-205 freezer]$ cat /sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/freezer.state
THAWED
[ec2-user@ip-10-2-250-205 freezer]$ cat /sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/cri-containerd-b7f82c3a63ecf2ea819e2856545c97a285e97d4e1beac5398c7e1708dfc28ec0.scope/freezer.state
THAWED
[ec2-user@ip-10-2-250-205 freezer]$ cat /sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/cri-containerd-5686c92dd6bb6171c6f41f750674606d87966f697e0e9a7ee45fd98c589718b3.scope/freezer.state
THAWED
Hi @cyphar
It would be quite helpful to know where in the kernel it is sleeping by looking at /proc/<runc pid>/stack.
This is what I see:
[ec2-user@ip-10-2-250-205 freezer]$ sudo cat /proc/8363/stack
[<ffffffffb1fc5695>] int_careful+0x14/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff
You didn't mention what you were doing to trigger the issue -- were you trying to create new containers, or join an existing container?
We are not aware of anything that triggers the issue. The container is already running and is expected to continue running.
As for the dmesg log you posted, the version you posted is missing a bunch of information. You need to run dmesg -w | tee logfile.txt first and then trigger echo w > /proc/sysrq-trigger.
Thanks, here's the output: logfile.txt
Prevent runc init from failing to exit and ultimately becoming an orphan process when systemd is busy.
Background of the commit:
In certain scenarios, such as when there are numerous mount points or a large number of containers running on a single server, we've observed that the CPU usage of systemd can surge to nearly 100%. Given that systemd operates on a single thread, this high CPU usage significantly impairs its ability to respond to external requests. Consequently, interactions with systemd become sluggish, affecting the normal exit of runc init, and ultimately leading to orphan processes. To address this, we have enhanced the timeout capability. If there is no response from systemd after 10 seconds during an interaction, it will actively exit and wait for the next interaction with systemd. See issue #3904 for more details.
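As an illustration only (this is not runc code), coreutils timeout shows the behavior the patch aims for, a bounded wait with a distinctive exit status the caller can react to:

```shell
# Wrap a call that may hang (a stand-in for a request to a busy systemd)
# in a deadline; simulated here with a 1-second limit on `sleep 5`
timeout 1 sleep 5
echo "exit status: $?"   # prints "exit status: 124": the deadline expired, retry later
```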
Signed-off-by: Zhou Yongzhe 113xiaoji@163.com
@113xiaoji not sure I follow, that seems like c&p from a commit message, but where is the actual commit diff? I don't see any PRs opened either, am I missing something?
It's recommended to upgrade to cgroup v2 and a kernel version of 5.14 or above. This completely avoids the freezing issue, thus preventing the D (uninterruptible sleep) state.
I think #3931 will fix it, combined with upgrading to cgroup v2 and a kernel of 5.14 or above.
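To check which cgroup version a host actually runs before and after such an upgrade, the filesystem type mounted at /sys/fs/cgroup is enough (a quick check, not from the thread):

```shell
# "cgroup2fs" means the unified cgroup v2 hierarchy;
# "tmpfs" means the legacy cgroup v1 layout with per-controller mounts
stat -fc %T /sys/fs/cgroup/
```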
Our situation is similar:
Kubernetes node CPU load is around 400 because of IO wait. We then delete the pod with --force --grace-period=0.
At first we found some errors in the kubelet log:
...
Mar 15 15:20:38 xc-dev-poc-node02 kubelet[1602173]: time="2024-03-15T15:20:38+08:00" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf5ff35fa_21c8_4edc_8b15_754f33badc2d.slice/cri-containerd-164ee23ec75e1ff4e6261d43dc4303421cd2a3c05c4c6cb6bb2cab8a3e4293b9.scope: device or resource busy"
...
At last we found runc init in D status. Here is the related dmesg:
[10878935.172430] sysrq: Show Blocked State
[10878935.172674] task PC stack pid father
[10878935.178568] runc:[2:INIT] D 0 2834345 1 0x90000084
[10878935.178573] Call Trace:
[10878935.178586] ? __schedule+0x296/0x910
[10878935.178588] ? schedule_timeout+0x1ee/0x3a0
[10878935.178590] schedule+0x28/0x80
[10878935.178594] __refrigerator+0x44/0x160
[10878935.178599] unix_stream_read_generic+0x84c/0x8b0
[10878935.178603] ? finish_wait+0x80/0x80
[10878935.178604] unix_stream_recvmsg+0x53/0x70
[10878935.178605] ? unix_set_peek_off+0x50/0x50
[10878935.178608] sock_read_iter+0x92/0xf0
[10878935.178613] new_sync_read+0xfa/0x160
[10878935.178616] vfs_read+0x89/0x130
[10878935.178618] ksys_read+0x5a/0xd0
[10878935.178622] do_syscall_64+0x5b/0x1d0
[10878935.178624] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10878935.178627] RIP: 0033:0x4047ce
[10878935.178634] Code: Bad RIP value.
[10878935.178635] RSP: 002b:000000c00018f328 EFLAGS: 00000206 ORIG_RAX: 0000000000000000
[10878935.178637] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004047ce
[10878935.178638] RDX: 0000000000000200 RSI: 000000c00019c400 RDI: 0000000000000003
[10878935.178638] RBP: 000000c00018f368 R08: 0000000000000000 R09: 0000000000000000
[10878935.178639] R10: 0000000000000000 R11: 0000000000000206 R12: 000000c00018f4a8
[10878935.178640] R13: 0000000000000000 R14: 000000c0000061a0 R15: 0000000000d36120
[10878935.178658] runc:[2:INIT] D 0 2834350 1 0x10000084
[10878935.178659] Call Trace:
[10878935.178661] ? __schedule+0x296/0x910
[10878935.178663] schedule+0x28/0x80
[10878935.178665] __refrigerator+0x44/0x160
[10878935.178668] futex_wait_queue_me+0x114/0x120
[10878935.178670] futex_wait+0xeb/0x240
[10878935.178671] ? __switch_to_asm+0x41/0x70
[10878935.178673] ? __hrtimer_init+0xb0/0xb0
[10878935.178675] do_futex+0x367/0x5c0
[10878935.178677] ? hrtimer_nanosleep+0xd0/0x1c0
[10878935.178678] __x64_sys_futex+0x13b/0x190
[10878935.178680] do_syscall_64+0x5b/0x1d0
[10878935.178681] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10878935.178682] RIP: 0033:0x46b5e3
[10878935.178684] Code: Bad RIP value.
[10878935.178684] RSP: 002b:00007f9a456ac120 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[10878935.178686] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000046b5e3
[10878935.178686] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000d36620
[10878935.178687] RBP: 00007f9a456ac168 R08: 0000000000000000 R09: 0000000000000000
[10878935.178688] R10: 00007f9a456ac158 R11: 0000000000000246 R12: 00007f9a456ac158
[10878935.178689] R13: 00007fff1eb7fa3f R14: 000000c0000069c0 R15: 0000000000802000
[10878935.178698] runc:[2:INIT] D 0 2834351 1 0x10000084
[10878935.178699] Call Trace:
[10878935.178702] ? __schedule+0x296/0x910
[10878935.178704] schedule+0x28/0x80
runc version:
runc version 1.1.6
commit: v1.1.6-0-g0f48801a
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4
kernel version:
4.19.90-24.4.v2101.ky10.x86_64
BTW, this issue was fixed for us based on @113xiaoji's suggestion. We upgraded the kernel and cgroups and haven't seen this issue again since.
We are seeing processes stuck behind runc init. cat /proc/<pid>/status shows the process to be in uninterruptible sleep, and the load on the machine keeps rising behind this stuck process. It stays stuck forever (we've seen this for days) and the only way to recover is to reboot the machine. Are there compatibility issues between runc v1.1.4 and RHEL 7.9 (kernel version: 3.10.0-1160.83.1.el7.x86_64)? We have not seen this issue on our Ubuntu 20.04+ instances.
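For reference, the state is visible directly in /proc (shown here against the current shell's own pid; a stuck runc init would report "D (disk sleep)" instead of R or S):

```shell
# Name and scheduler state of a process;
# R = running, S = interruptible sleep, D = uninterruptible sleep
grep -E '^(Name|State):' /proc/$$/status
```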