opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

runc:[2:INIT] stuck - status D (disk sleep) #3759

rajivramanathan opened this issue 1 year ago

rajivramanathan commented 1 year ago

We are seeing processes stuck behind runc init. cat /proc/<pid>/status shows the process in uninterruptible sleep, and the load on the machines keeps rising behind this stuck process. It stays stuck forever (we've seen this for days), and the only way to recover is to reboot the machine.

Are there compatibility issues between runc v1.1.4 and RHEL 7.9 (kernel version: 3.10.0-1160.83.1.el7.x86_64)? We have not seen this issue on our Ubuntu 20.04+ instances.

$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.9 (Maipo)

$ uname -r 
3.10.0-1160.83.1.el7.x86_64
$ runc --version 
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d1
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.4
$ cat /proc/31369/status
Name:   runc:[2:INIT]
Umask:  0022
State:  D (disk sleep)
Tgid:   31369
Ngid:   0
Pid:    31369
PPid:   31343
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 64
Groups: 
.... <snip> 
kolyshkin commented 1 year ago

Please provide more information.

A process stuck in D state probably means it is stuck in the kernel, and the kernel usually prints a backtrace to dmesg in such cases. If not, you can force the kernel to do that by running echo w > /proc/sysrq-trigger and then looking into dmesg.
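
One way to capture this, combining the suggestions in this thread (run as root; logfile.txt is just an example name):

# terminal 1: follow the kernel log and keep a copy
$ dmesg -w | tee logfile.txt
# terminal 2: ask the kernel to dump stack traces of all blocked (D-state) tasks
$ echo w > /proc/sysrq-trigger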

rajivramanathan commented 1 year ago

We rebooted the system already. Will there be anything of value in dmesg after the reboot?

Next time this happens, we can gather and add additional information to this thread.

We believe this is an issue with cgroup v1 and its interaction with runc (we are using cgroup v1 because that's the only one available in this kernel). We are considering upgrading so we can get a more recent kernel. If it happens before that, I can get this information.

jeff-lightup commented 1 year ago

@kolyshkin we hit this issue again over the weekend. By the time we caught it, the dmesg output didn't go back far enough to capture when the process became stuck (the last log lines from the stuck process are from 2023-03-12 23:00:38 UTC). So, as you suggested, I triggered the output with echo w > /proc/sysrq-trigger. I've attached the full dmesg output for completeness: dmesg.txt

FYI, from crictl, the process that is stuck has container ID a5f9586d743d6 and pod ID 91568ee08db7b. Grepping for the container ID:

[Mon Mar 13 18:39:35 2023]             task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep

...

[Mon Mar 13 18:39:38 2023]          python3 18944    545873.796616   3380064   120         0.000000   2841473.417271         0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3c19_a21a_4c84_a774_0fbd5a094f77.slice/cri-containerd-a5f9586d743d670db38de6cc055b0a86ff460a11819acce1476e5cec0299d02b.scope
[Mon Mar 13 18:39:39 2023]    runc:[2:INIT]  5509    545884.463242         0   120         0.000000         0.000000         0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3c19_a21a_4c84_a774_0fbd5a094f77.slice/cri-containerd-a5f9586d743d670db38de6cc055b0a86ff460a11819acce1476e5cec0299d02b.scope
[Mon Mar 13 18:39:39 2023]    runc:[2:INIT]  5510    545884.463242         0   120         0.000000         0.000000         0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3c19_a21a_4c84_a774_0fbd5a094f77.slice/cri-containerd-a5f9586d743d670db38de6cc055b0a86ff460a11819acce1476e5cec0299d02b.scope
[Mon Mar 13 18:39:39 2023]    runc:[2:INIT]  5511    545884.463242         0   120         0.000000         0.000000         0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3

And grepping for the pod ID:

[Mon Mar 13 18:39:05 2023]            pause 18493        40.771883        24   120         0.000000        34.348688         0.000000 0 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod153b3c19_a21a_4c84_a774_0fbd5a094f77.slice/cri-containerd-91568ee08db7b5aa70c4c63f87e576211708f806a738651e9edb20a3f3a08794.scope
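
For reference, a rough sketch of this kind of lookup (placeholder IDs; assumes crictl is available and the sysrq 'w' dump has already landed in the kernel log):

# list containers to find the container ID and pod (sandbox) ID
$ crictl ps
# search the blocked-task dump for the container's cgroup scope
$ dmesg | grep <container-id>
# and for the pod sandbox
$ dmesg | grep <pod-id>
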
rata commented 1 year ago

This smells like it might be unrelated to runc. If you run without swap and the node is under memory pressure, Linux starts evicting executable pages from memory, and every time you execute something they have to be read back from disk, causing a lot of IO.

There are of course a lot of other things that can cause this (is this a cloud VM? The disk you use might have IO credits, and if you run out of credits all I/O operations stall for quite a while). Is runc the only process in D state? Can other processes do IO on the same partitions that runc does IO on?

I mean, more information showing that this is actually related to runc would help.
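
A quick way to answer the D-state question (a generic sketch, not specific to this setup):

# list every process currently in uninterruptible sleep, plus the kernel function it is waiting in
$ ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'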

rajivramanathan commented 1 year ago

I understand your hypothesis, but it doesn't seem to match the other facts. This is a cloud VM that we have monitoring on. We see no anomalous changes in CPU, memory utilization, network traffic, disk latency or disk throughput during this time; these metrics all look consistent with when the system/process was healthy. The only thing we see starting to rise is load. And this particular process NEVER recovers (without rebooting the machine), even after days.

Our hypothesis is that there is some interaction between runc and cgroups that is causing the kernel to freeze up. After the freeze-up happens we cannot even list processes: ps -ef eventually gets stuck on this one process (probably then adding to the load as well).

jiusanzhou commented 1 year ago

Hi @rajivramanathan, what's your cgroup driver, systemd or cgroupfs? If a process is in D state, maybe it has been frozen; you can check its state in the freezer cgroup.
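
A sketch of one way to check that on cgroup v1 (replace <pid> with the stuck runc:[2:INIT] pid; assumes the freezer controller is mounted in the usual place):

# find which freezer cgroup the process belongs to
$ grep freezer /proc/<pid>/cgroup
# then inspect its state (THAWED, FREEZING or FROZEN)
$ cat /sys/fs/cgroup/freezer/<path-from-above>/freezer.state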

cyphar commented 1 year ago

It would be quite helpful to know where in the kernel it is sleeping by looking at /proc/<runc pid>/stack. You didn't mention what you were doing to trigger the issue -- were you trying to create new containers, or join an existing container?

Our hypothesis is that there is some interaction between runc and cgroups that is causing the kernel to freeze up. After the freeze-up happens we cannot even list processes: ps -ef eventually gets stuck on this one process (probably then adding to the load as well).

Seems like a kernel bug. If it is, you might be able to figure out where runc is triggering the deadlock by looking at /proc/<runc pid>/stack (what you're describing sounds like a deadlock in the kernel), which would be helpful when reporting the issue to Red Hat.

As for the dmesg log, the version you posted is missing a bunch of information. You need to run dmesg -w | tee logfile.txt first and then trigger echo w > /proc/sysrq-trigger.

jeff-lightup commented 1 year ago

Hi @jiusanzhou

what's your cgroup driver, systemd or cgroupfs?

systemd

If a process is in D state, maybe it has been frozen; you can check its state in the freezer cgroup.

This is something I'm not familiar with, but here is what I found. First of all, this is the pod that is in this state:

CONTAINER           IMAGE               CREATED             STATE               NAME                       ATTEMPT             POD ID
b7f82c3a63ecf       5b9af4ce3f6d1       13 days ago         Running             workend                    0                   5686c92dd6bb6

I believe this is the associated state in the freezer cgroup?

/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/cri-containerd-b7f82c3a63ecf2ea819e2856545c97a285e97d4e1beac5398c7e1708dfc28ec0.scope/freezer.state
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/cri-containerd-5686c92dd6bb6171c6f41f750674606d87966f697e0e9a7ee45fd98c589718b3.scope/freezer.state
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/freezer.state

These all show THAWED:

[ec2-user@ip-10-2-250-205 freezer]$ cat /sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/freezer.state
THAWED
[ec2-user@ip-10-2-250-205 freezer]$ cat /sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/cri-containerd-b7f82c3a63ecf2ea819e2856545c97a285e97d4e1beac5398c7e1708dfc28ec0.scope/freezer.state
THAWED
[ec2-user@ip-10-2-250-205 freezer]$ cat /sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod901893d8_a13f_4a85_b02c_8b8935aa776f.slice/cri-containerd-5686c92dd6bb6171c6f41f750674606d87966f697e0e9a7ee45fd98c589718b3.scope/freezer.state
THAWED
jeff-lightup commented 1 year ago

Hi @cyphar

It would be quite helpful to know where in the kernel it is sleeping by looking at /proc/<runc pid>/stack.

This is what I see:

[ec2-user@ip-10-2-250-205 freezer]$ sudo cat /proc/8363/stack
[<ffffffffb1fc5695>] int_careful+0x14/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

You didn't mention what you were doing to trigger the issue -- were you trying to create new containers, or join an existing container?

We are not aware of anything that triggers the issue. The container is already running and is expected to continue running.

As for the dmesg log, the version you posted is missing a bunch of information. You need to run dmesg -w | tee logfile.txt first and then trigger echo w > /proc/sysrq-trigger.

Thanks, here's the output: logfile.txt

113xiaoji commented 1 year ago

I can consistently reproduce 'runc init' entering the D state. Please refer to issue #3904 for the reproduction method.

113xiaoji commented 1 year ago

Prevent runc init from failing to exit and ultimately becoming an orphan process when systemd is busy.

The introduction of the commit

In certain scenarios, such as when there are numerous mount points or a large number of containers running on a single server, we've observed that the CPU usage of systemd can surge to nearly 100%. Given that systemd operates on a single thread, this high CPU usage significantly impairs its ability to respond to external requests. Consequently, interactions with systemd become sluggish, affecting the normal exit of runc init, and ultimately leading to orphan processes. To address this, we have enhanced the timeout capability. If there is no response from systemd after 10 seconds during an interaction, it will actively exit and wait for the next interaction with systemd. See issue #3904 for more details.

Signed-off-by: Zhou Yongzhe 113xiaoji@163.com

rata commented 1 year ago

@113xiaoji not sure I follow, that seems like c&p from a commit message, but where is the actual commit diff? I don't see any PRs opened either, am I missing something?

113xiaoji commented 1 year ago

It's recommended to upgrade to cgroup v2 and a kernel version of 5.14 or above. This would completely avoid the freezing issues, thus preventing the D (uninterruptible sleep) state.
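
If you are unsure what a given node is running, a quick generic check:

# prints cgroup2fs on a cgroup v2 (unified) host, tmpfs on a cgroup v1 host
$ stat -fc %T /sys/fs/cgroup/
# kernel version
$ uname -r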

113xiaoji commented 11 months ago

I think #3931 will fix it, together with moving to cgroup v2 and a kernel version of 5.14 or above.

DesireWithin commented 4 months ago

Our situation is similar: the Kubernetes node CPU load is around 400 because of iowait. We then deleted the pod with --force --grace-period=0. At first we found some errors in the kubelet log:

...
Mar 15 15:20:38 xc-dev-poc-node02 kubelet[1602173]: time="2024-03-15T15:20:38+08:00" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf5ff35fa_21c8_4edc_8b15_754f33badc2d.slice/cri-containerd-164ee23ec75e1ff4e6261d43dc4303421cd2a3c05c4c6cb6bb2cab8a3e4293b9.scope: device or resource busy"
...

In the end we found runc init in D status. Here is the related dmesg:

[10878935.172430] sysrq: Show Blocked State
[10878935.172674]   task                        PC stack   pid father
[10878935.178568] runc:[2:INIT]   D    0 2834345      1 0x90000084
[10878935.178573] Call Trace:
[10878935.178586]  ? __schedule+0x296/0x910
[10878935.178588]  ? schedule_timeout+0x1ee/0x3a0
[10878935.178590]  schedule+0x28/0x80
[10878935.178594]  __refrigerator+0x44/0x160
[10878935.178599]  unix_stream_read_generic+0x84c/0x8b0
[10878935.178603]  ? finish_wait+0x80/0x80
[10878935.178604]  unix_stream_recvmsg+0x53/0x70
[10878935.178605]  ? unix_set_peek_off+0x50/0x50
[10878935.178608]  sock_read_iter+0x92/0xf0
[10878935.178613]  new_sync_read+0xfa/0x160
[10878935.178616]  vfs_read+0x89/0x130
[10878935.178618]  ksys_read+0x5a/0xd0
[10878935.178622]  do_syscall_64+0x5b/0x1d0
[10878935.178624]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10878935.178627] RIP: 0033:0x4047ce
[10878935.178634] Code: Bad RIP value.
[10878935.178635] RSP: 002b:000000c00018f328 EFLAGS: 00000206 ORIG_RAX: 0000000000000000
[10878935.178637] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004047ce
[10878935.178638] RDX: 0000000000000200 RSI: 000000c00019c400 RDI: 0000000000000003
[10878935.178638] RBP: 000000c00018f368 R08: 0000000000000000 R09: 0000000000000000
[10878935.178639] R10: 0000000000000000 R11: 0000000000000206 R12: 000000c00018f4a8
[10878935.178640] R13: 0000000000000000 R14: 000000c0000061a0 R15: 0000000000d36120
[10878935.178658] runc:[2:INIT]   D    0 2834350      1 0x10000084
[10878935.178659] Call Trace:
[10878935.178661]  ? __schedule+0x296/0x910
[10878935.178663]  schedule+0x28/0x80
[10878935.178665]  __refrigerator+0x44/0x160
[10878935.178668]  futex_wait_queue_me+0x114/0x120
[10878935.178670]  futex_wait+0xeb/0x240
[10878935.178671]  ? __switch_to_asm+0x41/0x70
[10878935.178673]  ? __hrtimer_init+0xb0/0xb0
[10878935.178675]  do_futex+0x367/0x5c0
[10878935.178677]  ? hrtimer_nanosleep+0xd0/0x1c0
[10878935.178678]  __x64_sys_futex+0x13b/0x190
[10878935.178680]  do_syscall_64+0x5b/0x1d0
[10878935.178681]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10878935.178682] RIP: 0033:0x46b5e3
[10878935.178684] Code: Bad RIP value.
[10878935.178684] RSP: 002b:00007f9a456ac120 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[10878935.178686] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000046b5e3
[10878935.178686] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000d36620
[10878935.178687] RBP: 00007f9a456ac168 R08: 0000000000000000 R09: 0000000000000000
[10878935.178688] R10: 00007f9a456ac158 R11: 0000000000000246 R12: 00007f9a456ac158
[10878935.178689] R13: 00007fff1eb7fa3f R14: 000000c0000069c0 R15: 0000000000802000
[10878935.178698] runc:[2:INIT]   D    0 2834351      1 0x10000084
[10878935.178699] Call Trace:
[10878935.178702]  ? __schedule+0x296/0x910
[10878935.178704]  schedule+0x28/0x80

runc version:

runc version 1.1.6
commit: v1.1.6-0-g0f48801a
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4

kernel version: 4.19.90-24.4.v2101.ky10.x86_64
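
For what it's worth, when rmdir of a cgroup fails with "device or resource busy" as in the kubelet log above, the usual check is to see which PIDs are still attached to it and what state they are in (a sketch; substitute the real scope directory for the shortened path):

# the cgroup that could not be removed (path shortened here)
$ scope=/sys/fs/cgroup/memory/kubepods.slice/.../cri-containerd-<id>.scope
# PIDs still attached to it
$ cat "$scope/cgroup.procs"
# state of each remaining PID (STAT D = uninterruptible sleep)
$ for p in $(cat "$scope/cgroup.procs"); do ps -o pid,stat,comm -p "$p"; done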

rajivramanathan commented 4 months ago

BTW, this issue was fixed for us based on @113xiaoji's suggestion. We upgraded the kernel and cgroups and didn't see the issue again after that.