runc 1.1.15 OOMs in Kubernetes e2e tests with containerd, cgroup v2, and cgroupfs driver

samuelkarp commented 1 month ago

Description

In https://github.com/containerd/containerd/pull/10795 we're attempting to update containerd 1.6 to use runc 1.1.15. This triggers a Kubernetes e2e CI run, which is consistently failing. The same update to runc 1.1.15 succeeds with containerd 1.7 and main (which will be 2.0).

Steps to reproduce the issue

We're currently attempting to narrow down differences between the test environments used in containerd's various branches, since it seems unlikely the actual containerd version will be the cause of the difference. So far, containerd 1.6's tests are using the cgroupfs driver for managing cgroups as opposed to systemd, while the 1.7 and main branch tests use systemd to manage cgroups.

All tests are running on COS M117 and appear to be using cgroup v2 as of now.

Describe the results you received and expected

Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: runc:[2:INIT] invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-998
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: CPU: 0 PID: 52432 Comm: runc:[2:INIT] Tainted: G           O       6.6.44+ #1
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: Call Trace:
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  <TASK>
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  dump_stack_lvl+0x5d/0x80
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  dump_header+0x52/0x250
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  oom_kill_process+0x10a/0x220
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  out_of_memory+0x3bc/0x5a0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  ? mem_cgroup_iter+0x1ca/0x240
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  try_charge_memcg+0x82a/0xa90
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  charge_memcg+0x3f/0x1f0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  __mem_cgroup_charge+0x2f/0x80
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  do_pte_missing+0x544/0xc40
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  handle_mm_fault+0x7a6/0xa60
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  do_user_addr_fault+0x21b/0x770
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  exc_page_fault+0x7f/0x100
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  asm_exc_page_fault+0x26/0x30
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: RIP: 0033:0x7966028a541e
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: Code: 0c 20 4c 8b 6d 98 49 39 d3 49 89 4b 60 0f 95 c2 48 83 c8 01 49 83 c0 10 0f b6 d2 48 c1 e2 02 4c 09 e2 48 83 ca 01 49 89 50 f8 <48> 89 41 08 e9 16 fb ff ff 48 8d 0d f2 ed 0f 00 ba 13 10 00 00 48
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: RSP: 002b:00007ffffc0bb770 EFLAGS: 00010202
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: RAX: 0000000000004f21 RBX: 00007966029d2b00 RCX: 00005d2a23d5a0e0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: RDX: 00000000000000f1 RSI: 0000000000000000 RDI: 0000000000000004
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: RBP: 00007ffffc0bb830 R08: 00005d2a23d5a000 R09: 00005d2a23d59ed0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: R10: 00000000000000f0 R11: 00007966029d2aa0 R12: 00000000000000f0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: R13: 00000000000000e0 R14: 000000000000000f R15: ffffffffffffffb8
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel:  </TASK>
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: memory: usage 15360kB, limit 15360kB, failcnt 49
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: Memory cgroup stats for /kubepods/burstable/pod6e4665e8-ae02-48b1-aee9-fedeeb899d20:
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: anon 3665920
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: file 11612160
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: kernel 442368
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: kernel_stack 98304
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pagetables 147456
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: sec_pagetables 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: percpu 6944
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: sock 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: vmalloc 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: shmem 11612160
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: file_mapped 4268032
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: file_dirty 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: file_writeback 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: swapcached 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: anon_thp 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: file_thp 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: shmem_thp 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: inactive_anon 15089664
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: active_anon 159744
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: inactive_file 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: active_file 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: unevictable 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: slab_reclaimable 71080
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: slab_unreclaimable 89832
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: slab 160912
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: workingset_refault_anon 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: workingset_refault_file 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: workingset_activate_anon 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: workingset_activate_file 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: workingset_restore_anon 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: workingset_restore_file 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: workingset_nodereclaim 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgscan 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgsteal 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgscan_kswapd 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgscan_direct 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgscan_khugepaged 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgsteal_kswapd 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgsteal_direct 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgsteal_khugepaged 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgfault 1252
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgmajfault 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgrefill 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgactivate 38
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pgdeactivate 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pglazyfree 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: pglazyfreed 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: thp_fault_alloc 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: thp_collapse_alloc 0
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: Tasks state (memory values in pages):
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: [  52432] 65535 52432   401515     2634   163840        0          -998 runc:[2:INIT]
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=0eec9f496bc31187cd00a14d2debd8b4bb89cc81bfc2301ee425e30c07572398,mems_allowed=0,oom_memcg=/kubepods/burstable/pod6e4665e8-ae02-48b1-aee9-fedeeb899d20,task_memcg=/kubepods/burstable/pod6e4665e8-ae02-48b1-aee9-fedeeb899d20/0eec9f496bc31187cd00a14d2debd8b4bb89cc81bfc2301ee425e30c07572398,task=runc:[2:INIT],pid=52432,uid=65535
Oct 08 17:28:42 tmp-node-e2e-7aba7327-cos-beta-117-18613-0-76 kernel: Memory cgroup out of memory: Killed process 52432 (runc:[2:INIT]) total-vm:1606060kB, anon-rss:3496kB, file-rss:1152kB, shmem-rss:5888kB, UID:65535 pgtables:160kB oom_score_adj:-998

What version of runc are you using?

1.1.15

Host OS information

No response

Host kernel information

No response

cyphar commented 1 month ago

I'm not sure why it works with containerd 1.7, but this OOM issue with Kubernetes e2e tests is something we know about.

This is caused by #4392 (backport of #3931) which removed the code we added in #1984 to fix the same failing Kubernetes e2e test (reported in #1980). This code was removed because the bindfd logic is not as secure as using memfds and was causing significant performance problems on systems with lots of container churn. When I re-investigated the e2e issue (when working on #3931) I figured out that the OOM is happening because Kubernetes/kubelet/runc(?) is being run in a cgroup with a very low memory limit (this is not caused by the container memory limit as incorrectly claimed in #1980 because the extra memory used by memfds is never accounted by the container cgroup).

My suggestion is to change the Kubernetes e2e test to run Kubernetes in a cgroup with a more reasonable memory limit. I'm not quite sure what the purpose of a test where you run the entire Kubernetes/kubelet stack with such a tiny memory limit is -- sure, containers with a small memory limit makes some sense, but why run the management code in such a constrained setup?

samuelkarp commented 1 month ago

Do you have a link to where containerd or runc is being constrained to a low cgroup memory limit? containerd is run as a normal systemd unit with no memory limit and is itself responsible for directly launching runc (so runc should start inside containerd's own cgroup).

cyphar commented 1 month ago

I'm not super familiar with Kubernetes e2e tests, but I believe KubeReservedMemory is setting the limits of the cgroup that containerd runs in (or something equivalent). See runOomKillerTest:

func runOomKillerTest(f *framework.Framework, testCase testCase, kubeReservedMemory float64) {
    ginkgo.Context(testCase.name, func() {
        // Update KubeReservedMemory in KubeletConfig.
        if kubeReservedMemory > 0 {
            tempSetCurrentKubeletConfig(f, func(ctx context.Context, initialConfig *kubeletconfig.KubeletConfiguration) {
                if initialConfig.KubeReserved == nil {
                    initialConfig.KubeReserved = map[string]string{}
                }
                // There's a race condition observed between system OOM and cgroup OOM if node alocatable
                // memory is equal to node capacity. Hence, reserving a fraction of node's memory capacity for
                // K8s components such that node allocatable memory is less than node capacity to
                // observe OOM kills at cgroup level instead of system OOM kills.
                initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
            })
        }
/* ... */
}

The issue here is not the container memory limit, because we know that cannot cause this error (the kernel allocates memory against memcgs based on the cgroup you were in when the memory was allocated -- our memfds are allocated and filled while in the host cgroups -- and the kernel will never migrate memory accounting on cgroupv2 even if you change cgrouops, and even with cgroupv1 you need to enable an option that is disabled on all systems by default). This was not obvious in #1980 but upon re-examination in #3931 it became clear that this is about the host memory limits, not the container ones.

samuelkarp commented 1 month ago

I believe KubeReservedMemory is setting the limits of the cgroup that containerd runs in (or something equivalent).

I don't believe that's correct. My understanding is that KubeReservedMemory controls the amount of advertised Memory resource to the Kubernetes scheduler, not that it translates to any cgroup setting on the node itself. @yujuhong, is that correct?

yujuhong commented 1 month ago

Yes, kube-reserved memory was used to calculate the node allocatable resources. Kubelet doesn't actually set the cgroup values for it. (BTW, for compressible resources like cpu, this PR was merged recently for kubelet to enforce the cgroup values if the cgroup path is provided: https://github.com/kubernetes/kubernetes/pull/125982)

samuelkarp commented 1 month ago

The issue here is not the container memory limit

Yes, I saw your explanation before and that makes sense. I have also validated that our CI hosts are running cgroup v2. But I don't think we understand why it is failing; as far as I can tell there are no constraints on the cgroup runc is running in.

rata commented 1 month ago

It's EOD here and I'm just getting some progress with this. I'll expand more tomorrow, including the repro I found.

It's still not 100% clear to me what is causing this. I can fix the k8s tests locally with this patch to the k8s repo (github.com/kubernetes/kubernetes):

diff --git test/e2e_node/oomkiller_linux_test.go test/e2e_node/oomkiller_linux_test.go
index db84cb0f0cf..ebc99abfa19 100644
--- test/e2e_node/oomkiller_linux_test.go
+++ test/e2e_node/oomkiller_linux_test.go
@@ -108,7 +108,8 @@ func runOomKillerTest(f *framework.Framework, testCase testCase, kubeReservedMem
                                // memory is equal to node capacity. Hence, reserving a fraction of node's memory capacity for
                                // K8s components such that node allocatable memory is less than node capacity to
                                // observe OOM kills at cgroup level instead of system OOM kills.
-                               initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
+                               //initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
+                               initialConfig.KubeReserved["memory"] = "100Mi"
                        })
                }

I can also make the tests pass changing the memory limits of them from 15Mi to 20Mi. I'm not sure why is that. I guess that the kubelet makes sure that the cgroup it is in has at least the mem limit and that is why that approach also works.

However, it is not 100% clear to me yet what the issue is nor what the right fix is.

I've also isolated the config.json created for the container (k8s uses two containers, the pause container and the real container, I'm just using the real container as the pause container doesn't have any limits nor anything interesting in the config) and then run it with runc. However, the container runs just fine when runnning it with runc.

Due to that, I suspect the issue is as @cyphar was saying, something about the setup that allocates too little mem for the KubeReserved. If I allocate a static "100Mi" for it, I see it pass locally, and if I allocate too litle (like 10Mi), I see it fail for all tests.

Please note, though, that when it fails it shows the same error than in the containerd issue comment:

  [FAILED] pod: "initcontinar-oomkill-target-pod", container: "oomkill-target-init-container" has unexpected exitCode: '\u0080'
  Expected
      <int32>: 128
  to equal
      <int32>: 137
  In [It] at: k8s.io/kubernetes/test/e2e_node/oomkiller_linux_test.go:152 @ 10/09/24 18:42:39.812

Therefore, what is failing in the test is the exit code. It is expecting 137 but gets 128. When the kernel kills runc[init] it seems it gets 128, but when it kills the "dd" process running in the container, it gets the right exit code.

Indeed, it seems the two test that fail locally are because runc[INIT] is being killed by a OOM in a cgroup. It correlates perfectly with what I see in the kernel logs locally:

# First test gets OOM on the runc process

[Fri Oct 11 02:51:37 2024] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=8a6434811b9e6f4d415ea9e1cd3e66898d1257cc8947c67054c3092c1216df9d,mems_allowed=0,oom_memcg=/kubepods/burstable/poda6a4366a-76d7-46a0-b52e-c96317070a0e,task_memcg=/kubepods/burstable/poda6a4366a-76d7-46a0-b52e-c96317070a0e/8a6434811b9e6f4d415ea9e1cd3e66898d1257cc8947c67054c3092c1216df9d,task=runc:[2:INIT],pid=554398,uid=999
[Fri Oct 11 02:51:37 2024] Memory cgroup out of memory: Killed process 554398 (runc:[2:INIT]) total-vm:1605432kB, anon-rss:1744kB, file-rss:944kB, shmem-rss:4116kB, UID:999 pgtables:168kB oom_score_adj:-997
[Fri Oct 11 02:51:37 2024] Tasks in /kubepods/burstable/poda6a4366a-76d7-46a0-b52e-c96317070a0e/8a6434811b9e6f4d415ea9e1cd3e66898d1257cc8947c67054c3092c1216df9d are going to be killed due to memory.oom.group set
[Fri Oct 11 02:51:37 2024] Memory cgroup out of memory: Killed process 554398 (runc:[2:INIT]) total-vm:1605432kB, anon-rss:1744kB, file-rss:944kB, shmem-rss:4116kB, UID:999 pgtables:168kB oom_score_adj:-997
[Fri Oct 11 02:51:37 2024] runc:[2:INIT] invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997

# Second tests gets OOM on the runc process
[Fri Oct 11 02:51:37 2024] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=379e12af22e9387021abbb33d87ffd37862b563f7849158ddb64fd2b3b380256,mems_allowed=0,oom_memcg=/kubepods/burstable/podffbb6b04-1c85-4039-8804-a8b793230f23,task_memcg=/kubepods/burstable/podffbb6b04-1c85-4039-8804-a8b793230f23/379e12af22e9387021abbb33d87ffd37862b563f7849158ddb64fd2b3b380256,task=runc:[2:INIT],pid=554461,uid=999
[Fri Oct 11 02:51:37 2024] Memory cgroup out of memory: Killed process 554461 (runc:[2:INIT]) total-vm:1605944kB, anon-rss:2348kB, file-rss:912kB, shmem-rss:3432kB, UID:999 pgtables:168kB oom_score_adj:-997
[Fri Oct 11 02:51:37 2024] Tasks in /kubepods/burstable/podffbb6b04-1c85-4039-8804-a8b793230f23/379e12af22e9387021abbb33d87ffd37862b563f7849158ddb64fd2b3b380256 are going to be killed due to memory.oom.group set
[Fri Oct 11 02:51:37 2024] Memory cgroup out of memory: Killed process 554461 (runc:[2:INIT]) total-vm:1605944kB, anon-rss:2348kB, file-rss:912kB, shmem-rss:3432kB, UID:999 pgtables:168kB oom_score_adj:-997

So, two times the runc init has been killed by the cgroup OOM, the two tests that fail. The other OOM I see in my logs is for the "dd" process (that is the one allocating memory) and that is passing (killing dd with oom gets 137 and not 128 exit code, that makes the test pass).

So, it seems the k8s test is setting-up a cgroup with very low mem for runc to run, this is probably due to the KubeReserved thingy, as changing that to a higher number makes the tests pass just fine here.

The current allocation is a percentage of the node capacity, as @cyphar pointed out: https://github.com/kubernetes/kubernetes/blob/95ec69c16c76b5ee71fdbebc7a5dea2c39341eb3/test/e2e_node/oomkiller_linux_test.go#L111:

        if kubeReservedMemory > 0 {
            tempSetCurrentKubeletConfig(f, func(ctx context.Context, initialConfig *kubeletconfig.KubeletConfiguration) {
                if initialConfig.KubeReserved == nil {
                    initialConfig.KubeReserved = map[string]string{}
                }
                // There's a race condition observed between system OOM and cgroup OOM if node alocatable
                // memory is equal to node capacity. Hence, reserving a fraction of node's memory capacity for
                // K8s components such that node allocatable memory is less than node capacity to
                // observe OOM kills at cgroup level instead of system OOM kills.
                initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
            })
        }

My gut feeling is that the VMs running the CI for containerd 1.6 have less mem than the ones for 1.7 and main, and this calculation gets too low with that VM and the new runc binary is slightly bigger and makes it fail when it seals its on-memory copy of the binary (to fix a CVE, we really want that :)).

@samuelkarp some questions for you:

Can you confirm what is the memory allowed for the VMs running for containerd 1.6, 1.7 and main? I guess it's different size.
Can you check what is assigned to the kubeReserved memory in containerd 1.6 and 1.7 CI? It would be great to check what those numbers are for the CI.
Can you try to run the CI for 1.6 with the patch I pasted here for the k8s bits? I guess if we use 100Mi the CI will pass just fine.
Another fix might be to stop checking the return code. This fixes the tests too. The return code can change according to what process is OOM-Killed (not sure why yet, although it probably makes sense and can't be changed in any reasonable way), so stop checking it might be an option too.

My gut feeling is that this calculation based on percentage is not very good, we should at least add a max between that number and 100Mi or so. But I'd check out later what the right fix might be for this, I'm too tired now :-D

samuelkarp commented 1 month ago

I found the relevant difference between the containerd 1.6 CI and 1.7 + main CI is which cgroup driver is in use: cgroupfs vs. systemd. When moving containerd 1.6 to use the systemd cgroup driver, the tests no longer fail.

dims commented 1 month ago

@samuelkarp +1 to use systemd cgroup driver consistently!

samuelkarp commented 1 month ago

@dims this isn't quite about consistency. When both Kubelet and containerd are configured to use cgroupfs, the test fails. When both Kubelet and containerd are configured to use the systemd cgroup driver, the test passes. I didn't try to test a mixed mode (I don't think we'd consider that to be supported).

dims commented 1 month ago

@samuelkarp ah that sounds bad.

akhilerm commented 1 month ago

I didn't try to test a mixed mode (I don't think we'd consider that to be supported).

Using different driver on container runtime and kubelet will not work, as we have seen errors in kubernetes tests when done like that.

samuelkarp commented 1 month ago

@rata

It's EOD here and I'm just getting some progress with this. I'll expand more tomorrow, including the repro I found.

If you have time to share the information you found, that would very much be appreciated :smile:.

My gut feeling is that the VMs running the CI for containerd 1.6 have less mem than the ones for 1.7 and main [...]

Can you confirm what is the memory allowed for the VMs running for containerd 1.6, 1.7 and main? I guess it's different size.

All the jobs are running on GCE e2-medium VMs with 1 vCPU (2 fractional vCPUs) and 4 GiB of memory.

Can you check what is assigned to the kubeReserved memory in containerd 1.6 and 1.7 CI? It would be great to check what those numbers are for the CI.

Can you try to run the CI for 1.6 with the patch I pasted here for the k8s bits? I guess if we use 100Mi the CI will pass just fine.

I'll see if we can find answers for these; I'm a containerd maintainer not a Kubernetes maintainer so I'm less familiar with how these tests are run (through the Kubernetes test infrastructure, and Kubernetes test source code). But the VM sizes are the same for 1.6, 1.7, and main, so they should all have the same kubeReserved calculation.

I'm not sure if you saw above, but I did find a difference:

cgroup v2 + cgroupfs - this is the failing configuration
cgroup v2 + systemd cgroup - this is the passing configuration

We don't currently have cgroup v1 test coverage in the Kubernetes e2e/containerd CI jobs; fixing that is a separate issue. Does runc have different codepaths depending on whether cgroupfs or the systemd cgroup driver is in use? I see a --systemd-cgroup flag documented in the man page.

to fix a CVE, we really want that :)

Yep, I am familiar and agree that the CVE should remain mitigated. I also am familiar with the impact to systemd for the bindfd mitigation, so I understand why that's undesirable as well.

lifubang commented 1 month ago

My suggestion is to change the Kubernetes e2e test to run Kubernetes in a cgroup with a more reasonable memory limit.

Yes, if this can be changed in kubernetes e2e test, it will be better.

The core reason is that the memory accounting should not include runc init processes before we start the container's init process. So, we can join the cgroup as later as possible, at least after call ensure_cloned_binary, then the memory usage will not be included in cgroup memory accounting. (LIKE #4438).

Please consider whether this is worth to draft a new release 1.1.16 or not?

lifubang commented 1 month ago

Could you help to test #4439 in containerd 1.6?

I have test it in my machine, it works fine:

The config.json:

root@acmcoder:/opt/ubuntu# cat config.json | jq .process.args
[
"cat",
"/sys/fs/cgroup/memory.peak",
"/sys/fs/cgroup/memory.max",
"/sys/fs/cgroup/cgroup.procs"
]
root@acmcoder:/opt/ubuntu# cat config.json | jq .linux.resources.memory
{
"limit": 10485760
}

So, the first line of output from the container is the peak usage of memory, the second line is the limit of memory.

Using runc 1.1.15, the peak memory usage will big than the limit for sometimes:

root@acmcoder:/opt/ubuntu# ./runc.1.1.15 -v
runc version 1.1.15
commit: v1.1.15-0-gbc20cb44
spec: 1.0.2-dev
go: go1.22.3
libseccomp: 2.5.5
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
5681152
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
2572288
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
8728576
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
5181440
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
11657216
10485760
1
# peak > max
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
2629632
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
11231232
10485760
1
# peak > max

Using #4439, there is no situation of peak>max:

root@acmcoder:/opt/ubuntu# ./runc.1.1-dev -v
runc version 1.1.15+dev
commit: v1.1.13-40-gea5f8e04
spec: 1.0.2-dev
go: go1.23.2
libseccomp: 2.5.5
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2789376
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2916352
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3239936
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2711552
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2662400
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3051520
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2797568
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3293184
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2789376
10485760
1

lifubang commented 1 month ago

Please consider whether this is worth to draft a new release 1.1.16 or not?

And I think there is no this situation in the main branch, because the binary clone operation has been moved from runc init to runc create/run. Could you consider waiting for 1.2.0 release?

akhilerm commented 1 month ago

I have triggered a test to use runc https://github.com/opencontainers/runc/pull/4439/commits/ea5f8e0445e80dc29acd9b5d57481a0e936af89b.

cgroupfs job: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/containerd_containerd/10797/pull-containerd-node-e2e-1-6/1844595194326945792 systemd job: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/containerd_containerd/10797/pull-containerd-node-e2e-1-6-systemd-cgroup/1844595302615486464

akhilerm commented 1 month ago

The test with the https://github.com/opencontainers/runc/pull/4439 fix has passed for both cgroupfs and systemd

rata commented 1 month ago

@samuelkarp @akhilerm @lifubang Great findings, thanks!

I think I found a bug in the PR @lifubang created (commented there). But when trying to fix it, I think I might have found another bug that was there for a while. More details and links below :)

Repro the k8s test failure locally

But first things first. Let me share my setup to run the kubernetes tests locally, as it is definitely not trivial to find (or it wasn't for me :D).

Checkout the containerd repo, start containerd running as root. I'm just doing this:

make && sudo make install && sudo bin/containerd --config test-runc-mem-issue.toml

The config doesn't seem to be special. I'm using version 2 and just changed from the default (generated with bin/containerd condig default a few years ago, probably what containerd 1.6/1.7 dumps today) this:

root = "/var/lib/containerd-rata"
state = "/run/containerd-rata"
[grpc]
  address = "/run/containerd-rata/containerd.sock"

I tried chaning the systemd_cgroup to true/false, but it didn't seem to make a difference. My config is here, just in case: https://paste.debian.net/1331999/

I'm using containerd in this commit 906c23218c21373961bae0665a1dec6b89852fc2. Although I don't think this is relevant either.

Then, checkout the kubernetes repo (github.com/kubernetes/kuberentes). I'm using this commit, although it doesn't seem to matter either: d9c46d8ecb1ede9be30545c9803e17682fcc4b50.

Then, do this from the repo root:

export PATH="/home/rodrigo/src/kinvolk/kubernetes/kubernetes/third_party/etcd:${PATH}"
export CONTAINER_RUNTIME_ENDPOINT=/run/containerd-rata/containerd.sock
export IMAGE_SERVICE_ENDPOINT=/run/containerd-rata/containerd.sock
export CONTAINER_RUNTIME=remote

cp test/e2e_node/jenkins/default-kubelet-config.yaml /tmp/e2e-kubelet.yaml
# Edit the file: remove empty lines and comments and set failSwapOn: false. Otherwise the kubelet fails to start
vim /tmp/e2e-kubelet.yaml

# Compile the test binaries
make all

# Finally run the tests
KUBELET_CONFIG_FILE="/tmp/e2e-kubelet.yaml" KUBE_ROOT="$PWD" FOCUS="OOMKiller" hack/make-rules/test-e2e-node.sh

Here it doesn't fail reliably (but it never takes more than 3 tries to hit the fail) and that is what confused me in the previous message. In the CI does fail/pass reliably, but not locally for me, which I didn't expect.

Also, while debugging the issue locally, I created a bash script to wrap runc. I printed the content of /proc/self/cgroup and there is indeed no cgroup applies to runc, it always printed the file as empty. So, there seems to be no cgroup in which runc is running for these tests. That is not creating the OOM

The runc-wrapper script is just this simple script, just in case:

#!/bin/bash

echo "Running: runc $@" >> /tmp/rata.log
echo "running runc in cgroup: $(cat /proc/selc/cgroup)" >> /tmp/rata.log
exec <path to real runc> "$@"

Fixing this issue

In my previous message I thought the kubelet was creating a cgroup to run runc and that was failing. It seems the kubelet can do that (for the kubeReserved, etc.) but it is not doing it in the tests that fail. The fact that fails reliably on CI and not locally, plus that it was kind of late here, confused me the other day. Let's see if this time it is better, as it's also getting late now :D.

@lifubang great findings! I think that fix sounds reasonable at first sight, but it might be introducing a bug (explained it here). I've experimented with a way to fix it here: https://github.com/rata/runc/commits/runc-1.1-mem-issue/.

EDIT: The race doesn't seem possible, because nsexec does another read (a lot of them!) before exec'ing into the user process. So this is safe, the comment in the code is about applying the cgroup before sending the bootstrapData, so we sync on that. Sorry for the noise, I leave below what race I thought there was.

While doing it, I realized we are not waiting to join the cgroup for the setns case (we were only doing it for the initProcess case). In other words, on the setns case, the nsexec.c might continue and create the process before we apply it to the cgroup (we are not syncing on that, although it is unliklely to win that race in practice). My branch there changes that too. It seems to me this lack of sync was on purpose, but I can't see why now. Any thoughts?

In fact, I'd need to check if in (*initProcess).start(), the var p.messageSockPair.parent that is of time *os.File, is buffered or a write there blocks until the other end reads? Because this comment seems to assume that nsexec.c when writing here it will block until the go side reads. But I'm unsure that is the case (I'll test it later).

If that is the case, though, my branch is fixing it in any case, as we are doing a read on nsexec now (that will block until there is something to read).

My branch is a work in progress, not ready for a PR yet. But enough to play and experiment further on this.

Am I missing something? @cyphar can you take a look at my branch regarding the nsexec races I explained above?

It's not clear to me yet why on CI only the cgroupfs driver seems to be affected, if anyone have any thoughts on this, please let me know :)

Some TODOs so I don't forget:

[ ] Verify lifubang example using systemd/cgroups driver if it's not too painful to test. This will gives us some light if CI is "getting lucky" or if it doesn't seem to be affected.
[ ] If systemd driver is not affected, understand why
[x] Verify if the two races in nsexec are indeed a problem and verify (setns not waiting for cgroup, standard maybe not waiting today as the socket might not block on write?)
[ ] Verify if main is affected, the nsexec code changed considerably
[x] Polish and verify that on error paths no process is left hanging.
[ ] Add tests for what we can cover
[ ] Try to revert https://github.com/opencontainers/runc/pull/4020 and its 1.1 backport. It might not be needed after the fixes.

lifubang commented 1 month ago

Unfortunately, #4399 will influence runc's performance, please see https://github.com/opencontainers/runc/pull/4439#discussion_r1796784791

100 containers test result: runc-1.1.15: 3.025s patch 4399: 4.123s

But we should know that patch 4399 has the same start time usage with v1.2.0-rc.3. This is because the time of join cgroup is a little less than the time of binary clone, so let them executed sequentially will raise the start time usage of the container. So #4399 and v1.2.0-rc.3 has the same start time usage.

In another words, we should also have to find a solution to reduce the runc start time usage for v1.2.0.

lifubang commented 1 month ago

I strongly recommend to change the Kubernetes e2e test to run Kubernetes in a cgroup with a more reasonable memory limit. For example: 20M.

tianon commented 1 month ago

@rata is /proc/selc/cgroup (selc) a typo just in your comment or also in your wrapper script? 👀

rata commented 1 month ago

@tianon great catch! It's a typo, I adjusted to remove some other stuff I have, sorry :D

rata commented 1 month ago

@lifubang How did you test the container time? Just a bash for loop and the time it takes to do all of them? Can you share, just in case?

Also, 1100ms difference on 100 containers is just 11ms increase time for each creation. And if it's on-par with main, is this really a big issue? Or am I again too tired now? :D

lifubang commented 1 month ago

Just a bash for loop and the time it takes to do all of them?

Yes.

Can you share, just in case?

I can share it tomorrow, because it's in my another PC.

My step to test: 1) I put all versions runc we want to test in current dir. 2) Update config.json: .process.args = [ "true" ] | .process.terminal = false; 3) ./bench.sh

#!/bin/bash
ver=$1
for i in {1..100}; do
  ./runc-$ver run test
done

4) Run it: time ./bench.sh 1.1.15

akhilerm commented 1 month ago

I am hitting this issue randomly in my local setup, when using vagrant boxes with systemd while running CRI tests. Tried using the exact same steps as used in the containerd CI.

rata commented 1 month ago

This seems to also affect other containerd branches, not just 1.6 with cgroupfs. See this from main (main is not using cgroupfs):

IMHO, we should release 1.1.16 as soon as we merge #4439

None situation is ideal, either without this fix or with this fix. I think these test's mem limits are kind of artificial and probably not super relevant in real life, but it is breaking the CI for users, so it is a real problem. On the other hand, with #4439, the memory of the clone will go to the main runc process (so outside the container cgroup) and maybe real setups might have an issue with that (like maybe people using kube-reserved and a lot of churn? or kubernetes clusters with readiness probes that use exec + using kube-reserved?).

For now the latter is a theretical idea that we don't know it will happen or not, so IMHO let's just fix the real problems that we have, as #4439 does, and if this is a problem for users, of course we can revert and ask projects to adjusts the CI tests.

samuelkarp commented 1 month ago

I see runc 1.2.0 was released. A few questions:

Is there still a plan for 1.1.6? Or is the recommendation that downstreams (such as containerd, distros, cloud providers, etc) migrate to 1.2.0?
In general, is there an expected support lifetime for the 1.1.x series?
Is our expectation that the new fsopen(2)-based overlayfs approach (https://github.com/opencontainers/runc/pull/4448) will avoid the issues with systemd mount tracking?

rata commented 1 month ago

@samuelkarp thanks, I wanted to update here.

I think 1.1.16 is not planned now. The idea is for downstreams to update (as it was when 1.1 was released). Let us know if you have any issues.
Currently only the latest release is supported. That is 1.2.x now. I'd appreciate your input here, if you have any.
Yes, exactly. We did some stress tests for that and the containerd PR opened with main is not failing with that. Let us know if you find any issues with that, though.

samuelkarp commented 1 month ago

Apologies for the delay; I was away for a few days.

Thanks, that makes sense. I don't have an immediate need or 1.1.16 if 1.2.0 is out (and is bug-free). https://github.com/containerd/containerd/pull/10877#issuecomment-2450201300 makes it sound like we are waiting on 1.2.1 now.
Thanks, that was also my understanding.
Thanks!

opencontainers / runc