Description
When gVisor runs under Kubernetes with cgroups v2 enabled, guest OOMs are reported as exit code 128 or 143 (SIGTERM + 128), and the OCI OOM event is not published.

In this configuration, gVisor runs as a child cgroup (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice/cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope) of the pod's cgroup (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice). The child cgroup does not set memory limits itself; enforcement is inherited from the pod's cgroup. gVisor watches for OOMs with an inotify watch on the child cgroup's memory.events file, and it appears these OOM events do not propagate down to the child.

I've been able to illustrate this by running

tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs

from the pod's cgroup directory, which displays cgroup membership and memory.events at both levels. Because the child cgroup is torn down immediately after gVisor exits, it's possible that memory.events is updated but the update is missed or mishandled by gVisor. That said, the child's memory.events shows 0 for every counter, including max, which makes me suspect the OOM is never accounted to the child because the child sets no limits of its own.

Steps to reproduce
Set up Kubernetes + gVisor on a cgroups v2-based OS (Debian bookworm): https://gist.github.com/jcodybaker/dda983722831263536be04538e5eb7de
Create a pod which exceeds the memory available.
Wait for the pod to crash.
Then inspect its status:
runsc version
docker version (if using docker)
No response
uname
Linux node 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
No response
runsc debug logs (if available)
No response