Description
When gVisor runs under Kubernetes with cgroups v2 enabled, guest OOMs are reported as exit code 128 or 143 (SIGTERM + 128), and the OCI OOM event is not published.

In this configuration, gVisor runs as a child cgroup (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice/cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope) of the pod's cgroup (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice). The child cgroup does not set memory limits itself; enforcement is inherited from the pod's cgroup. gVisor watches for OOMs with an inotify watch on the child cgroup's memory.events file, and it appears these OOM events do not propagate down to the child.

I've been able to illustrate this by running

tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs

from the pod's cgroup directory, which displays cgroup membership and memory.events at both levels. Because the child cgroup is torn down immediately after gVisor exits, it's possible that memory.events is updated but the update is missed or mishandled by gVisor. That said, the child's memory.events shows 0 for every counter, including max, which makes me suspect the OOM is never accounted to the child because the child sets no limits of its own.

Steps to reproduce
Set up Kubernetes + gVisor on a cgroups v2-based OS (Debian bookworm): https://gist.github.com/jcodybaker/dda983722831263536be04538e5eb7de
Create a pod which exceeds the memory available.
Wait for the pod to crash.
Then inspect its status:
runsc version
docker version (if using docker)
No response
uname
Linux node 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
No response
runsc debug logs (if available)
No response