I tried addressing this in code here: https://github.com/google/gvisor/pull/9575
If this approach seems viable, I can add tests and formalize the PR. Or feel free to copy and adapt it.
Thanks for the detailed report! IIUC we will have to walk the full cgroup path to calculate the real memory limit and cpu quotas. Otherwise there may be cases where we miss a limit set higher up in the hierarchy. If you modify your draft PR to do this I can review it and we can merge it in.
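For illustration, here is a minimal sketch in Go of what that hierarchy walk could look like for memory.max. This is illustrative only, not the existing gVisor code; the kubepods path in main is a hypothetical placeholder, and cpu.max would need a similar walk that also handles the quota/period format.

```go
// Minimal sketch (not gVisor code): find the effective cgroup v2 memory limit
// by walking from a leaf cgroup directory up to the cgroup root and taking the
// smallest memory.max value found. "max" means "no limit at this level", so a
// parent may still impose the real limit.
package main

import (
	"fmt"
	"math"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

const cgroupRoot = "/sys/fs/cgroup"

func effectiveMemoryLimit(cgroupDir string) uint64 {
	limit := uint64(math.MaxUint64)
	for dir := cgroupDir; strings.HasPrefix(dir, cgroupRoot); dir = filepath.Dir(dir) {
		data, err := os.ReadFile(filepath.Join(dir, "memory.max"))
		if err != nil {
			continue // file may be missing (e.g. at the cgroup root)
		}
		val := strings.TrimSpace(string(data))
		if val == "max" {
			continue // unlimited at this level; keep walking up
		}
		if n, err := strconv.ParseUint(val, 10, 64); err == nil && n < limit {
			limit = n
		}
	}
	return limit
}

func main() {
	// Hypothetical pause-container leaf cgroup of a burstable pod.
	leaf := cgroupRoot + "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podXXX.slice/cri-containerd-YYY.scope"
	fmt.Println(effectiveMemoryLimit(leaf))
}
```

Taking the minimum over the whole path means a limit set higher up (for example on kubepods-burstable.slice) is not missed even when the leaf reports "max".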
Hi Lucas, thanks for the quick attention to this.
With regard to walking the cgroup tree, I'm open to that, but I worry it won't produce a stable result. For example, if the gVisor pod is Kubernetes QoS burstable (cpu/memory request != limit), the pod gets a pod slice cgroup in /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice. When a guaranteed QoS pod (limit == request) is scheduled to the node, it gets placed directly in /sys/fs/cgroup/kubepods.slice/ and the kubelet will reduce /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/memory.max accordingly. Since --total-memory and --num-cpus are only configured at the initial sandbox create, these values will become stale.
My expectation is that folks want the --total-memory and --num-cpus values to reflect the pod limits, even in cases where those limits could not be satisfied. Similarly, pods without limits could find themselves in a situation where a guaranteed pod is removed, returning memory to the parent kubepods.slice and making more memory/cpu available than is listed in the initial --total-memory value.
Ah yep, you are right. I think the issue is actually with how we're setting up our systemd cgroupv2 paths initially. I should have a PR that will patch this soon.
I'm curious what you come up with there. I initially had similar thoughts after seeing what the shim does for parent cgroups in the non-systemd cgroups v1 code. There, pids are bound to the parent pod cgroup itself rather than the CRI container cgroup. The per-container child cgroups are still instantiated, but they seem to largely serve cadvisor, specifically the limit-related metrics like container_spec_memory_limit_bytes and friends.
That doesn't seem workable with systemd because the pod cgroup is a slice unit (which cannot hold processes). Since the container IDs / cgroup paths for the workload container(s) (the non-pause containers) aren't known when the sandbox is launched, there's not much choice but to put the sandbox into the subgroup for the pause container. I suppose the pids could be moved to a new cgroup for the workload container(s), but with multiple containers possible that seemed less than ideal.
Finding a replacement for those cadvisor limit metrics is the next task on my list, so if there's a way to make that work in the process I'm interested.
Could you share your debug logs from running the container? From your containerd config it should be in /var/log/runsc/%ID%/gvisor.%COMMAND%.log
After a bit more digging, the issue here seems to be that your command kubectl exec $(kubectl get pods -o name) -- sh -c 'free -m && cat /proc/cpuinfo' runs inside the sandbox and reads from the sandbox-internal sentry cgroup implementation. These internal cgroups are initialized with default values that don't reflect the cgroup limits from outside the sandbox. Admittedly, we should initialize the sandbox's top-level cgroup limits to the process's cgroup limits, but that's not in place right now.
cgroupv2 should still be enforcing the proper memory/cpu limit on the sandbox process externally since the parent slice has those limits set.
Sorry if there was confusion. Yes, the limits are being enforced by virtue of the parent pod cgroup. However, under cgroups v1 / non-systemd, the total memory and CPU count were programmatically determined from the cgroups data here:
But this code doesn't work as expected with cgroups v2 / systemd because the processes belong to the pause container's leaf cgroup instead of the parent pod cgroup. The leaf node has both its memory and cpu values set to "max", indicating they should inherit from the parent. The code just treats this as "unlimited" and sets the values to the full system memory and CPU count.
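To make the "max" behavior concrete, here is a small Go sketch (again, not the gVisor implementation) of how a cgroup v2 cpu.max file maps to a CPU count: the leaf's "max 100000" parses as unlimited even though the parent pod cgroup holds the real quota.

```go
// Minimal sketch: derive a CPU count from a cgroup v2 cpu.max file, whose
// contents are "<quota> <period>" or "max <period>". A quota of "max" means
// no limit at this level, which is what the pause-container leaf reports.
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// cpusFromCPUMax returns the implied CPU count, or 0 when there is no quota.
func cpusFromCPUMax(contents string) int {
	fields := strings.Fields(contents)
	if len(fields) != 2 || fields[0] == "max" {
		return 0 // unlimited: callers would fall back to the host CPU count
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || period <= 0 {
		return 0
	}
	return int(math.Ceil(quota / period))
}

func main() {
	fmt.Println(cpusFromCPUMax("max 100000"))    // leaf cgroup: 0, i.e. "unlimited"
	fmt.Println(cpusFromCPUMax("200000 100000")) // parent pod cgroup: 2 CPUs
}
```

With the current behavior, the leaf's "unlimited" result is what drives the flags to the full host CPU count and memory; reading (or walking up to) the parent pod cgroup would yield the intended 2 CPUs here.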
Ok, I see. I thought there was an issue with the enforcement of the limits. Thanks for working with me on this. I started a review of your draft PR #9575.
The fixes were a little more complex than I originally thought, and I wanted to add a couple of unit tests, so I added the fixes in my own PR #9631. Thank you for your help with this issue.
Description
When running under Kubernetes with cgroups v2 (unified) + systemd, gVisor fails to detect the pod's memory limits and cpu quotas used to set the --total-memory and --cpu-num flags.
Under Kubernetes + cgroups v2 + systemd, gVisor launches all processes into the container subgroup associated with the pause container. This makes some sense: cgroups v2 specifies that processes can only exist at leaf nodes, and the pod's cgroup is registered with systemd as a slice (an intermediate unit which cannot have its own processes). When the sandbox is launched gVisor needs a container subgroup, and the pause container is the first to be launched. The pause container is a child of the pod cgroup and therefore inherits the limits of the parent pod cgroup, BUT the child's controllers reflect the default "max" value. This in turn means that the code which reads the memory limit and cpu quota treats them as unlimited:
https://github.com/google/gvisor/blob/8246598313a51a3c16664eebcaaa7e57a35afdbc/runsc/sandbox/sandbox.go#L1011-L1046
Steps to reproduce
Boot vanilla Debian Bookworm OS
Install containerd + the required apt repositories
Install gVisor & reconfigure containerd
https://gvisor.dev/docs/user_guide/install/
Install kubeadm + kubectl ( https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ )
Setup Kubernetes
Create a pod w/ limits
The pod should show 512M of memory and 2 CPUs, but instead shows the total system memory and CPU count.