google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

gVisor fails to detect memory/cpu w/ systemd+cgroupsv2 #9580

Closed jcodybaker closed 1 year ago

jcodybaker commented 1 year ago

Description

When running under Kubernetes w/ cgroups v2 (unified) + systemd, gVisor fails to detect the pod's memory limits and cpu quotas used to set the --total-memory and --cpu-num flags.

Under Kubernetes + cgroups v2 + systemd, gVisor launches all processes into the container subgroup associated with the pause container. This makes some sense: cgroups v2 only allows processes at leaf nodes, and the pod's cgroup is registered with systemd as a slice (an intermediate unit which cannot hold its own processes). When the sandbox is launched, gVisor needs a container subgroup, and the pause container is the first one launched. The pause container's cgroup is a child of the pod cgroup and therefore inherits the parent pod cgroup's limits, but its own controller files report the default value of "max". As a result, the code below, which reads the memory limit and cpu quota from that leaf, treats both as unlimited (see the sketch after the listing below).

https://github.com/google/gvisor/blob/8246598313a51a3c16664eebcaaa7e57a35afdbc/runsc/sandbox/sandbox.go#L1011-L1046

$ pwd
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda5d4fbc7_30d6_43e8_8425_8ee6f401decc.slice

$ ls
# NOTE: I've omitted the controllers from this list to keep the ticket more concise.
cri-containerd-a991eb3b64a228cb0b269925127c28cb39d482d27c6248540e0e3b41633a907e.scope

$ cat memory.max
536870912

$ cat cgroup.procs
# Empty 

$ cat cri-containerd-a991eb3b64a228cb0b269925127c28cb39d482d27c6248540e0e3b41633a907e.scope/memory.max
max

$ cat cri-containerd-a991eb3b64a228cb0b269925127c28cb39d482d27c6248540e0e3b41633a907e.scope/cgroup.procs
3041197
3041201
3041229
3041230
3041242
3041244
3041307
3041309

# ctr shows the container subgroup holding the processes belongs to the pause container
$ ctr -n k8s.io c ls | grep a991eb3b64a228cb0b269925127c28cb39d482d27c6248540e0e3b41633a907e
a991eb3b64a228cb0b269925127c28cb39d482d27c6248540e0e3b41633a907e    registry.k8s.io/pause:3.6                          io.containerd.runsc.v1
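For illustration, here is a minimal Go sketch (not the actual runsc code; the helper name and paths are hypothetical) of why reading the limit files from the pause container's leaf cgroup comes back as unlimited: cgroup v2 reports "max" at the leaf even though the parent pod slice carries the real limit.

// Illustrative sketch only; not the actual runsc code. The helper name and
// paths below are hypothetical.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readMemoryMax reads memory.max from a cgroup v2 directory. It returns
// (limit, true) when a numeric limit is set, or (0, false) when the file
// contains "max", i.e. no limit at this level.
func readMemoryMax(cgroupDir string) (int64, bool, error) {
	data, err := os.ReadFile(filepath.Join(cgroupDir, "memory.max"))
	if err != nil {
		return 0, false, err
	}
	s := strings.TrimSpace(string(data))
	if s == "max" {
		// No limit at this level; a parent may still constrain us.
		return 0, false, nil
	}
	v, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false, err
	}
	return v, true, nil
}

func main() {
	// Hypothetical pause-container leaf cgroup (path shortened).
	leaf := "/sys/fs/cgroup/kubepods.slice/.../cri-containerd-<id>.scope"
	limit, ok, err := readMemoryMax(leaf)
	switch {
	case err != nil:
		fmt.Println("error:", err)
	case !ok:
		// This is the failure mode from the report: the leaf says "max",
		// so the caller falls back to total host memory even though the
		// parent pod slice has memory.max = 536870912.
		fmt.Println("no limit at leaf; falling back to host memory")
	default:
		fmt.Printf("memory limit: %d bytes\n", limit)
	}
}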

Steps to reproduce

Boot vanilla Debian Bookworm OS

Install containerd + apt repo stuff

apt-get install -y containerd

Install gVisor & reconfigure containerd

https://gvisor.dev/docs/user_guide/install/

(
  set -e
  ARCH=$(uname -m)
  URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
  wget ${URL}/runsc ${URL}/runsc.sha512 \
    ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
  sha512sum -c runsc.sha512 \
    -c containerd-shim-runsc-v1.sha512
  rm -f *.sha512
  chmod a+rx runsc containerd-shim-runsc-v1
  sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
)

cat << 'EOF' > /etc/containerd/config.toml
version = 2

[debug]
  level = "debug"
[plugins."io.containerd.runtime.v1.linux"]
  shim_debug = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc-ptrace]
    runtime_type = "io.containerd.runsc.v1"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc-ptrace.options]
        TypeUrl = "io.containerd.runsc.v1.options"
        ConfigPath = "/etc/containerd/runsc-ptrace.toml"
        SystemdCgroup = true
EOF

cat << 'EOF' > /etc/containerd/runsc-ptrace.toml
log_path = "/var/log/runsc/%ID%/shim.log"
log_level = "debug"
[runsc_config]
  debug = "true"
  debug-log = "/var/log/runsc/%ID%/gvisor.%COMMAND%.log"
  systemd-cgroup = "true"
  cpu-num-from-quota = "true"
EOF

systemctl restart containerd

Install kubeadm + kubectl ( https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ )

apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
apt-get update
apt-get install -y kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl

Setup Kubernetes

cat << 'EOF' > kubeadm.conf
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.28.2
networking:
  podSubnet: 192.168.0.0/16
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
EOF

sysctl net.ipv4.ip_forward=1
modprobe br_netfilter
kubeadm init --config kubeadm.conf 
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.3/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.3/manifests/custom-resources.yaml

cat << 'EOF' | kubectl create -f -
apiVersion: node.k8s.io/v1
handler: runsc-ptrace
kind: RuntimeClass
metadata:
  name: gvisor-ptrace
EOF

Create a pod w/ limits

cat << 'EOF' | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: example
  name: example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - command:
        - sleep
        - infinity
        image: ubuntu:jammy
        name: ubuntu
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor-ptrace
      tolerations:
      - operator: Exists
EOF

The pod should show 512Mi of memory and 2 CPUs, but instead shows the total system memory and CPU count.

kubectl exec $(kubectl get pods -o name) -- sh -c 'free -m && cat /proc/cpuinfo'
               total        used        free      shared  buff/cache   available
Mem:            7941           1        7934           0           5        7934
Swap:              0           0           0
...
processor       : 3
...

runsc version

$ runsc --version
runsc version release-20231016.0
spec: 1.1.0-rc.1

docker version (if using docker)

containerd --version
containerd github.com/containerd/containerd 1.6.20~ds1 1.6.20~ds1-1+b1

uname

Linux gvisor2 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

kubectl (if using Kubernetes)

$ kubectl version
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2

$ kubectl get nodes
NAME      STATUS   ROLES           AGE   VERSION
gvisor2   Ready    control-plane   10m   v1.28.3

repo state (if built from source)

No response

runsc debug logs (if available)

No response

jcodybaker commented 1 year ago

I tried addressing this in code here: https://github.com/google/gvisor/pull/9575

If this approach seems viable, I can add tests and formalize the PR. Or feel free to copy and adapt it.

manninglucas commented 1 year ago

Thanks for the detailed report! IIUC we will have to walk the full cgroup path to calculate the real memory limit and cpu quotas. Otherwise there may be cases where we miss a limit set higher up in the hierarchy. If you modify your draft PR to do this I can review it and we can merge it in.
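For illustration, a rough sketch of what such a walk could look like for the memory limit (the helper name and paths are assumptions, and this is not the eventual fix): start at the leaf and take the smallest memory.max set by any ancestor under the cgroup v2 root.

// Illustrative sketch only; not the eventual fix. The helper name and
// paths are assumptions.
package main

import (
	"fmt"
	"math"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

const cgroupRoot = "/sys/fs/cgroup"

// effectiveMemoryLimit walks from the given cgroup up to the cgroup v2
// root and returns the smallest memory.max found anywhere on the path,
// or math.MaxInt64 if no level sets a limit.
func effectiveMemoryLimit(cgroupPath string) int64 {
	limit := int64(math.MaxInt64)
	for dir := cgroupPath; strings.HasPrefix(dir, cgroupRoot); dir = filepath.Dir(dir) {
		data, err := os.ReadFile(filepath.Join(dir, "memory.max"))
		if err != nil {
			continue // controller file may not exist at this level
		}
		s := strings.TrimSpace(string(data))
		if s == "max" {
			continue // unlimited here; keep walking up
		}
		if v, err := strconv.ParseInt(s, 10, 64); err == nil && v < limit {
			limit = v
		}
	}
	return limit
}

func main() {
	// Hypothetical pause-container leaf cgroup from the report (shortened).
	leaf := "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/" +
		"kubepods-burstable-pod<uid>.slice/cri-containerd-<id>.scope"
	fmt.Printf("effective memory limit: %d bytes\n", effectiveMemoryLimit(leaf))
}

The cpu quota could be handled analogously, except that cpu.max holds a "quota period" pair rather than a single value.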

jcodybaker commented 1 year ago

Hi Lucas, thanks for the quick attention to this.

With regard to walking the cgroup tree, I'm open to that, but I worry it won't produce a stable result. For example, if the gVisor pod is Kubernetes QoS burstable (cpu/memory request != limit), the pod gets a pod slice cgroup under /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice. When a guaranteed QoS pod (limit == request) is scheduled to the node, the kubelet places it directly in /sys/fs/cgroup/kubepods.slice/ and reduces /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/memory.max accordingly. Since --total-memory and --cpu-num are only configured at initial sandbox creation, these values will become stale.

My expectation is that folks want the --total-memory and --cpu-num values to reflect the pod limits, even in cases where those limits could not be satisfied. Similarly, pods without limits could find themselves in a situation where a guaranteed pod is removed, handing memory back to the parent kubepods.slice and making more memory/cpu available than is reflected in the initial --total-memory value.

manninglucas commented 1 year ago

Ah yep, you are right. I think the issue is actually with how we're setting up our systemd cgroupv2 paths initially. I should have a PR that will patch this soon.

jcodybaker commented 1 year ago

I'm curious what you come up with there. I initially had similar thoughts after seeing what the shim does for parent cgroups in the non-systemd cgroups v1 code. In that case, pids are bound to the parent pod cgroup itself rather than to the per-container (CRI) cgroup. The per-container child cgroups are still instantiated, but they seem to exist largely to serve cAdvisor, specifically the limit-related metrics like container_spec_memory_limit_bytes and friends.

That doesn't seem workable with systemd because the pod cgroup is a slice unit (which cannot hold processes). Since the container id / cgroup path for the workload container(s) (the non-pause containers) isn't known when the sandbox is launched, there's not much choice but to put the sandbox into the subgroup for the pause container. I suppose the pids could be moved to a new cgroup for the workload container(s), but with multiple containers possible that seemed less than ideal.

Finding a replacement for those cadvisor limit metrics is the next task on my list, so if there's a way to make that work in the process I'm interested.

manninglucas commented 1 year ago

Could you share your debug logs from running the container? From your containerd config it should be in /var/log/runsc/%ID%/gvisor.%COMMAND%.log

jcodybaker commented 1 year ago

https://gist.github.com/jcodybaker/132f8e1f0ea118b2fc9d879ca28a35da

manninglucas commented 1 year ago

After a bit more digging, the issue here seems to be that your command kubectl exec $(kubectl get pods -o name) -- sh -c 'free -m && cat /proc/cpuinfo' runs inside the sandbox and reads from the sandbox-internal sentry cgroup implementation. These internal cgroups are initialized with default values that don't reflect the cgroup limits from outside of the sandbox. Admittedly we should initialize the sandbox's top level cgroup limits to the process's cgroup limits, but that's not in place right now.

cgroupv2 should still be enforcing the proper memory/cpu limit on the sandbox process externally since the parent slice has those limits set.

jcodybaker commented 1 year ago

Sorry if there was confusion. Yes, cgroup limits are being enforced by virtue of the parent pod cgroup. However, under cgroupv1 / non-systemd, the total memory and CPU were programmatically determined from the cgroup data here:

https://github.com/google/gvisor/blob/ba53672288fb20c6ded46010c148ff6fadbe4556/runsc/sandbox/sandbox.go#L1010-L1045

But this code doesn't work as expected with cgroups v2 / systemd because the processes belong to the pause container's leaf cgroup instead of the parent pod cgroup. The leaf node reports both the memory and cpu values as "max", indicating they should inherit from the parent, but the code just treats this as "unlimited" and sets the values to the full system memory and cpu count.

manninglucas commented 1 year ago

Ok, I see. I thought there was an issue with the enforcement of the limits. Thanks for working with me on this. I started a review of your draft PR #9575

manninglucas commented 1 year ago

The fixes were a little more complex than I originally thought and I wanted to add a couple unit tests, so I added the fixes in my own PR #9631. Thank you for your help with this issue.