google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0
15.61k stars 1.28k forks

Resource consumption by Python is not limited #10264

Closed chivalryq closed 3 months ago

chivalryq commented 5 months ago

Description

I'm building a sandbox service with gVisor, but Python seems to be able to allocate unlimited memory, while a bash script trying to allocate unlimited memory is marked Error in the Pod status.

Steps to reproduce

  1. Set up a Kubernetes cluster with a gVisor RuntimeClass.
  2. Apply the Deployment below. It will try to allocate roughly 100GB of memory (100,000 chunks of 10^6 bytes each).
cat << 'EOF' | kubectl apply -f -                                                                              
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: memory-eater-python
  name: memory-eater-python
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: memory-eater-python
  template:
    metadata:
      labels:
        app: memory-eater-python
    spec:
      containers:
      - command:
        - python
        args: ["-c", "import sys; big_list = []; print('Attempting to allocate 100GB of memory...'); [big_list.append(' ' * 10**6) for _ in range(100000)]"]
        image: python
        name: ubuntu
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 999
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor
EOF
  3. After a while, run kubectl top:
    kubectl top pod -n default <pod-name>

I got the result below. The memory usage is ~62GiB in my pod; I'm trying to investigate why it makes our machine go OOM, since the pod tries to allocate ~100GB of memory in total.

NAME                                  CPU(cores)   MEMORY(bytes)
memory-eater-python-887b744f9-2snvs   984m         62654Mi
  4. As a negative case, the bash script is limited and the pod fails (a quick way to confirm the OOM kill is shown after the manifest below).
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: memory-eater-bash
  name: memory-eater-bash
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: memory-eater-bash
  template:
    metadata:
      labels:
        app: memory-eater-bash
    spec:
      containers:
      - command:
        - bash
        - -c
        - big_var=data; while true; do big_var="$big_var$big_var"; done
        image: python
        name: ubuntu
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 999
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor
EOF
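
For reference (not part of the original report), one quick way to confirm that the bash variant is actually killed by the memory limit, once its container has been restarted at least once, is to check the last termination reason and the pod events (pod name is illustrative):

    # An enforced limit shows up as "OOMKilled" in the last termination state.
    kubectl get pod -n default <pod-name> \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

    # Events and restart history for the pod.
    kubectl describe pod -n default <pod-name>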

runsc version

runsc version release-20231009.0
spec: 1.1.0-rc.1

docker version (if using docker)

No response

uname

Linux 3090-k8s-node029 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:32:22Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

repo state (if built from source)

No response

runsc debug logs (if available)

Haven't done it in the cluster.

EtiennePerot commented 4 months ago

Hi; I can't seem to reproduce this, at least on GKE.

gVisor doesn't do memory limiting by itself; instead, it relies on the host Linux kernel to do this. The limit is set up as part of container startup, which eventually ends up configuring the sandbox's cgroup to control memory. This way, a single limit covers the combined memory usage of the gVisor kernel and the processes running inside it. If that goes over the limit, the sandbox should be killed by the Linux OOM killer, and this should be visible in dmesg on the machine.
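
As a sketch of that check (exact paths depend on the cgroup version and driver, so treat them as illustrative), on the node running the sandbox:

    # cgroup v2: locate the pod's memory limit under the kubepods hierarchy.
    # (On cgroup v1 the file is memory.limit_in_bytes instead of memory.max.)
    find /sys/fs/cgroup -path '*kubepods*' -name memory.max 2>/dev/null | head

    # The limit for the memory-eater pod should read 536870912 (512Mi), not "max".
    cat /sys/fs/cgroup/kubepods.slice/<pod-slice>/memory.max

    # Check whether the Linux OOM killer fired.
    dmesg | grep -i -E 'out of memory|oom-kill' | tail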

The enforcement mechanism depends on many moving parts, so I suggest checking all of them.

If all of this is in place, please provide runsc debug logs, details on how you installed gVisor within the Kubernetes cluster (runsc flags etc.), the systemd version (systemd --version), the cgroup version (output of cat /proc/mounts), and which cgroup controllers are enabled (cat /sys/fs/cgroup/cgroup.controllers).
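
For convenience, those environment details can be gathered on the node with something like this (assuming a typical systemd node; cgroup.controllers only exists on cgroup v2):

    systemd --version                      # systemd version
    grep cgroup /proc/mounts               # a cgroup2 mount means cgroup v2
    cat /sys/fs/cgroup/cgroup.controllers  # enabled cgroup v2 controllers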

Also, please check #10371, which was filed recently after this issue and looks quite similar.

chivalryq commented 4 months ago

@EtiennePerot Thanks for replying! We have found the problem thanks to @charlie0129.

It turns out that we didn't configure gVisor to use systemd-cgroup, which is the cgroup manager in our cluster. After adding systemd-cgroup and upgrading gVisor to the latest version, the OOM pod is properly killed by Linux. If I understand correctly, the default option is to use cgroupfs, which is not the mainstream choice. Would it be better to move to systemd-cgroup as the default?
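
For anyone hitting the same issue, a minimal sketch of enabling systemd-cgroup, assuming gVisor is installed through its containerd shim and the shim reads runsc flags from /etc/containerd/runsc.toml (set via ConfigPath in containerd's runtime options; adjust paths and the restart step to your setup):

# Enable the systemd cgroup driver for runsc via the shim config file.
cat << 'EOF' > /etc/containerd/runsc.toml
[runsc_config]
  systemd-cgroup = "true"
EOF
systemctl restart containerd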

But I can't seem to find any related documentation/FAQ about the cgroup manager. Forgive me if I missed it. If there truly isn't any, it would be kind to mention this somewhere in the documentation.

EtiennePerot commented 4 months ago

Would it be better to move to systemd-cgroup as a default?

See the discussion on https://github.com/google/gvisor/issues/10371 about this. Apparently runc's default behavior is also systemd-cgroup=false, and runsc needs to match runc's behavior in order to remain a drop-in replacement for it. But +1 on the need for documentation.