google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0
15.61k stars 1.28k forks

Resource consumption by Python is not limited #10264

Closed chivalryq closed 3 months ago

chivalryq commented 5 months ago

Description

I'm building a sandbox service with gVisor, but Python seems to be able to allocate unlimited memory, while a bash script trying to allocate unlimited memory is marked Error in the Pod status.

Steps to reproduce

  1. Set up a Kubernetes cluster with a gVisor RuntimeClass.
  2. Apply the Deployment below. It will try to allocate roughly 100GB of memory (100,000 chunks of 10^6 bytes each).
cat << 'EOF' | kubectl apply -f -                                                                              
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: memory-eater-python
  name: memory-eater-python
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: memory-eater-python
  template:
    metadata:
      labels:
        app: memory-eater-python
    spec:
      containers:
      - command:
        - python
        args: ["-c", "import sys; big_list = []; print('Attempting to allocate 100GB of memory...'); [big_list.append(' ' * 10**6) for _ in range(100000)]"]
        image: python
        name: ubuntu
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 999
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor
EOF
  3. After a while, run kubectl top:
    kubectl top pod -n default <pod-name>

I got the result below. The memory usage is ~62GiB in my pod; I'm trying to investigate why it makes our machine go OOM, since the pod tries to allocate ~100GB of memory in total.

NAME                                  CPU(cores)   MEMORY(bytes)
memory-eater-python-887b744f9-2snvs   984m         62654Mi
  4. As a negative case, the bash script is limited and the pod fails (a quick way to confirm the OOM kill is shown after the manifest below).
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: memory-eater-bash
  name: memory-eater-bash
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: memory-eater-bash
  template:
    metadata:
      labels:
        app: memory-eater-bash
    spec:
      containers:
      - command:
        - bash
        - -c
        - big_var=data; while true; do big_var="$big_var$big_var"; done
        image: python
        name: ubuntu
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 999
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor
EOF
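
For reference (not part of the original report), one quick way to confirm that the bash variant is actually killed by the memory limit, once its container has been restarted at least once, is to check the last termination reason and the pod events (pod name is illustrative):

    # An enforced limit shows up as "OOMKilled" in the last termination state.
    kubectl get pod -n default <pod-name> \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

    # Events and restart history for the pod.
    kubectl describe pod -n default <pod-name>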

runsc version

runsc version release-20231009.0
spec: 1.1.0-rc.1

docker version (if using docker)

No response

uname

Linux 3090-k8s-node029 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:32:22Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

repo state (if built from source)

No response

runsc debug logs (if available)

Haven't done it in the cluster.

EtiennePerot commented 4 months ago

Hi; I can't seem to reproduce this, at least on GKE.

gVisor doesn't do memory limiting by itself; instead, it relies on the host Linux kernel to do this. The limit is set up as part of container startup, which eventually ends up configuring the sandbox's cgroup to control memory. This way, a single limit covers the combined memory usage of the gVisor kernel and the processes running inside it. If that goes over the limit, the sandbox should be killed by the Linux OOM killer, and this should be visible in dmesg on the machine.
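
As a sketch of that check (exact paths depend on the cgroup version and driver, so treat them as illustrative), on the node running the sandbox:

    # cgroup v2: locate the pod's memory limit under the kubepods hierarchy.
    # (On cgroup v1 the file is memory.limit_in_bytes instead of memory.max.)
    find /sys/fs/cgroup -path '*kubepods*' -name memory.max 2>/dev/null | head

    # The limit for the memory-eater pod should read 536870912 (512Mi), not "max".
    cat /sys/fs/cgroup/kubepods.slice/<pod-slice>/memory.max

    # Check whether the Linux OOM killer fired.
    dmesg | grep -i -E 'out of memory|oom-kill' | tail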

The enforcement mechanism depends on many moving parts, so I suggest checking all of them.

If all of this is in place, please provide runsc debug logs, details on how you installed gVisor within the Kubernetes cluster (runsc flags etc.), the systemd version (systemd --version), the cgroup version (output of cat /proc/mounts), and which cgroup controllers are enabled (cat /sys/fs/cgroup/cgroup.controllers).
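
For convenience, those environment details can be gathered on the node with something like this (assuming a typical systemd node; cgroup.controllers only exists on cgroup v2):

    systemd --version                      # systemd version
    grep cgroup /proc/mounts               # a cgroup2 mount means cgroup v2
    cat /sys/fs/cgroup/cgroup.controllers  # enabled cgroup v2 controllers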

Also, please check #10371, which was filed recently after this issue and looks quite similar.

chivalryq commented 4 months ago

@EtiennePerot Thanks for replying! We have found the problem thanks to @charlie0129.

It turns out that we didn't configure gVisor to use systemd-cgroup, which is the cgroup manager in our cluster. After adding systemd-cgroup and upgrading gVisor to the latest version, the OOM pod is properly killed by Linux. If I understand correctly, the default option is to use cgroupfs, which is not the mainstream choice. Would it be better to move to systemd-cgroup as the default?
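
For anyone hitting the same issue, a minimal sketch of enabling systemd-cgroup, assuming gVisor is installed through its containerd shim and the shim reads runsc flags from /etc/containerd/runsc.toml (set via ConfigPath in containerd's runtime options; adjust paths and the restart step to your setup):

# Enable the systemd cgroup driver for runsc via the shim config file.
cat << 'EOF' > /etc/containerd/runsc.toml
[runsc_config]
  systemd-cgroup = "true"
EOF
systemctl restart containerd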

But I can't seem to find any related documentation/FAQ about the cgroup manager. Forgive me if I missed it. If there truly isn't any, it would be kind to mention this somewhere in the documentation.

EtiennePerot commented 4 months ago

Would it be better to move to systemd-cgroup as a default?

See the discussion on https://github.com/google/gvisor/issues/10371 about this. Apparently runc's default behavior is also systemd-cgroup=false, and runsc needs to match runc's behavior in order to remain a drop-in replacement for it. But +1 on the need for documentation.