mgoltzsche / podman-static

static podman binaries and container image
Apache License 2.0

Can't limit memory for nested container #66

Closed ilusharulkov closed 1 year ago

ilusharulkov commented 1 year ago

Hello! Thanks for your great work, this is awesome! I have a problem with memory limits for a nested container. For example, let's create a Go app that consumes 400 MB of RAM:

package main

import "fmt"

func main() {
    // Allocate 400 MiB and write to every byte so the pages become resident.
    lim := 400 << 20
    mem := make([]byte, lim)
    for i := 0; i < lim; i++ {
        mem[i] = '0'
    }
    fmt.Println("400mb")
}

Compile and run it:

go build -o 400 ./main.go && command time --verbose ./400 2>&1 >/dev/null | grep "Maximum resident set size (kbytes)"

The output is (on my machine):

Maximum resident set size (kbytes): 422192

which is about 412.3 MiB (422192 / 1024).

Now run this app in a container, using the minimal tag:

docker run --privileged --rm -w /workdir -v ./400:/workdir/400 mgoltzsche/podman:minimal \
podman run -v /workdir/400:/bin/400 -m 100m docker.io/alpine /bin/400

The output is (on my machine):

Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob sha256:7264a8db6415046d36d16ba98b79778e18accee6ffa71850405994cffa9be7de
Copying config sha256:7e01a0d0a1dcd9e539f8e9bbd80106d59efbdf97293b3d38f5d7a34501526cdb
Writing manifest to image destination
400mb

The app ran to completion and printed 400mb, even though the limit was -m 100m.

However, if we use the 4.6.1 image, the app exits with code 137, which is (I guess) correct.

command time --verbose \
docker run --privileged --rm -w /workdir -v ./400:/workdir/400 mgoltzsche/podman:4.6.1 \
podman run -v /workdir/400:/bin/400 -m 100m docker.io/alpine /bin/400 \
2>&1 >/dev/null | grep "Exit status"

The output is: Exit status: 137
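
For context on why 137 is the expected result here: by shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends when a cgroup exceeds its memory limit. The sketch below only decodes the number. The helper name signalFromExitStatus is hypothetical, and a status of 137 on its own does not prove that the OOM killer was the sender.

```go
package main

import (
	"fmt"
	"syscall"
)

// signalFromExitStatus (hypothetical helper) decodes a shell-style exit
// status of the form 128 + signal number. It returns the signal number and
// true if the status falls in that range, otherwise 0 and false.
func signalFromExitStatus(status int) (int, bool) {
	if status > 128 && status < 256 {
		return status - 128, true
	}
	return 0, false
}

func main() {
	sig, ok := signalFromExitStatus(137)
	if !ok {
		fmt.Println("status does not encode a signal")
		return
	}
	// Compare against SIGKILL, the signal used by the kernel OOM killer.
	fmt.Printf("signal %d, SIGKILL: %v\n", sig, syscall.Signal(sig) == syscall.SIGKILL)
}
```

An exit status of 0, as in the 500m run, is outside the 128+ range and decodes to "no signal".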

Let's raise the limit (100 MB -> 500 MB):

command time --verbose \
docker run --privileged --rm -w /workdir -v ./400:/workdir/400 mgoltzsche/podman:4.6.1 \
podman run -v /workdir/400:/bin/400 -m 500m docker.io/alpine /bin/400 \
2>&1 >/dev/null | grep "Exit status"

The output is: Exit status: 0

My question is: why is the memory limit ignored when using the minimal tag image?

mgoltzsche commented 1 year ago

Why memory limit is ignored when using minimal tag image?

Because the minimal image is configured to use the host's cgroup namespace instead of creating a new one for the container, see here. (For this reason the minimal image comes with crun instead of runc as container runtime.) This is to be able to run containers within environments where you don't have permissions to create new namespaces, e.g. within another container. Thus, if you want to set resource limits, you need to use the non-minimal image (or use its containers.conf with the minimal image).
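
One way to check which limit actually applies inside the nested container is to run a small program there that reads the cgroup's memory limit file. This is a minimal sketch, not part of podman-static. It assumes cgroup v2 mounted at /sys/fs/cgroup, where the file is memory.max. On cgroup v1 the equivalent file would be /sys/fs/cgroup/memory/memory.limit_in_bytes. The helper name parseMemoryMax is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryMax (hypothetical helper) interprets the contents of a
// cgroup v2 memory.max file. It returns the limit in bytes and true,
// or 0 and false when the file contains "max" (no limit set).
func parseMemoryMax(content string) (int64, bool, error) {
	s := strings.TrimSpace(content)
	if s == "max" {
		return 0, false, nil
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false, err
	}
	return n, true, nil
}

func main() {
	// Assumed path: cgroup v2 unified hierarchy. The file does not exist
	// in the root cgroup, so a read error is reported rather than fatal.
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("cannot read memory.max:", err)
		return
	}
	limit, limited, err := parseMemoryMax(string(data))
	if err != nil {
		fmt.Println("unexpected memory.max content:", err)
		return
	}
	if !limited {
		fmt.Println("no memory limit set for this cgroup")
		return
	}
	fmt.Printf("memory limit: %d bytes (%d MiB)\n", limit, limit>>20)
}
```

If -m 100m took effect, this would be expected to report 104857600 bytes. With the minimal image's configuration it would more likely report no limit, an error, or whatever limit applies to the outer container.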