nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

bug: sysbox-runc does not copy cpu.cfs_{quota,period}_us to syscont-cgroup-root #582

Open johnstcn opened 2 years ago

johnstcn commented 2 years ago

bug: sysbox-runc does not copy cpu.cfs_{quota,period}_us to syscont-cgroup-root

Summary

Similar to https://github.com/nestybox/sysbox/issues/303, sysbox-runc does not copy the cgroup CPU quota and period values from the parent cgroup to syscont-cgroup-root. Processes running inside the container therefore have no way of knowing how much CPU they actually have to work with.

Impact

Low. The container is still limited, but the limit is not visible from inside the container.

Steps to reproduce

Reproduced on sysbox-ce version 0.5.0 and version 0.5.2.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

$ uname -a
Linux bigred 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ docker info | grep Version
 Server Version: 20.10.17
 Cgroup Version: 1
 Kernel Version: 5.13.0-51-generic

# Docker's default runtime sets these values 
$ docker run -it --rm --memory=256M --cpu-quota=20000 --cpu-period=10000 alpine:latest
/ # cat /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
10000
/ # cat /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
20000

# outside the container the limits are set on the cgroup
$ cat /sys/fs/cgroup/cpu,cpuacct/docker/9e50ac0b99b2510dfacee0205116d5f1c0dc2b5acddef2d25746fcfdc74a91c4/cpu.cfs_period_us
10000
$ cat /sys/fs/cgroup/cpu,cpuacct/docker/9e50ac0b99b2510dfacee0205116d5f1c0dc2b5acddef2d25746fcfdc74a91c4/cpu.cfs_quota_us
20000

# sysbox-runc runtime does not
$ docker run -it --rm --memory=256M --cpu-quota=20000 --cpu-period=10000 --runtime=sysbox-runc alpine:latest
/ # cat /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
-1
/ # cat /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
100000
# outside the container the limits are set on the parent cgroup but not syscont-cgroup-root
$ cat /sys/fs/cgroup/cpu,cpuacct/docker/d3711f4abdbca7b0c016f373b7490d93dad6e042e9c5acfdd79ad272f12d37e1/cpu.cfs_period_us 
10000
$ cat /sys/fs/cgroup/cpu,cpuacct/docker/d3711f4abdbca7b0c016f373b7490d93dad6e042e9c5acfdd79ad272f12d37e1/cpu.cfs_quota_us 
20000

$ cat /sys/fs/cgroup/cpu,cpuacct/docker/d3711f4abdbca7b0c016f373b7490d93dad6e042e9c5acfdd79ad272f12d37e1/syscont-cgroup-root/cpu.cfs_period_us 
100000
$ cat /sys/fs/cgroup/cpu,cpuacct/docker/d3711f4abdbca7b0c016f373b7490d93dad6e042e9c5acfdd79ad272f12d37e1/syscont-cgroup-root/cpu.cfs_quota_us 
-1
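
In case it's useful in the meantime, here's a rough, untested sketch of a host-side workaround (assuming cgroup v1, run as root; the program name and argument handling are just for illustration). It only copies the values for visibility inside the container; the parent cgroup still enforces the actual limit:

// Untested sketch: copy the parent cgroup's CFS values down into
// syscont-cgroup-root so they become visible inside the sysbox container.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func copyValue(parent, child, file string) error {
	data, err := os.ReadFile(filepath.Join(parent, file))
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(child, file), data, 0644)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: copy-cfs <docker-container-id>")
		os.Exit(1)
	}
	parent := filepath.Join("/sys/fs/cgroup/cpu,cpuacct/docker", os.Args[1])
	child := filepath.Join(parent, "syscont-cgroup-root")

	// Write the period before the quota so the quota is interpreted against
	// the intended period.
	for _, f := range []string{"cpu.cfs_period_us", "cpu.cfs_quota_us"} {
		if err := copyValue(parent, child, f); err != nil {
			fmt.Fprintln(os.Stderr, "error:", err)
			os.Exit(1)
		}
	}
}
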
ctalledo commented 2 years ago

Hi @johnstcn, thanks for filing the issue and for the excellent description.

The issue has a fairly simple fix, but I am not sure that it's always the "right thing" to do to make the cpu.cfs_[quota|period]_us values visible inside the container.

Without cgroup delegation (i.e., without a cgroup manager inside the container), it makes sense to expose these inside the container since there is a single cgroup manager at host level.

But with cgroup delegation (e.g., with a cgroup manager such as systemd inside the sysbox container), it's less clear because there is a valid argument for making this opaque inside the container so as to fool the cgroup manager into thinking it has full control of the cpu bandwidth (even though it's constrained by the parent cgroup).

Since Sysbox containers are most often used as "system containers", we made the decision to go with the latter approach. But maybe we need a knob to control the behavior.

Curious to hear your thoughts on this?

ctalledo commented 2 years ago

Processes running inside the container have no idea about how much CPU they have to work with.

Out of curiosity, are you aware of any programs that look into the cpu.cfs_[quota|period]_us?

johnstcn commented 2 years ago

I am not sure that it's always the "right thing" to do to make the cpu.cfs_[quota|period]_us values visible inside the container.

In a container environment (e.g. Kubernetes), not respecting CPU limits can cause an application to not respond to liveness probes and be killed just as if it had consumed too much memory (with a different smoking gun in each case, of course). This is especially true on more powerful systems with e.g. 64 or more logical cores -- an application may decide to spin up that many separate threads to do its work without knowing that it's "effectively" constrained to far fewer.
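
To make that concrete, here's a minimal sketch (assuming cgroup v1 and the paths visible inside the container) of how an application might size a worker pool from the CFS settings rather than the logical core count. Under sysbox-runc today the quota reads as -1, so this falls back to all logical cores:

// Sketch: derive an "effective CPU" count from the CFS quota/period visible
// to the container; fall back to the logical core count when no quota is set.
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

func readInt(path string) int64 {
	b, err := os.ReadFile(path)
	if err != nil {
		return -1
	}
	v, err := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
	if err != nil {
		return -1
	}
	return v
}

func effectiveCPUs() int {
	quota := readInt("/sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us")
	period := readInt("/sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us")
	if quota <= 0 || period <= 0 {
		return runtime.NumCPU() // no quota visible: assume all logical cores
	}
	cpus := int((quota + period - 1) / period) // round up
	if cpus < 1 {
		cpus = 1
	}
	return cpus
}

func main() {
	fmt.Println("worker pool size:", effectiveCPUs())
}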

Since Sysbox containers are most often used as "system containers", we made the decision to go with the latter approach. But maybe we need a knob to control the behavior.

Agreed, it does seem that there are some competing use cases here with different behaviours.

Out of curiosity, are you aware of any programs that look into the cpu.cfs_[quota|period]_us?

Yes! The best example I can think of is the OpenJDK JVM (since JDK-8146115, which seems to have been backported to all major JVM versions in use). For example, amazoncorretto:8 (1.8.0_342), openjdk:8 (1.8.0_332), and ibmjava:8 (1.8.0_331) are all aware of the CFS quota and period at the time of writing, which the default runc runtime makes visible:

$ docker run -it --rm --memory=256M --cpu-quota=20000 --cpu-period=10000 --runtime=runc amazoncorretto:8 java -XshowSettings:system -version
Operating System Metrics:
    Provider: cgroupv1
    Effective CPU Count: 2
    CPU Period: 10000us
    CPU Quota: 20000us
    [...]

$ runc --version
runc version 1.1.2
commit: v1.1.2-0-ga916309
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.5.1

It's likely that other JREs support this too, but I haven't tested all of them. There's also more recent work on supporting cgroup v2 (link).

Back in Go-land, there's also an open issue for the Go runtime to automatically set GOMAXPROCS depending on the CFS quota/period. In the interim, there exist libraries such as go.uber.org/automaxprocs to fill this gap.
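
For completeness, using that library is just a blank import: as I understand it, it adjusts GOMAXPROCS at startup from whatever CFS quota/period it can see, and is effectively a no-op when no quota is visible (the sysbox-runc case today):

package main

import (
	"fmt"
	"runtime"

	// The blank import runs automaxprocs' init(), which caps GOMAXPROCS at the
	// CPU limit derived from the cgroup's CFS quota/period.
	_ "go.uber.org/automaxprocs"
)

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}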

ctalledo commented 2 years ago

Thanks @johnstcn for the detailed response.

Not arguing against anything you said, but the thing that still confuses me a bit about applications that rely on the CFS quota/period is that cgroups are hierarchical in nature, and therefore it seems to me an app can't really tell how much CPU it has without looking at the corresponding limits on its parent (and further ancestor) cgroups.

For example, if the parent cgroup limits the app to 25% CPU via CFS quota/period, then it does not matter if the cgroup for the app itself is configured with 200% CPU quota/period; it won't go over the 25% limit imposed by the parent.
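
To put that concretely, on the host (where every ancestor cgroup is visible) the effective limit is the most restrictive quota/period ratio along the path. A rough sketch of that walk, assuming cgroup v1 and an illustrative cgroup path argument:

// Rough host-side sketch (cgroup v1): walk from a cgroup up to the controller
// root and take the most restrictive cpu.cfs_quota_us / cpu.cfs_period_us
// ratio found along the way.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	root := "/sys/fs/cgroup/cpu,cpuacct"
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: effective-cpus <cgroup-path-under-controller-root>")
		os.Exit(1)
	}
	cg := filepath.Join(root, os.Args[1]) // e.g. docker/<id>/syscont-cgroup-root

	limit := -1.0 // -1 means "no limit found"
	for dir := cg; strings.HasPrefix(dir, root); dir = filepath.Dir(dir) {
		quota, qerr := readInt(filepath.Join(dir, "cpu.cfs_quota_us"))
		period, perr := readInt(filepath.Join(dir, "cpu.cfs_period_us"))
		if qerr != nil || perr != nil || quota <= 0 || period <= 0 {
			continue // unlimited (or unreadable) at this level
		}
		if cpus := float64(quota) / float64(period); limit < 0 || cpus < limit {
			limit = cpus
		}
	}
	fmt.Println("effective CPU limit (in CPUs):", limit)
}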

Now when such an app runs in a container, it won't even have access to the cgroups of its parent (and further ancestors), so I am not sure how a clear determination can be made.

Does this make sense or am I missing something basic here?

johnstcn commented 2 years ago

@ctalledo Yep, no arguments from me there either :-) That said, what you said also applies to memory.limit_in_bytes, so the current behaviour is inconsistent either way. I can really see why the kernel folks removed hierarchies in cgroupv2!