opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

[ci] make unittest and make integration broken on local machines #3955

cyphar closed this issue 1 year ago

cyphar commented 1 year ago

Description

It seems that some aspect of the cgroup setup used by the in-container tests is broken for both make integration and make unittest:

not ok 11 runc create (limits + cgrouppath + permission on the cgroup dir) succeeds
# (from function `check_cgroup_value' in file tests/integration/helpers.bash, line 267,
#  in test file tests/integration/cgroups.bats, line 56)
#   `check_cgroup_value "cgroup.controllers" "$(cat /sys/fs/cgroup/cgroup.controllers)"' failed
# runc spec (status=0):
#
# runc run -d --console-socket /tmp/bats-run-Q1ppq1/runc.cCqphr/tty/sock test_cgroups_permissions (status=0):
#
# current cpuset cpu pids !? cpuset cpu io memory hugetlb pids rdma misc
ok 12 runc exec (limits + cgrouppath + permission on the cgroup dir) succeeds
not ok 13 runc exec (cgroup v2 + init process in non-root cgroup) succeeds
# (in test file tests/integration/cgroups.bats, line 86)
#   `[[ ${lines[0]} == *"memory"* ]]' failed
# runc spec (status=0):
#
# runc run -d --console-socket /tmp/bats-run-Q1ppq1/runc.ZGbeOZ/tty/sock test_cgroups_group (status=0):
#
# runc exec test_cgroups_group cat /sys/fs/cgroup/cgroup.controllers (status=0):
# cpuset cpu pids
ok 14 runc run (cgroup v1 + unified resources should fail) # skip test requires cgroups_v1
not ok 15 runc run (blkio weight)
# (in test file tests/integration/cgroups.bats, line 142)
#   `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# runc run -d --console-socket /tmp/bats-run-Q1ppq1/runc.oXmFME/tty/sock test_cgroups_unified (status=1):
# time="2023-08-02T01:35:37Z" level=warning msg="unable to get oom kill count" error="openat2 /sys/fs/cgroup/runc-cgroups-integration-test/test-cgroup-22074/memory.events: no such
file or directory"
# time="2023-08-02T01:35:37Z" level=error msg="runc run failed: unable to start container process: unable to apply cgroup configuration: cannot enter cgroupv2 \"/sys/fs/cgroup/runc
-cgroups-integration-test\" with domain controllers -- it is in an invalid state"
# rmdir: failed to remove '/sys/fs/cgroup//runc-cgroups-integration-test': No such file or directory

(Most of the tests fail.)

Steps to reproduce the issue

  1. make unittest or make integration

Describe the results you received and expected

Tests should succeed on main, as per CI. They fail, as above.

What version of runc are you using?

main

Host OS information

NAME="openSUSE Tumbleweed"
# VERSION="20230731"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20230731"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20230731"
BUG_REPORT_URL="https://bugzilla.opensuse.org"
SUPPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo-Tumbleweed"

Host kernel information

Linux senku 6.3.9-1-default #1 SMP PREEMPT_DYNAMIC Thu Jun 22 03:53:43 UTC 2023 (0df701d) x86_64 x86_64 x86_64 GNU/Linux

kolyshkin commented 1 year ago

This is probably because you're not running docker/podman as root; not all cgroup controllers are available to docker/podman this way.

Something like this (taken from Vagrantfile.fedora) may help:

# Delegate cgroup v2 controllers to rootless user via --systemd-cgroup
mkdir -p /etc/systemd/system/user@.service.d
cat > /etc/systemd/system/user@.service.d/delegate.conf << EOF
[Service]
# default: Delegate=pids memory
# NOTE: delegation of cpuset requires systemd >= 244 (Fedora >= 32, Ubuntu >= 20.04).
Delegate=yes
EOF
systemctl daemon-reload
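
After the daemon-reload and a re-login, one can check whether the delegation took effect; a quick sanity check could look like this (the UID 1000 path is illustrative):

$ systemctl show user@1000.service | grep ^Delegate
# expect Delegate=yes plus the list of delegated controllers
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers
# the delegated controllers should now show up here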

But maybe it's (also?) something else. Will look tomorrow.

cyphar commented 1 year ago

My dockerd is definitely running as root, and we have Delegate=yes in the docker.service setup for openSUSE.

kolyshkin commented 1 year ago

Reproduced locally (very different setup from reporter's -- Fedora, Podman, sudo):

$ sudo make shell
....
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.controllers 
cpuset cpu io memory pids
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +cpu > /sys/fs/cgroup/cgroup.subtree_control
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +io > /sys/fs/cgroup/cgroup.subtree_control
bash: echo: write error: Operation not supported
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +memory > /sys/fs/cgroup/cgroup.subtree_control
bash: echo: write error: Operation not supported
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +pids > /sys/fs/cgroup/cgroup.subtree_control
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu pids

So, inside the container we're not allowed to delegate some controllers. This most probably has to do with what systemd sets in cgroup.subtree_control.

What's more, systemd does not know about some controllers, so it does not delegate them even when Delegate=yes is set. The following is on the host:

$ cat /sys/fs/cgroup/cgroup.controllers 
cpuset cpu io memory hugetlb pids rdma misc
$ cat /sys/fs/cgroup/cgroup.subtree_control 
cpuset cpu io memory hugetlb pids

That is not a problem per se, as long as the dockerd/podman cgroup's cgroup.subtree_control contents are identical to its cgroup.controllers. The way to check it would be to find the dockerd pid, find its cgroup via cat /proc/$PID/cgroup, and then check that cgroup's cgroup.subtree_control.
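
For example, a quick scripted version of that check (assuming a single dockerd process and cgroup v2 mounted at /sys/fs/cgroup):

$ pid=$(pidof dockerd)
$ cg=$(cut -d: -f3 /proc/$pid/cgroup)
$ cat /sys/fs/cgroup$cg/cgroup.controllers
$ cat /sys/fs/cgroup$cg/cgroup.subtree_control
# the two lists should be identical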

For me, I get:

$ pidof podman
1407576

$ cat /proc/1407576/cgroup 
0::/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-2e0ee5be-4af4-41fe-81b8-8a82675e4472.scope

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-2e0ee5be-4af4-41fe-81b8-8a82675e4472.scope/cgroup.controllers 
cpuset cpu io memory pids

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-2e0ee5be-4af4-41fe-81b8-8a82675e4472.scope/cgroup.subtree_control 
cpu

$

Also, in my case, systemctl --user show vte-spawn-2e0ee5be-4af4-41fe-81b8-8a82675e4472.scope shows Delegate=no.

To fix that, I had to add this file:

$ cat /etc/systemd/user/vte-spawn-.scope.d/delegate.conf
[Scope]
Delegate=yes

and do

$ systemctl --user daemon-reload

After that, in a new shell:

[kir@kir-rhat ~]$ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-1ad84904-ba4e-4866-9797-d450995c1aa9.scope
[kir@kir-rhat ~]$ systemctl --user show vte-spawn-1ad84904-ba4e-4866-9797-d450995c1aa9.scope | grep Dele
Delegate=yes
DelegateControllers=cpu cpuacct cpuset io blkio memory devices pids bpf-firewall bpf-devices bpf-foreign bpf-socket-bind bpf-restrict-network-interfaces
[kir@kir-rhat ~]$ cat /sys/fs/cgroup//user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-1ad84904-ba4e-4866-9797-d450995c1aa9.scope/cgroup.controllers 
cpuset cpu io memory pids
[kir@kir-rhat ~]$ cat /sys/fs/cgroup//user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-1ad84904-ba4e-4866-9797-d450995c1aa9.scope/cgroup.subtree_control 
[kir@kir-rhat ~]$ # ^^^ Still empty :(
[kir@kir-rhat ~]$ cat /sys/fs/cgroup//user.slice/user-1000.slice/user@1000.service/app.slice/cgroup.subtree_control 
cpuset cpu io memory pids
[kir@kir-rhat ~]$ # ^^^ Parent one is good though

and I still can't delegate the memory controller for some reason:

[kir@kir-rhat runc]$ sudo make shell
...
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.controllers 
cpuset cpu io memory hugetlb pids
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control  
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# echo +cpu > /sys/fs/cgroup/cgroup.subtree_control 
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# echo +memory > /sys/fs/cgroup/cgroup.subtree_control 
bash: echo: write error: Operation not supported
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# 
cyphar commented 1 year ago

I'm confused though -- in my case the container is being spawned with --privileged by a root daemon configured with Delegate=yes (and runc sets Delegate=yes for container cgroups as well, AFAIK). I don't use rootless docker.

% cat /proc/$(pgrep dockerd)/cgroup
0::/system.slice/docker.service
% systemctl show docker.service | grep Delegate
Delegate=yes
DelegateControllers=cpu cpuacct cpuset io blkio memory devices pids bpf-firewall bpf-devices bpf-foreign bpf-socket-bind bpf-restrict-network-interfaces
% cat /sys/fs/cgroup/system.slice/docker.service/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
% cat /sys/fs/cgroup/system.slice/docker.service/cgroup.subtree_control
%

(That's not a typo -- there is nothing in subtree_control.)

Why does cgroup.subtree_control not include everything? Is this a systemd bug?

The container's scope is similarly configured:

% cat /proc/$pid1/cgroup
0::/system.slice/docker-72f09c7c55f7d9a80baca78f8a08875745ca023246547f2863f4d0722dc3dca6.scope
% sudo systemctl show docker-72f09c7c55f7d9a80baca78f8a08875745ca023246547f2863f4d0722dc3dca6.scope | grep Delegate
Delegate=yes
DelegateControllers=cpu cpuacct cpuset io blkio memory devices pids bpf-firewall bpf-devices bpf-foreign bpf-socket-bind bpf-restrict-network-interfaces
% cat /sys/fs/cgroup/system.slice/docker-72f09c7c55f7d9a80baca78f8a08875745ca023246547f2863f4d0722dc3dca6.scope/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
% cat /sys/fs/cgroup/system.slice/docker-72f09c7c55f7d9a80baca78f8a08875745ca023246547f2863f4d0722dc3dca6.scope/cgroup.subtree_control
%
cyphar commented 1 year ago

and I still can't delegate memory controller for some reason:

This is a cgroupfs restriction: domain controllers (those that cannot operate in threaded mode, such as memory and io) cannot be enabled in a cgroup's subtree_control while that cgroup has processes in it:

static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u16 enable)
{
    u16 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask;

    /* if nothing is getting enabled, nothing to worry about */
    if (!enable)
        return 0;

    /* can @cgrp host any resources? */
    if (!cgroup_is_valid_domain(cgrp->dom_cgrp))
        return -EOPNOTSUPP;

    /* mixables don't care */
    if (cgroup_is_mixable(cgrp))
        return 0;

    if (domain_enable) {
        /* can't enable domain controllers inside a thread subtree */
        if (cgroup_is_thread_root(cgrp) || cgroup_is_threaded(cgrp))
            return -EOPNOTSUPP;
    } else {
        /*
         * Threaded controllers can handle internal competitions
         * and are always allowed inside a (prospective) thread
         * subtree.
         */
        if (cgroup_can_be_thread_root(cgrp) || cgroup_is_threaded(cgrp))
            return 0;
    }

    /*
     * Controllers can't be enabled for a cgroup with tasks to avoid
     * child cgroups competing against tasks.
     */
    if (cgroup_has_tasks(cgrp))
        return -EBUSY;

    return 0;
}

Basically, you can't add domain controllers to the subtree set once the cgroup has processes, except in some special cases.
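
This is easy to reproduce outside of a container, too. A minimal sketch (run as root on a cgroup v2 host; "demo" is a hypothetical cgroup name):

# create a fresh cgroup and move the current shell into it
mkdir /sys/fs/cgroup/demo
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
# memory is a domain controller and the cgroup now has a task,
# so this hits the cgroup_has_tasks() check above (EBUSY):
echo +memory > /sys/fs/cgroup/demo/cgroup.subtree_control
# move the shell back out (the root cgroup is exempt from the rule),
# and the same write succeeds:
echo $$ > /sys/fs/cgroup/cgroup.procs
echo +memory > /sys/fs/cgroup/demo/cgroup.subtree_control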

kolyshkin commented 1 year ago

Basically, you can't add domain controllers to the subtree set once the cgroup has processes, except in some special cases.

Yes, I figured that one out already. The workaround would be to start the container's init process in a sub-cgroup, and then change the top-level cgroup's cgroup.subtree_control.
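
A rough sketch of that workaround (illustrative shell, run as root inside the test container; the "init" sub-cgroup name is hypothetical -- see the kind commit linked below for a real-world version):

# move every process out of the container's root cgroup into a sub-cgroup...
mkdir /sys/fs/cgroup/init
while read -r pid; do
    echo "$pid" > /sys/fs/cgroup/init/cgroup.procs
done < /sys/fs/cgroup/cgroup.procs
# ...so the root cgroup has no tasks and domain controllers can be enabled:
for ctrl in $(cat /sys/fs/cgroup/cgroup.controllers); do
    echo "+$ctrl" > /sys/fs/cgroup/cgroup.subtree_control
done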

I think we should do something like what is done in the kind tool here: https://github.com/kubernetes-sigs/kind/commit/3c9c318eb85e4ce5c94422189ed5b1aa0d9f1e88.

Here's what I ended up with: https://github.com/opencontainers/runc/pull/3960.

kolyshkin commented 1 year ago

As a side note, I think we need to add CI jobs that do make integration and make unittest (currently in CI we only do make localintegration and make localunittest, so we do not test that running the tests in Docker works).

cyphar commented 1 year ago

I think we used to use Docker in CI and then switched it to be local after we split out the test runs into a proper matrix.

kolyshkin commented 1 year ago

OK, https://github.com/opencontainers/runc/pull/3960 is ready and (together with the just-merged #3954) fixes this issue (on my laptop, that is).

kolyshkin commented 1 year ago

I think we used to use Docker in CI and then switched it to be local after we split out the test runs into a proper matrix.

One thing with testing inside Docker is that, unless we can run systemd inside the test container, we can't test systemd-related functionality (the systemd cgroup driver).

Having said that, we can add jobs to CI to make sure make integration unittest works via Docker.

cyphar commented 1 year ago

#3960 fixed the issue. We should make a separate PR to add make integration unittest to CI.