This is probably because you're not running docker/podman as root. Not all cgroup controllers are available for docker/podman this way.
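A quick way to see which controllers a rootless container actually gets (a sketch; assumes podman and cgroup v2):
# Run as the regular (non-root) user; a short list here means some
# controllers were not delegated to your user.
podman run --rm alpine cat /sys/fs/cgroup/cgroup.controllers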
Something like this (taken from Vagrantfile.fedora) may help:
# Delegate cgroup v2 controllers to rootless user via --systemd-cgroup
mkdir -p /etc/systemd/system/user@.service.d
cat > /etc/systemd/system/user@.service.d/delegate.conf << EOF
[Service]
# default: Delegate=pids memory
# NOTE: delegation of cpuset requires systemd >= 244 (Fedora >= 32, Ubuntu >= 20.04).
Delegate=yes
EOF
systemctl daemon-reload
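To check that the drop-in took effect (a sketch, assuming UID 1000):
# Delegate= and DelegateControllers= should now reflect the drop-in:
systemctl show user@1000.service | grep ^Delegate
# The controllers actually delegated to the user's systemd instance:
cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers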
But maybe it's (also?) something else. Will look tomorrow.
My dockerd is definitely running as root, and we have Delegate=yes in the docker.service setup for openSUSE.
Reproduced locally (very different setup from reporter's -- Fedora, Podman, sudo):
$ sudo make shell
....
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +cpu > /sys/fs/cgroup/cgroup.subtree_control
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +io > /sys/fs/cgroup/cgroup.subtree_control
bash: echo: write error: Operation not supported
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +memory > /sys/fs/cgroup/cgroup.subtree_control
bash: echo: write error: Operation not supported
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu
root@38bc99e50653:/go/src/github.com/opencontainers/runc# echo +pids > /sys/fs/cgroup/cgroup.subtree_control
root@38bc99e50653:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu pids
So, in the container we're not allowed to delegate some cgroup controllers. This most probably has to do with what systemd sets in cgroup.subtree_control.
What's more, systemd does not know about some controllers, so it does not enable them even when Delegate=yes is set. The following is on the host:
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
$ cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory hugetlb pids
That is not a problem per se, as long as the dockerd/podman cgroup has cgroup.subtree_control contents identical to its cgroup.controllers. The way to check is to find the dockerd PID, find its cgroup via cat /proc/$PID/cgroup, and then check that cgroup's cgroup.subtree_control.
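A sketch of that check (assumes a single root dockerd and cgroup v2 mounted at /sys/fs/cgroup):
# Find dockerd's cgroup and compare available vs. enabled-for-children
# controllers. On cgroup v2, /proc/$pid/cgroup is a single "0::<path>" line.
pid=$(pidof dockerd)
cg=$(sed -n 's/^0:://p' "/proc/$pid/cgroup")
echo "controllers:     $(cat "/sys/fs/cgroup$cg/cgroup.controllers")"
echo "subtree_control: $(cat "/sys/fs/cgroup$cg/cgroup.subtree_control")"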
For me, I get:
$ pidof podman
1407576
$ cat /proc/1407576/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-2e0ee5be-4af4-41fe-81b8-8a82675e4472.scope
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-2e0ee5be-4af4-41fe-81b8-8a82675e4472.scope/cgroup.controllers
cpuset cpu io memory pids
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-2e0ee5be-4af4-41fe-81b8-8a82675e4472.scope/cgroup.subtree_control
cpu
$
Also, in my case, systemctl --user show vte-spawn-2e0ee5be-4af4-41fe-81b8-8a82675e4472.scope shows Delegate=no.
To fix that, I had to add this file:
$ cat /etc/systemd/user/vte-spawn-.scope.d/delegate.conf
[Scope]
Delegate=yes
and run:
$ systemctl --user daemon-reload
After that, in a new shell:
[kir@kir-rhat ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-1ad84904-ba4e-4866-9797-d450995c1aa9.scope
[kir@kir-rhat ~]$ systemctl --user show vte-spawn-1ad84904-ba4e-4866-9797-d450995c1aa9.scope | grep Dele
Delegate=yes
DelegateControllers=cpu cpuacct cpuset io blkio memory devices pids bpf-firewall bpf-devices bpf-foreign bpf-socket-bind bpf-restrict-network-interfaces
[kir@kir-rhat ~]$ cat /sys/fs/cgroup//user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-1ad84904-ba4e-4866-9797-d450995c1aa9.scope/cgroup.controllers
cpuset cpu io memory pids
[kir@kir-rhat ~]$ cat /sys/fs/cgroup//user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-1ad84904-ba4e-4866-9797-d450995c1aa9.scope/cgroup.subtree_control
[kir@kir-rhat ~]$ # ^^^ Still empty :(
[kir@kir-rhat ~]$ cat /sys/fs/cgroup//user.slice/user-1000.slice/user@1000.service/app.slice/cgroup.subtree_control
cpuset cpu io memory pids
[kir@kir-rhat ~]$ # ^^^ Parent one is good though
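(Tangentially, a loop like this sketch can show where in the hierarchy the delegation stops; it assumes pure cgroup v2, i.e. a single "0::" line in /proc/self/cgroup:)
# Walk from our own cgroup up to the root, printing subtree_control
# at every level.
cg=$(sed -n 's/^0:://p' /proc/self/cgroup)
while [ -n "$cg" ]; do
    echo "$cg: $(cat "/sys/fs/cgroup$cg/cgroup.subtree_control")"
    cg=${cg%/*}
done
echo "/: $(cat /sys/fs/cgroup/cgroup.subtree_control)"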
and I still can't delegate the memory controller for some reason:
[kir@kir-rhat runc]$ sudo make shell
...
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# cat /sys/fs/cgroup/cgroup.subtree_control
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# echo +cpu > /sys/fs/cgroup/cgroup.subtree_control
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc# echo +memory > /sys/fs/cgroup/cgroup.subtree_control
bash: echo: write error: Operation not supported
root@a2ab8418acb4:/go/src/github.com/opencontainers/runc#
I'm confused though -- in my case the container is being spawned with --privileged by a root daemon configured with Delegate=yes (and runc sets Delegate=yes for container cgroups as well, AFAIK). I don't use rootless Docker.
% cat /proc/$(pgrep dockerd)/cgroup
0::/system.slice/docker.service
% systemctl show docker.service | grep Delegate
Delegate=yes
DelegateControllers=cpu cpuacct cpuset io blkio memory devices pids bpf-firewall bpf-devices bpf-foreign bpf-socket-bind bpf-restrict-network-interfaces
% cat /sys/fs/cgroup/system.slice/docker.service/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
% cat /sys/fs/cgroup/system.slice/docker.service/cgroup.subtree_control
%
(That's not a typo -- there is nothing in subtree_control.)
Why does cgroup.subtree_control not include everything? Is this a systemd bug?
The container's scope is similarly configured:
% cat /proc/$pid1/cgroup
0::/system.slice/docker-72f09c7c55f7d9a80baca78f8a08875745ca023246547f2863f4d0722dc3dca6.scope
% sudo systemctl show docker-72f09c7c55f7d9a80baca78f8a08875745ca023246547f2863f4d0722dc3dca6.scope | grep Delegate
Delegate=yes
DelegateControllers=cpu cpuacct cpuset io blkio memory devices pids bpf-firewall bpf-devices bpf-foreign bpf-socket-bind bpf-restrict-network-interfaces
% cat /sys/fs/cgroup/system.slice/docker-72f09c7c55f7d9a80baca78f8a08875745ca023246547f2863f4d0722dc3dca6.scope/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
% cat /sys/fs/cgroup/system.slice/docker-72f09c7c55f7d9a80baca78f8a08875745ca023246547f2863f4d0722dc3dca6.scope/cgroup.subtree_control
%
and I still can't delegate the memory controller for some reason:
This is a cgroupfs restriction: controllers that cannot operate in threaded mode cannot be enabled in a cgroup's subtree while there are processes in that cgroup:
static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u16 enable)
{
	u16 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask;

	/* if nothing is getting enabled, nothing to worry about */
	if (!enable)
		return 0;

	/* can @cgrp host any resources? */
	if (!cgroup_is_valid_domain(cgrp->dom_cgrp))
		return -EOPNOTSUPP;

	/* mixables don't care */
	if (cgroup_is_mixable(cgrp))
		return 0;

	if (domain_enable) {
		/* can't enable domain controllers inside a thread subtree */
		if (cgroup_is_thread_root(cgrp) || cgroup_is_threaded(cgrp))
			return -EOPNOTSUPP;
	} else {
		/*
		 * Threaded controllers can handle internal competitions
		 * and are always allowed inside a (prospective) thread
		 * subtree.
		 */
		if (cgroup_can_be_thread_root(cgrp) || cgroup_is_threaded(cgrp))
			return 0;
	}

	/*
	 * Controllers can't be enabled for a cgroup with tasks to avoid
	 * child cgroups competing against tasks.
	 */
	if (cgroup_has_tasks(cgrp))
		return -EBUSY;

	return 0;
}
Basically, you can't add to the subtree set once the cgroup has processes except in some special cases.
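The EBUSY case is easy to reproduce (a sketch, as root on a cgroup v2 host; "demo" is a hypothetical test cgroup):
# Assumes memory appears in /sys/fs/cgroup/cgroup.subtree_control, so it
# shows up in demo's cgroup.controllers.
mkdir /sys/fs/cgroup/demo
echo $$ > /sys/fs/cgroup/demo/cgroup.procs   # demo now has a process
echo +memory > /sys/fs/cgroup/demo/cgroup.subtree_control   # write error: EBUSY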
Basically, you can't add to the subtree set once the cgroup has processes except in some special cases.
Yes, figured that one out already. The workaround would be to start the container's init process in a sub-cgroup, and then change the top-level cgroup's cgroup.subtree_control.
I think we should do something like what is done in the kind tool here: https://github.com/kubernetes-sigs/kind/commit/3c9c318eb85e4ce5c94422189ed5b1aa0d9f1e88.
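If I read that commit right, the trick boils down to roughly this (a sketch, run as root inside the container before anything else populates the cgroup):
# Move every process out of the container's root cgroup into an "init"
# sub-cgroup, so the root cgroup no longer has internal processes...
mkdir -p /sys/fs/cgroup/init
xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
# ...then enable all available controllers for the subtree
# (turns "a b c" into "+a +b +c").
sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers \
    > /sys/fs/cgroup/cgroup.subtree_control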
Here's what I ended up with: https://github.com/opencontainers/runc/pull/3960.
As a side note, I think we need to add CI jobs that do make integration and make unittest (currently in CI we only do make localintegration and make localunittest, so we do not test that test-in-docker works).
I think we used to use Docker in CI and then switched it to be local after we split out the test runs into a proper matrix.
OK, https://github.com/opencontainers/runc/pull/3960 is ready and (together with the just-merged #3954) fixes this issue (on my laptop, that is).
I think we used to use Docker in CI and then switched it to be local after we split out the test runs into a proper matrix.
One thing with testing inside Docker is that, unless we can run systemd inside the testing container, we do not (and can't) test systemd-related functionality (the systemd cgroup driver).
Having said that, we can add jobs to CI to make sure make integration unittest works via Docker.
Description
It seems that some aspect of the cgroup setup for integration tests was broken for make integration and make unittest. (Most of the tests fail.)
Steps to reproduce the issue
make unittest or make integration
Describe the results you received and expected
Tests should succeed on main, as per CI. They fail, as above.
What version of runc are you using?
main
Host OS information
Host kernel information
Linux senku 6.3.9-1-default #1 SMP PREEMPT_DYNAMIC Thu Jun 22 03:53:43 UTC 2023 (0df701d) x86_64 x86_64 x86_64 GNU/Linux