yifan-gu opened this issue 8 years ago
Interesting that `/sys/fs/cgroup/systemd` gets cleaned up but not the other subsystems.
I did an strace and noticed a series of errors like:

```
rmdir("/sys/fs/cgroup/cpu/machine.slice/machine-rkt$UID.scope/system.slice/opt-stage2-busybox-rootfs-sys-fs-cgroup-freezer.mount") = -1 EROFS (Read-only file system)
```

I'm not certain it's related, but it seems quite plausible. Manually re-running the rmdirs that errored out, after the fact, does succeed.
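For example, re-running one of the failed calls by hand once the pod has exited works:

```
# $UID below is a placeholder for the actual pod UUID, as in the strace output
$ sudo rmdir /sys/fs/cgroup/cpu/machine.slice/machine-rkt$UID.scope/system.slice/opt-stage2-busybox-rootfs-sys-fs-cgroup-freezer.mount
$ echo $?
0
```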
We should check if the same behavior happens with systemd-nspawn (without rkt). The "cpu" subsystem is mounted read-only by systemd-nspawn.
It looks similar to: https://bugs.freedesktop.org/show_bug.cgi?id=68370
It doesn't appear to leak with just nspawn. I have an Alpine Linux rootfs and did the following:

```
# Terminal 1
$ lscgroup | wc -l
488

# Terminal 2
$ sudo systemd-nspawn
tmp-alpine:~#

# Terminal 1
$ lscgroup | wc -l
493

# Terminal 1, after killing the Alpine container
$ lscgroup | wc -l
488
```
To clean up empty cgroup directories, systemd sets the knobs `notify_on_release` to 1 and `release_agent` to `/usr/lib/systemd/systemd-cgroups-agent` in `/sys/fs/cgroup/systemd`. With these knobs set, when the last task of a cgroup leaves and the last child cgroup of that cgroup is removed, the binary specified in `release_agent` gets called with the pathname of the abandoned cgroup as a parameter (see cgroups.txt). Then, `systemd-cgroups-agent` deals with removing the empty cgroups.
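For illustration, those knobs can be inspected directly (values as described above; paths assume a cgroup v1 layout):

```
$ cat /sys/fs/cgroup/systemd/notify_on_release
1
$ cat /sys/fs/cgroup/systemd/release_agent
/usr/lib/systemd/systemd-cgroups-agent
```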
This is only done for the `systemd` cgroup hierarchy. systemd-nspawn doesn't touch anything under the container cgroup subtree for other hierarchies, so that's enough. However, rkt does create cgroups under the container cgroup subtree, and those never get removed because `notify_on_release` is set to 0 and `release_agent` is empty for the hierarchies that are not `systemd`.
While `notify_on_release` appears in every cgroup directory, `release_agent` is only present in the root of the hierarchy, so setting that to some sort of `rkt-release-agent` seems too invasive.
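A quick check makes the asymmetry visible (using the cpu hierarchy as an example):

```
$ ls /sys/fs/cgroup/cpu/release_agent
/sys/fs/cgroup/cpu/release_agent
$ ls /sys/fs/cgroup/cpu/machine.slice/release_agent
ls: cannot access '/sys/fs/cgroup/cpu/machine.slice/release_agent': No such file or directory
$ ls /sys/fs/cgroup/cpu/machine.slice/notify_on_release
/sys/fs/cgroup/cpu/machine.slice/notify_on_release
```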
Is cleaning up the cgroups something we really need?
I read a bit more; from cgroups.txt:

> While remounting cgroups is currently supported, it is not recommended to use it. Remounting allows changing bound subsystems and release_agent. Rebinding is hardly useful as it only works when the hierarchy is empty and release_agent itself should be replaced with conventional fsnotify. The support for remounting will be removed in the future.

To specify a hierarchy's release_agent:

```
mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
    xxx /sys/fs/cgroup/rg1
```
So I thought we could mount the cgroups inside the pod (or on the host in rkt's mount namespace) with a rkt release agent so it doesn't affect the rest of the system, but that doesn't seem to work. Also, "it's not recommended".
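What I tried looks roughly like this (a sketch only; `rkt-release-agent` is hypothetical, and per the quote above remounting to change release_agent is discouraged):

```
# inside the pod's mount namespace, for each non-systemd hierarchy:
mount -t cgroup -o remount,cpu,release_agent=/usr/bin/rkt-release-agent \
    cgroup /sys/fs/cgroup/cpu
echo 1 > /sys/fs/cgroup/cpu/machine.slice/machine-rkt$UID.scope/notify_on_release
```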
@iaguis Thanks for the explanation :+1:
Then I think we should at least remove those cgroups on `rkt gc` and `rkt rm`.
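A minimal sketch of what that cleanup could look like (shell for illustration; the scope glob is assumed from the paths seen earlier, and rkt would do the equivalent in its GC code):

```
# remove the pod's leftover cgroup directories in every hierarchy,
# deepest first; rmdir only succeeds on empty cgroups, so errors are ignored
for scope in /sys/fs/cgroup/*/machine.slice/machine-rkt*.scope; do
    find "$scope" -depth -type d -exec rmdir {} \; 2>/dev/null
done
```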
To answer the question:

> Is cleaning up the cgroups something we really need?

Yes, for cadvisor: cadvisor sees them, and with @philips' idea (https://github.com/google/cadvisor/issues/1255#issuecomment-216694222) we will see the cgroup and wait, thinking it's a rkt pod.
Yes, it's definitely a problem for cadvisor.
Imagine we have a service that runs intermittently. If it's not running when cadvisor first sees it, cadvisor will just use the raw cgroup handler (along with delaying startup), so even when the service comes up it won't provide all the information one expects. And even if it starts correctly with the rkt handler, when the service ends it will still be there as a rkt handler and hence will also be providing wrong information.
@iaguis is there a good reason that a release agent is not set for every hierarchy?
Or, to rephrase: what's the problem if we set it? Everything has notify_on_release set to 0, so nothing would change for existing cgroups; it would only change for our stuff.
> Is cleaning up the cgroups something we really need?
>
> Yes, for cadvisor: cadvisor sees them, and with @philips' idea (https://github.com/google/cadvisor/issues/1255#issuecomment-216694222) we will see the cgroup and wait, thinking it's a rkt pod.
Since @philips' idea already requires fetching the unit properties over D-Bus, cadvisor could also get and monitor the unit status via D-Bus, and if the unit is stopped, then it knows it does not need to wait for the pid to appear. Would that work?
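For the "get" part, something like this works from the shell (`mypod.service` is a made-up unit name; cadvisor would use the equivalent systemd D-Bus API from Go):

```
# "." in the unit name is escaped as _2e in the D-Bus object path
$ busctl get-property org.freedesktop.systemd1 \
    /org/freedesktop/systemd1/unit/mypod_2eservice \
    org.freedesktop.systemd1.Unit ActiveState
s "inactive"
```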
We'd never know if it restarted, as the inotify-based system wouldn't help (working on an alternate system, but it's a bit of surgery).
> We'd never know if it restarted, as the inotify-based system wouldn't help.
cadvisor would need to register for D-Bus notifications for when the unit is stopped and started.
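From the shell, that subscription looks roughly like this (a sketch; cadvisor would do the equivalent programmatically):

```
# ask systemd to start emitting change signals, then watch for them
$ busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager Subscribe
$ busctl --match "type='signal',interface='org.freedesktop.DBus.Properties',member='PropertiesChanged'" monitor
```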
> cadvisor would need to register for D-Bus notifications for when the unit is stopped and started.

That seems like more pain than it's worth. It'd seemingly be easier to just have a way to get notifications from the rkt API service about all the running pods.
So yeah, I verified that if I restart a systemd unit after cadvisor has started, cadvisor doesn't see it. Then again, it never deleted it, so it works, but not correctly (i.e. it's not looking at the right pid for networking stats or the right filesystem for disk stats). Honestly, it feels like a bug to not clean up cgroups right away.
FWIW, this is causing kubelet/cadvisor to reach very high CPU usage (~full cores) over time, because cadvisor keeps listing all those leftover cgroups.
We should configure cadvisor to use the dockerOnly flag (or, hopefully soon, runtimeOnly).
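For reference, that's cadvisor's existing docker_only switch (the runtimeOnly variant doesn't exist yet):

```
$ cadvisor --docker_only=true
```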
Capturing discussion from sync: initial action is to ensure we clean up these cgroups on `rkt gc` and/or `rkt rm`.
Partially fixed by #2655. We'll leave this open for a possible fix without having to wait for GC.
Environment
rkt Version: 1.4.0+gitb7589f9
appc Version: 0.7.4
Go Version: go1.5.3
Go OS/Arch: linux/amd64
Features: +TPM
Distro: CoreOS, Ubuntu, Gentoo
What did you do?
Ran rkt pods both via a systemd service and from the command line:
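(For illustration, a typical invocation; the actual commands aren't shown above, and busybox is inferred from the strace path earlier in the thread:)

```
# as a transient systemd unit
$ sudo systemd-run rkt run --insecure-options=image docker://busybox
# directly from the command line
$ sudo rkt run --insecure-options=image docker://busybox
```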
What did you expect to see?
Cgroups under all controllers should be removed after the pod exits.
What did you see instead?
The cgroups are not removed after the pod exits. Also, if I run rkt from the command line, the cgroups are not cleaned up either:
But on Ubuntu (with systemd 216 on the host), running from the command line works fine: the cgroups get cleaned up after the pods exit.