cgroups under some controllers are not cleaned up after rkt pods exit

yifan-gu commented 8 years ago

Environment

rkt Version: 1.4.0+gitb7589f9 appc Version: 0.7.4 Go Version: go1.5.3 Go OS/Arch: linux/amd64 Features: +TPM Distro: CoreOS, Ubuntu, Gentoo

What did you do?

Run rkt pods via systemd service and cmd line:

$ cat /run/systemd/system/test-rkt.service
[Service]
ExecStart=/tmp/rkt --insecure-options=image,ondisk run docker://busybox

$ sudo systemctl start test-rkt.service

$ sudo systemctl status test-rkt.service
$ sudo systemctl status test-rkt
● test-rkt.service
   Loaded: loaded (/run/systemd/system/test-rkt.service; static; vendor preset: enabled)
   Active: inactive (dead)

Apr 25 16:48:28 yifan-coreos rkt[16101]: networking: loading networks from /etc/rkt/net.d
Apr 25 16:48:28 yifan-coreos rkt[16101]: networking: loading network default with type ptp
Apr 25 16:48:29 yifan-coreos systemd[1]: Stopped test-rkt.service.
Apr 25 17:04:51 yifan-coreos systemd[1]: Started test-rkt.service.
Apr 25 17:04:51 yifan-coreos systemd[1]: Starting test-rkt.service...
Apr 25 17:04:51 yifan-coreos rkt[16794]: image: using image from local store for image name coreos.com/rkt/stage1-coreos:1.3.0+gitb7589f9
Apr 25 17:04:51 yifan-coreos rkt[16794]: image: using image from local store for url docker://busybox
Apr 25 17:04:52 yifan-coreos rkt[16794]: networking: loading networks from /etc/rkt/net.d
Apr 25 17:04:52 yifan-coreos rkt[16794]: networking: loading network default with type ptp
Apr 25 17:04:52 yifan-coreos systemd[1]: Stopped test-rkt.service.

What did you expect to see?

We should see cgroups under all controllers are removed after the pod exits

What did you see instead?

$ find -name test-rkt.service
./freezer/system.slice/test-rkt.service
./hugetlb/system.slice/test-rkt.service
./blkio/system.slice/test-rkt.service
./memory/system.slice/test-rkt.service
./cpu,cpuacct/system.slice/test-rkt.service
./net_cls,net_prio/system.slice/test-rkt.service
./perf_event/system.slice/test-rkt.service
./cpuset/system.slice/test-rkt.service

Besides, if I run the rkt using cmd line, the cgroups are not cleaned up as well:

$ rkt --insecure-options=image,ondisk run docker://busybox --exec=/bin/echo -- hello
image: using image from file /home/yifan/bin/stage1-coreos.aci
image: using image from local store for url docker://busybox
networking: loading networks from /etc/rkt/net.d
networking: loading network default with type ptp
[20286.946506] echo[4]: hello
$ ls /sys/fs/cgroup/perf_event/machine.slice/ $ ls -l /sys/fs/cgroup/perf_event/machine.slice/
total 0
-rw-r--r-- 1 root root 0 Apr 26 00:39 cgroup.clone_children
-rw-r--r-- 1 root root 0 Apr 26 00:39 cgroup.procs
drwxr-xr-x 3 root root 0 Apr 26 00:45 machine-rkt\x2d21be894b\x2d5b2c\x2d4072\x2dbd7a\x2d8810107c5ade.scope
-rw-r--r-- 1 root root 0 Apr 26 00:39 notify_on_release
-rw-r--r-- 1 root root 0 Apr 26 00:39 tasks

But, on ubuntu (with systemd 216 on host), running through cmd line works fine. The cgroups get cleaned up after the pods exit.

alban commented 8 years ago

Interesting that /sys/fs/cgroup/systemd gets cleaned up but not the other subsystems.

euank commented 8 years ago

I did an strace and noticed a series of rmdir("/sys/fs/cgroup/cpu/machine.slice/machine-rkt$UID.scope/system.slice/opt-stage2-busybox-rootfs-sys-fs-cgroup-freezer.mount") = -1 EROFS (Read-only file system) errors. I'm not certain it's related, but it seems quite plausible it is.

Manually running the rmdirs that errored out as such after the fact does succeed.

alban commented 8 years ago

We should check if the same behavior happen with systemd-nspawn (without rkt). The "cpu" subsystem is mounted read-only by systemd-nspawn.

euank commented 8 years ago

It doesn't appear to leak with just nspawn. I have an alpine linux rootfs and did the following ...

$ lscgroup | wc -l                               
488
# Terminal 2
$ sudo systemd-nspawn          
tmp-alpine:~# 
# Terminal 1
$ lscgroup | wc -l
493
# Kill alpine
$ lscgroup | wc -l
488

iaguis commented 8 years ago

To clean up empty cgroup directories, systemd sets the knobs notify_on_release to 1 and release_agent to /usr/lib/systemd/systemd-cgroups-agent in /sys/fs/cgroup/systemd. With these knobs set, when the last task of a cgroup leaves and the last child cgroup of that cgroup is removed, the binary specified in release_agent gets called with the pathname of the abandoned cgroup as a parameter (see cgroups.txt). Then, systemd-cgroups--agent deals with removing the empty cgroups.

This is only done for the systemd cgroup hierarchy, systemd-nspawn doesn't touch anything under the container cgroup subtree for other hierarchies so that's enough.

However, rkt does create cgroups under the container cgroup subtree which never get removed because notify_on_release is set to 0 and release_agent is empty for the hierarchies that are not systemd.

While notify_on_release appears in every cgroup directory, release_agent is only present in the root of the hierarchy, so setting that to some sort of rkt-release-agent seems too invasive.

Is cleaning up the cgroups something we really need?

iaguis commented 8 years ago

I read a bit more and

While remounting cgroups is currently supported, it is not recommend to use it. Remounting allows changing bound subsystems and release_agent. Rebinding is hardly useful as it only works when the hierarchy is empty and release_agent itself should be replaced with conventional fsnotify. The support for remounting will be removed in the future. To Specify a hierarchy's release_agent:

mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \

xxx /sys/fs/cgroup/rg1

So I thought we could mount the cgroups inside the pod (or on the host in rkt's mount namespace) with a rkt release agent so it doesn't affect the rest of the system but it doesn't seem to work. Also, "it's not recommended".

yifan-gu commented 8 years ago

@iaguis Thanks for the explanation :+1:
Then I think we should at least remove those cgroups when rkt gc and rkt rm