rkt / rkt

[Project ended] rkt is a pod-native container engine for Linux. It is composable, secure, and built on standards.

cgroups under some controllers are not cleaned up after rkt pods exit #2504

Open yifan-gu opened 8 years ago

yifan-gu commented 8 years ago

Environment

rkt Version: 1.4.0+gitb7589f9
appc Version: 0.7.4
Go Version: go1.5.3
Go OS/Arch: linux/amd64
Features: +TPM
Distro: CoreOS, Ubuntu, Gentoo

What did you do?

Run rkt pods via a systemd service and from the command line:

$ cat /run/systemd/system/test-rkt.service
[Service]
ExecStart=/tmp/rkt --insecure-options=image,ondisk run docker://busybox

$ sudo systemctl start test-rkt.service

$ sudo systemctl status test-rkt.service
● test-rkt.service
   Loaded: loaded (/run/systemd/system/test-rkt.service; static; vendor preset: enabled)
   Active: inactive (dead)

Apr 25 16:48:28 yifan-coreos rkt[16101]: networking: loading networks from /etc/rkt/net.d
Apr 25 16:48:28 yifan-coreos rkt[16101]: networking: loading network default with type ptp
Apr 25 16:48:29 yifan-coreos systemd[1]: Stopped test-rkt.service.
Apr 25 17:04:51 yifan-coreos systemd[1]: Started test-rkt.service.
Apr 25 17:04:51 yifan-coreos systemd[1]: Starting test-rkt.service...
Apr 25 17:04:51 yifan-coreos rkt[16794]: image: using image from local store for image name coreos.com/rkt/stage1-coreos:1.3.0+gitb7589f9
Apr 25 17:04:51 yifan-coreos rkt[16794]: image: using image from local store for url docker://busybox
Apr 25 17:04:52 yifan-coreos rkt[16794]: networking: loading networks from /etc/rkt/net.d
Apr 25 17:04:52 yifan-coreos rkt[16794]: networking: loading network default with type ptp
Apr 25 17:04:52 yifan-coreos systemd[1]: Stopped test-rkt.service.

What did you expect to see?

Cgroups under all controllers should be removed after the pod exits.

What did you see instead?

$ find -name test-rkt.service
./freezer/system.slice/test-rkt.service
./hugetlb/system.slice/test-rkt.service
./blkio/system.slice/test-rkt.service
./memory/system.slice/test-rkt.service
./cpu,cpuacct/system.slice/test-rkt.service
./net_cls,net_prio/system.slice/test-rkt.service
./perf_event/system.slice/test-rkt.service
./cpuset/system.slice/test-rkt.service

Besides, if I run rkt from the command line, the cgroups are not cleaned up either:

$ rkt --insecure-options=image,ondisk run docker://busybox --exec=/bin/echo -- hello
image: using image from file /home/yifan/bin/stage1-coreos.aci
image: using image from local store for url docker://busybox
networking: loading networks from /etc/rkt/net.d
networking: loading network default with type ptp
[20286.946506] echo[4]: hello
$ ls -l /sys/fs/cgroup/perf_event/machine.slice/
total 0
-rw-r--r-- 1 root root 0 Apr 26 00:39 cgroup.clone_children
-rw-r--r-- 1 root root 0 Apr 26 00:39 cgroup.procs
drwxr-xr-x 3 root root 0 Apr 26 00:45 machine-rkt\x2d21be894b\x2d5b2c\x2d4072\x2dbd7a\x2d8810107c5ade.scope
-rw-r--r-- 1 root root 0 Apr 26 00:39 notify_on_release
-rw-r--r-- 1 root root 0 Apr 26 00:39 tasks

But on Ubuntu (with systemd 216 on the host), running through the command line works fine: the cgroups get cleaned up after the pods exit.

alban commented 8 years ago

Interesting that /sys/fs/cgroup/systemd gets cleaned up but not the other subsystems.
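A quick way to see that asymmetry, using the unit from the report above (output will vary, of course):

$ find /sys/fs/cgroup/systemd -name test-rkt.service   # already cleaned up
$ find /sys/fs/cgroup/memory -name test-rkt.service    # still present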

euank commented 8 years ago

I did an strace and noticed a series of rmdir("/sys/fs/cgroup/cpu/machine.slice/machine-rkt$UID.scope/system.slice/opt-stage2-busybox-rootfs-sys-fs-cgroup-freezer.mount") = -1 EROFS (Read-only file system) errors. I'm not certain it's related, but it seems quite plausible it is.

Manually running the rmdirs that errored out, after the fact, does succeed.
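For reference, that manual cleanup is just removing the leftover directories bottom-up, e.g. (same path pattern as the strace above, with $UID standing in for the pod's escaped UUID):

$ sudo find /sys/fs/cgroup/cpu/machine.slice/machine-rkt$UID.scope -depth -type d -exec rmdir {} \;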

alban commented 8 years ago

We should check if the same behavior happens with systemd-nspawn (without rkt). The "cpu" subsystem is mounted read-only by systemd-nspawn.

It looks similar to: https://bugs.freedesktop.org/show_bug.cgi?id=68370

euank commented 8 years ago

It doesn't appear to leak with just nspawn. I have an Alpine Linux rootfs and did the following ...

$ lscgroup | wc -l                               
488
# Terminal 2
$ sudo systemd-nspawn          
tmp-alpine:~# 
# Terminal 1
$ lscgroup | wc -l
493
# Kill alpine
$ lscgroup | wc -l
488
iaguis commented 8 years ago

To clean up empty cgroup directories, systemd sets the knobs notify_on_release to 1 and release_agent to /usr/lib/systemd/systemd-cgroups-agent in /sys/fs/cgroup/systemd. With these knobs set, when the last task of a cgroup leaves and the last child cgroup of that cgroup is removed, the binary specified in release_agent gets called with the pathname of the abandoned cgroup as a parameter (see cgroups.txt). Then, systemd-cgroups-agent deals with removing the empty cgroups.

This is only done for the systemd cgroup hierarchy; since systemd-nspawn doesn't touch anything under the container cgroup subtree for the other hierarchies, that's enough.

However, rkt does create cgroups under the container cgroup subtree, and they never get removed because notify_on_release is set to 0 and release_agent is empty for the hierarchies that are not systemd.

While notify_on_release appears in every cgroup directory, release_agent is only present in the root of the hierarchy, so setting that to some sort of rkt-release-agent seems too invasive.
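For reference, these knobs are plain files and easy to inspect; the comments just restate the above, and actual contents depend on the host:

$ cat /sys/fs/cgroup/systemd/release_agent                    # systemd's cgroups agent
$ cat /sys/fs/cgroup/memory/release_agent                     # empty: no agent for non-systemd hierarchies
$ cat /sys/fs/cgroup/memory/machine.slice/notify_on_release   # 0: nothing reaps children here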

Is cleaning up the cgroups something we really need?

iaguis commented 8 years ago

I read a bit more, and cgroups.txt says:

While remounting cgroups is currently supported, it is not recommend to use it. Remounting allows changing bound subsystems and release_agent. Rebinding is hardly useful as it only works when the hierarchy is empty and release_agent itself should be replaced with conventional fsnotify. The support for remounting will be removed in the future. To Specify a hierarchy's release_agent:

mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
    xxx /sys/fs/cgroup/rg1

So I thought we could mount the cgroups inside the pod (or on the host in rkt's mount namespace) with a rkt release agent so it doesn't affect the rest of the system, but it doesn't seem to work. Also, "it's not recommended".
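For concreteness, the attempt looked roughly like this from inside the pod's mount namespace (sketch only; the rkt-release-agent path is hypothetical):

$ sudo mount -t cgroup -o remount,memory,release_agent=/usr/lib/rkt/rkt-release-agent cgroup /sys/fs/cgroup/memory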

yifan-gu commented 8 years ago

@iaguis Thanks for the explanation :+1:
Then I think we should at least remove those cgroups on rkt gc and rkt rm.
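As a rough sketch of what that cleanup amounts to for a single exited pod ($scope below is illustrative and stands for the pod's escaped scope directory name, e.g. the machine-rkt\x2d...\x2d8810107c5ade.scope seen in the ls output above):

# remove the pod's leftover cgroup subtree in every hierarchy, deepest directories first
for d in /sys/fs/cgroup/*/machine.slice/"$scope"; do
    sudo find "$d" -depth -type d -exec rmdir {} \;
done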

sjpotter commented 8 years ago

To answer the question:

Is cleaning up the cgroups something we really need?

Yes, for cadvisor: cadvisor sees them, and with @philips' idea https://github.com/google/cadvisor/issues/1255#issuecomment-216694222 we will see the cgroup and wait, thinking it's a rkt pod.

sjpotter commented 8 years ago

Yes, it's definitely a problem for cadvisor.

Imagine we have a service that runs intermittently. If it's not running when cadvisor sees it originally, cadvisor will just use the raw cgroup handler (along with delaying startup), so even when the service comes up it won't provide all the information one expects. Also, even if it starts correctly as a rkt handler, when the service ends it will still be there as a rkt handler and hence will also be providing wrong information.

sjpotter commented 8 years ago

@iaguis is there a good reason that a release-agent is not set for every hierarchy?

Or to rephrase: what's the problem if we set it? Everything has notify_on_release set to 0, so nothing will change for them; it would only change for our stuff.

alban commented 8 years ago

Is cleaning up the cgroups something we really need?

Yes, for cadvisor: cadvisor sees them, and with @philips' idea https://github.com/google/cadvisor/issues/1255#issuecomment-216694222 we will see the cgroup and wait, thinking it's a rkt pod.

Since @philips' idea already requires fetching the unit properties over D-Bus, cadvisor could also get & monitor the unit status via D-Bus, and if the unit is stopped, then it knows it does not need to wait for the pid to appear. Would that work?
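For illustration, the unit state is a single property read away on the bus (the unit name is the one from this issue; the object path uses systemd's usual escaping):

$ busctl get-property org.freedesktop.systemd1 \
    /org/freedesktop/systemd1/unit/test_2drkt_2eservice \
    org.freedesktop.systemd1.Unit ActiveState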

sjpotter commented 8 years ago

We'd never know if it restarted, as the inotify-based system wouldn't help (working on an alternate system, but it's a bit of surgery).

alban commented 8 years ago

We'd never know if it restarted, as the inotify-based system wouldn't help

cadvisor would need to register for D-Bus notifications when the unit is stopped & started.

sjpotter commented 8 years ago

cadvisor would need to register for D-Bus notifications when the unit is stopped & started.

That seems like more pain than it's worth. It seems it would be easier to just have a way to get notifications from the rkt API service about all the running pods.

sjpotter commented 8 years ago

So yeah, I verified that if I restart a systemd unit after cadvisor has started, cadvisor doesn't see it. Then again, it never deleted it, so it works, but not correctly (i.e. not looking at the right pid for networking stats or the right filesystem for disk stats). Honestly it feels like a bug to not clean up cgroups right away.

yifan-gu commented 8 years ago

FWIW, this is causing kubelet/cadvisor to reach very high CPU usage (~full cores) over time because cadvisor keeps listing all those remaining cgroups.

sjpotter commented 8 years ago

We should configure cadvisor to use the dockerOnly flag (or, soon hopefully, runtimeOnly).

jonboulle commented 8 years ago

Capturing discussion from the sync: the initial action is to ensure we clean up these cgroups on rkt gc and/or rkt rm.

tmrts commented 8 years ago

/subscribe

iaguis commented 8 years ago

Partially fixed by #2655. We'll leave this open for a possible fix without having to wait for GC.
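In the meantime, once #2655 is in, forcing a GC run cleans up the leftovers right away, e.g.:

$ sudo rkt gc --grace-period=0s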