Note that this reversal only happens within docker. Outside of docker, I see /sys/fs/cgroup/cpu,cpuacct.
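The likely source of the reversed name can be seen by comparing the directory name against the controller list in the mount options; a quick check (output is illustrative of RHEL 7-era kernels, where the option order differs from the directory name):
$ ls -d /sys/fs/cgroup/cpu,cpuacct
/sys/fs/cgroup/cpu,cpuacct
$ grep cpuacct /proc/mounts
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
Software that reconstructs the mount path from the controller list ends up looking for the reversed name.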
@crawford interesting. Thanks for adding that.
@derekwaynecarr does this truly look like the same issue you had seen before?
@crawford / @derekwaynecarr:
Do we have a good understanding of how hard this will be to fix? I know @derekwaynecarr noted he's looked at this before and thought it had been fixed already.
A heads-up to those watching this issue: work on identifying the root cause has started.
Here's what I found so far with Docker version 1.13.1 on RHEL 7.
I brought up the RHCOS Vagrant box (binary at http://aos-ostree.rhev-ci-vms.eng.rdu2.redhat.com/rhcos/images/cloud/latest/) to see if the error persists there:
$ vagrant box add --name RHCOS rhcos-vagrant-libvirt.box
$ mkdir rhcos && cd rhcos && vagrant init RHCOS && vagrant up
$ vagrant ssh
RPM Overlaying
(Testing with the Docker 1.13.1-70 build for RHEL 7.)
$ sudo ostree admin unlock --hotfix
$ rpm -qa | grep docker
$ sudo rpm-ostree override replace *.rpm
$ sudo rpm-ostree status -v
$ sudo reboot
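A quick post-reboot sanity check (a sketch; exact output depends on the overlaid build):
$ rpm -q docker          # should now report the 1.13.1-70 build
$ rpm-ostree status      # the active deployment should list the replaced packages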
Ran the following command to start the kubelet:
/usr/bin/docker \
run \
--rm \
--net host \
--pid host \
--privileged \
--volume /dev:/dev:rw \
--volume /sys:/sys:ro \
--volume /var/run:/var/run:rw \
--volume /var/lib/cni/:/var/lib/cni:rw \
--volume /var/lib/docker/:/var/lib/docker:rw \
--volume /var/lib/kubelet/:/var/lib/kubelet:shared \
--volume /var/log:/var/log:shared \
--volume /etc/kubernetes:/etc/kubernetes:ro \
--entrypoint /usr/bin/hyperkube \
"openshift/origin-node" \
kubelet \
--bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--rotate-certificates \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--cni-bin-dir=/var/lib/cni/bin \
--network-plugin=cni \
--lock-file=/var/run/lock/kubelet.lock \
--exit-on-lock-contention \
--pod-manifest-path=/etc/kubernetes/manifests \
--allow-privileged \
--node-labels=node-role.kubernetes.io/master \
--minimum-container-ttl-duration=6m0s \
--cluster-dns=10.3.0.10 \
--cluster-domain=cluster.local \
--client-ca-file=/etc/kubernetes/ca.crt \
--anonymous-auth=false \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule
Which gave me the following output:
Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --allow-privileged has been deprecated, will be removed in a future version
Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.
Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
I0705 23:32:18.768616 2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:18.768902 2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036018 2186 server.go:415] Version: v1.11.0+d4cacc0
I0705 23:32:19.036062 2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036110 2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036123 2186 server.go:493] acquiring file lock on "/var/run/lock/kubelet.lock"
I0705 23:32:19.036150 2186 server.go:498] watching for inotify events for: /var/run/lock/kubelet.lock
I0705 23:32:19.036262 2186 plugins.go:97] No cloud provider specified.
W0705 23:32:19.036290 2186 server.go:556] standalone mode, no API client
F0705 23:32:19.036300 2186 server.go:262] failed to run Kubelet: No authentication method configured
So it looks like the cgroup error isn't showing with this Docker version, unless my reproduction steps are incorrect.
Update: see the comments below; the test in this comment was insufficient to identify the problem. The kubelet above died at the authentication check ("No authentication method configured") while running in standalone mode, so it likely never reached the cgroup code path.
I've encountered an error when mounting NFS shared folders (i.e., at /vagrant), and running exportfs -a -v doesn't change anything. @cgwalters may have already fixed this error, as suggested by @peterbaouoft. The full error log is in this gist.
@Bubblemelon this makes me wonder if the fix was applied at build time via a patch. It may be worth using rpmdev-extract
to take a look at the contents of the SRPM and see what patches (if any) are applied.
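A sketch of that check (the SRPM filename is illustrative; substitute the build under test):
$ rpmdev-extract docker-1.13.1-63.git94f4240.el7.src.rpm
$ ls docker-*/ | grep -i '\.patch$'           # the patch files themselves
$ grep -E '^Patch[0-9]+:' docker-*/*.spec     # patches declared in the spec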
Using this Libvirt howto guide to verify the assumptions in my above comment about docker's cgroup driver:
Master Node Info
RHCOS version: source
[core@coreos-220-master-0 ~]$ rpm-ostree status -v
State: idle; auto updates disabled
Deployments:
● ostree://rhcos:openshift/3.10/x86_64/os
Version: 3.10-7.5.235 (2018-07-06 22:41:39)
Commit: f51faab9a702e0d85905f3edc81641a63c9ec3c8acf0319e52d03de03de67e5f
└─ atomic-centos-continuous (2018-07-06 20:45:09)
└─ dustymabe-ignition (2018-07-03 00:29:34)
└─ rhcos-continuous (2018-07-06 19:25:38)
└─ rhel-7.5-server (2018-05-02 10:10:39)
└─ rhel-7.5-server-optional (2018-05-02 10:06:54)
└─ rhel-7.5-server-extras (2018-05-02 13:57:35)
└─ rhel-7.5-atomic (2017-07-11 17:45:34)
└─ openshift (2018-07-06 21:46:14)
Staged: no
StateRoot: rhcos
Docker version (packages built 2018-04-30 15:56:58):
[core@coreos-220-master-0 ~]$ rpm -qa | grep docker
docker-client-1.13.1-63.git94f4240.el7.x86_64
docker-rhel-push-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-common-1.13.1-63.git94f4240.el7.x86_64
docker-1.13.1-63.git94f4240.el7.x86_64
docker-novolume-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-lvm-plugin-1.13.1-63.git94f4240.el7.x86_64
Output from $ journalctl -u docker:
Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: F0709 17:51:20.232817 25049 server.go:262] failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: time="2018-07-09T17:51:20.267459141Z" level=error msg="containerd: deleting container" error="exit status 1: \"container b85300b2eee4b379bec5753361f37e11bcb8cacdd7c4aa6c9179d62eb93ab001 does not exist\\none or more of the container deletions failed\\n\""
Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: time="2018-07-09T17:51:20.298990686Z" level=warning msg="b85300b2eee4b379bec5753361f37e11bcb8cacdd7c4aa6c9179d62eb93ab001 cleanup: failed to unmount secrets: invalid argument"
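Each side's driver can be confirmed directly; for docker (output illustrative):
$ docker info 2>/dev/null | grep -i 'cgroup driver'
Cgroup Driver: systemd
The kubelet defaults to --cgroup-driver=cgroupfs unless told otherwise, which is the mismatch reported above.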
While trying to resolve the kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd" error, I found openshift issue #18776, which suggests placing
ExecStart=/usr/bin/dockerd \
--exec-opt native.cgroupdriver=systemd
within docker.service. However, the /usr directory is read-only, and docker.service already contains the following:
[core@coreos-220-master-0 system]$ cat docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target rhel-push-plugin.socket registries.service
Wants=docker-storage-setup.service
Requires=rhel-push-plugin.socket registries.service
Requires=docker-cleanup.timer
[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/run/containers/registries.conf
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
Environment=DOCKER_HTTP_HOST_COMPAT=1
Environment=PATH=/usr/libexec/docker:/usr/bin:/usr/sbin
ExecStart=/usr/bin/dockerd-current \
--add-runtime docker-runc=/usr/libexec/docker/docker-runc-current \
--default-runtime=docker-runc \
--authorization-plugin=rhel-push-plugin \
--exec-opt native.cgroupdriver=systemd \ <--------------
--userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
--init-path=/usr/libexec/docker/docker-init-current \
--seccomp-profile=/etc/docker/seccomp.json \
$OPTIONS \
$DOCKER_STORAGE_OPTIONS \
$DOCKER_NETWORK_OPTIONS \
$ADD_REGISTRY \
$BLOCK_REGISTRY \
$INSECURE_REGISTRY \
$REGISTRIES
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=0
Restart=on-abnormal
KillMode=process
[Install]
WantedBy=multi-user.target
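As an aside, since units under /etc/systemd/system take precedence over those shipped in /usr, one way to adjust a unit on a read-only /usr is to copy it there first (a sketch; the /usr/lib path is the usual RHEL location and is assumed here):
$ sudo cp /usr/lib/systemd/system/docker.service /etc/systemd/system/docker.service
$ sudo vi /etc/systemd/system/docker.service
$ sudo systemctl daemon-reload && sudo systemctl restart docker
Here that wasn't needed, because the shipped unit already passes --exec-opt native.cgroupdriver=systemd (arrow above); the mismatch had to be fixed on the kubelet side instead.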
The error above,
failed to create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
can be resolved by adding --cgroup-driver=systemd \ to kubelet.service:
[Unit]
Description=Kubernetes Kubelet
...
[Service]
...
ExecStart=/usr/bin/docker \
run \
.
.
"openshift/origin-node:latest" \
kubelet \
.
.
.
--cgroup-driver=systemd \
After running sudo systemctl daemon-reload && sudo systemctl restart kubelet, both journalctl -u docker and journalctl -u kubelet show the same output:
kubelet.go:1769] skipping pod synchronization - [container runtime is down]
kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory
kubelet.service: main process exited, code=exited, status=255/n/a
Unit kubelet.service entered failed state.
kubelet.service failed.
kubelet_node_status.go:79] Attempting to register node coreos-220-master-0
kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory
kubelet.service: main process exited, code=exited, status=255/n/a
Unit kubelet.service entered failed state.
kubelet.service failed.
$ rpm -qa | grep docker
docker-client-1.13.1-63.git94f4240.el7.x86_64
docker-rhel-push-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-common-1.13.1-63.git94f4240.el7.x86_64
docker-1.13.1-63.git94f4240.el7.x86_64
docker-novolume-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-lvm-plugin-1.13.1-63.git94f4240.el7.x86_64
Great work debugging @Bubblemelon!
Also thank you @crawford for helping me!
Just to clarify, something on the kubelet side is causing the Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory error.
I've also tried it out with this docker version: source - Sun, 08 Jul 2018 09:39:40 UT
docker-1.13.1-72.git6f36bd4.el7.x86_64
docker-rhel-push-plugin-1.13.1-72.git6f36bd4.el7.x86_64
docker-client-1.13.1-72.git6f36bd4.el7.x86_64
docker-lvm-plugin-1.13.1-72.git6f36bd4.el7.x86_64
docker-common-1.13.1-72.git6f36bd4.el7.x86_64
docker-novolume-plugin-1.13.1-72.git6f36bd4.el7.x86_64
Which gave the same error.
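One quick way to check whether the newer build carried a relevant change (a sketch, run on the box with the -72 packages overlaid):
$ rpm -q --changelog docker | head -n 20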
I'd like to note that openshift/origin-node:latest (i.e., openshift v3.11.0-alpha.0+90e2736-260) is running Kubernetes v1.11.0+d4cacc0.
That version of the kubelet should include this fix.
@derekwaynecarr what are your thoughts on this?
cAdvisor doesn't like /sys:/sys:ro. See https://github.com/google/cadvisor/issues/1843
This same error,
kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory
still occurs when /sys is changed to read-write within the kubelet.service file:
.
.
ExecStart=/usr/bin/docker \
run \
.
.
--volume /sys:/sys:rw \
.
Note that on RHCOS, the path is named /sys/fs/cgroup/cpu,cpuacct.
If both of these volumes are added under ExecStart=/usr/bin/docker \,
--volume /sys:/sys:rw \
--volume=/sys/fs/cgroup/cpu,cpuacct:/sys/fs/cgroup/cpuacct,cpu:rw \
this error occurs:
kubelet.service holdoff time over, scheduling restart.
Starting Kubernetes Kubelet...
Started Kubernetes Kubelet.
container_linux.go:247: starting container process caused "process_linux.go:364: container init caused \"rootfs_linux.go:54: mounting \\\"/sys/fs/cgroup/cpu,cpuacct\\\" to rootfs \\\"/var/lib/docker/overlay2/8c95a16f4cad1f014091093c62248c6c0f27bcde879606cef6220f7db4521708/merged\\\" at \\\"/var/lib/docker/overlay2/8c95a16f4cad1f014091093c62248c6c0f27bcde879606cef6220f7db4521708/merged/sys/fs/cgroup/cpuacct,cpu\\\" caused \\\"no space left on device\\\"\""
/usr/bin/docker-current: Error response from daemon: oci runtime error: Failed to remove paths: map[cpu:/sys/fs/cgroup/cpu,cpuacct/system.slice/docker-afc3a2d6c323ed28a6c7e6586239cb4db8b79b591513eb229ca6fa1eb0bead3b.scope cpuacct:/sys/fs/cgroup/cpu,cpuacct/system.slice/docker-afc3a2d6c323ed28a6c7e6586239cb4db8b79b591513eb229ca6fa1eb0bead3b.scope].
@crawford do you mind stating what priority you think this should have, or whether the workaround in use should be applied in the RHCOS spins themselves? This would clarify whether @Bubblemelon and @mrunalp should keep digging on this specific issue.
This needs to be fixed in the Kubelet. If the OS team is going to tackle that, then I think this bug should stay. Otherwise, let's close this and let @derekwaynecarr and his team tackle the issue. Either way, this is a low priority. I have a workaround (it's ugly, but it works).
Since this is kubelet related we should pass it over to @derekwaynecarr's team and link back to this issue so they don't have to re-do all of the good debugging done so far.
Moved this issue over to openshift/origin
Closing since the fix must be done in another codebase.
@crawford found in his testing that /sys/fs/cgroup/cpuacct,cpu is expected, but RHCOS provides /sys/fs/cgroup/cpu,cpuacct. https://github.com/kubernetes/kubernetes/issues/32728#issuecomment-252469277 describes a similar issue. The workaround is to set up a link from one to the other.
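A minimal sketch of that workaround, assuming /sys/fs/cgroup is the usual read-only tmpfs (the remounts can be dropped if it is already writable):
$ sudo mount -o remount,rw /sys/fs/cgroup
$ sudo ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
$ sudo mount -o remount,ro /sys/fs/cgroup
Note the link does not persist across reboots, so a unit or tmpfiles.d entry would be needed to reapply it.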