Closed · tsde closed this issue 1 year ago
The bug is reproduced on RKE v1.4.8.

Steps: `docker restart kubelet`

Result: `k8s_POD_` containers are restarting.

Node Info:
Hello @jiaqiluo, thanks for your feedback.
Is the fix I proposed in https://github.com/rancher/rke-tools/pull/164 suitable to be released soon? Maybe @kinarashah or @jakefhyde haven't had time to review it yet?
Some questions from QA to follow up on:

The fix is available since `rke-tools` v0.1.92, so any k8s version that uses an `rke-tools` image whose tag is v0.1.92 or higher should have the fix.

Waiting for the next RKE RC.
This is ready to test on https://github.com/rancher/rke/releases/tag/v1.4.11-rc1
Reproduced w/ RKE v1.4.8 and k8s v1.24.8-rancher1-1:

- Using RKE v1.4.8, spin up a single-node cluster using k8s v1.24.8-rancher1-1
- Once the cluster is active, ssh into the node and run `docker restart kubelet`
- `k8s_POD_` containers are restarted; unexpected behavior

Verified w/ RKE v1.4.11-rc1 and k8s v1.26.8-rancher1-1:

- Using RKE v1.4.11-rc1, spin up a single-node cluster using k8s v1.27.6-rancher1-1
- Once the cluster is active, ssh into the node and run `docker restart kubelet`
RKE version: v1.4.6

Docker version: (`docker version`, `docker info` preferred) 20.10.23

Docker info:

```
Server:
 Containers: 16
  Running: 9
  Paused: 0
  Stopped: 7
 Images: 18
 Server Version: 20.10.23
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 92b3a9d6f1b3bcc6dc74875cfdea653fe39f09c2
 runc version: 81a44cf162f4409cc6ff656e2433b87321bf8a7a
 init version:
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.113-flatcar
 Operating System: Flatcar Container Linux by Kinvolk 3510.2.3 (Oklo)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 5.764GiB
 Name: fc-test-01
 ID: 7ZKL:S2NN:LZ6E:5747:QXU3:LV7E:6THC:ERRD:5ARO:5BKF:LL5Y:TDAV
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```
Operating system and kernel: (`cat /etc/os-release`, `uname -r` preferred) Flatcar Container Linux 3510.2.3, kernel 5.15.113-flatcar (also tested on Ubuntu 22.04)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) VM (vSphere / VirtualBox)
`cluster.yml` file:
Steps to Reproduce: `docker restart kubelet` (you can also trigger a restart by modifying a kubelet setting in `cluster.yml`)

Results: All pods running on the same node as the kubelet are restarted.

Note that this does not affect nodes using cgroup v1. Only cgroup v2 nodes are impacted (i.e. docker and containerd are properly configured to use cgroup v2).
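The reproduction above can be sketched as a small helper script. This is a hypothetical illustration, not part of the issue: it assumes the docker CLI on an RKE node where the kubelet runs as a container named `kubelet`; on any other host it just reports that docker is unreachable.

```shell
#!/bin/sh
# Compare the k8s_POD_ (pause) containers' Status before and after a kubelet
# restart. On an affected cgroup v2 node, the "Up ..." times reset after the
# restart, showing the pods were recreated.
repro_check() {
  if ! command -v docker >/dev/null 2>&1 || ! docker info >/dev/null 2>&1; then
    echo "docker daemon not reachable; run this on an RKE node"
    return 0
  fi
  echo "pause containers before kubelet restart:"
  docker ps --filter name=k8s_POD_ --format '{{.Names}}: {{.Status}}'
  docker restart kubelet
  sleep 10
  echo "pause containers after kubelet restart:"
  docker ps --filter name=k8s_POD_ --format '{{.Names}}: {{.Status}}'
}

repro_check
```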
What I expected: A restart of kubelet should not impact pods running in the cluster.
Observations: I've had this issue for quite some time now but couldn't find time to properly investigate. I was eventually able to pinpoint the root cause. It is caused by this piece of code in the `entrypoint.sh` script of the `rke-tools` image used to start the kubelet. This code is 5 years old now and is only relevant to cgroup v1.

For quite some time now, kubelet has been started with the `cgroups-per-qos` option set to `True`. This implies that kubelet will create its own cgroup hierarchy under the root cgroup on cgroup v2 systems. You end up with a directory `/sys/fs/cgroup/kubepods.slice` created by kubelet, and each QoS class then has its own cgroup tree underneath it.

The problem with the `entrypoint.sh` script shipped with `rke-tools` (and mounted into the kubelet container) is that each time kubelet restarts, a directory named `kubepods` is created in `/sys/fs/cgroup/kubepods.slice`. This triggers a deletion of the whole `kubepods.slice` hierarchy by the `systemd` process, as seen in the system logs. When kubelet comes back, it cannot find its cgroup hierarchy anymore, creates a new one, and restarts all pods on the node.
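The hierarchy described above can be inspected directly on a node. A minimal sketch, assuming the systemd cgroup driver's default `/sys/fs/cgroup/kubepods.slice` path (the function name is my own; the path only exists on a node where kubelet has run):

```shell
#!/bin/sh
# List the per-QoS sub-slices kubelet creates under kubepods.slice on a
# cgroup v2 (unified hierarchy) node. Burstable and BestEffort pods get
# their own sub-slices; Guaranteed pods sit directly under kubepods.slice.
show_kubepods_slices() {
  root=/sys/fs/cgroup/kubepods.slice
  if [ -d "$root" ]; then
    ls -d "$root"/kubepods-*.slice 2>/dev/null \
      || echo "no QoS sub-slices under $root"
  else
    echo "no kubelet cgroup hierarchy at $root"
  fi
}

show_kubepods_slices
```

On an affected node, watching this path across a `docker restart kubelet` shows the hierarchy disappear and come back, which matches the pod restarts observed above.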
How to fix: The piece of code related to cgroup v1 in `entrypoint.sh` should be confined to only run on cgroup v1 systems. PR: https://github.com/rancher/rke-tools/pull/164

EDIT: Added link to the `rke-tools` PR

SURE-6766
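One way such a guard could look is sketched below. This is not the actual `rke-tools` code, just an illustration: on a cgroup v2 host the unified hierarchy is mounted as `cgroup2fs`, which `stat -fc %T` can detect, so the legacy v1-only setup can be skipped there.

```shell
#!/bin/sh
# Detect the host's cgroup version by the filesystem type of /sys/fs/cgroup:
# cgroup2fs on the unified (v2) hierarchy, tmpfs on a legacy v1 layout.
cgroup_version() {
  case "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)" in
    cgroup2fs) echo 2 ;;
    *)         echo 1 ;;
  esac
}

if [ "$(cgroup_version)" = "1" ]; then
  : # legacy cgroup v1 setup would go here (e.g. creating a kubepods cgroup);
    # skipped on v2, so systemd never sees a stray dir under kubepods.slice
fi
```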