rancher / rke

Rancher Kubernetes Engine (RKE), an extremely simple, lightning fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

On cgroup v2 systems, a restart of kubelet container triggers a restart of all pods on the node #3280

Closed · tsde closed this issue 1 year ago

tsde commented 1 year ago

RKE version: v1.4.6

Docker version: (`docker version`, `docker info` preferred) 20.10.23

Docker info:

```
Server:
 Containers: 16
  Running: 9
  Paused: 0
  Stopped: 7
 Images: 18
 Server Version: 20.10.23
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 92b3a9d6f1b3bcc6dc74875cfdea653fe39f09c2
 runc version: 81a44cf162f4409cc6ff656e2433b87321bf8a7a
 init version:
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.113-flatcar
 Operating System: Flatcar Container Linux by Kinvolk 3510.2.3 (Oklo)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 5.764GiB
 Name: fc-test-01
 ID: 7ZKL:S2NN:LZ6E:5747:QXU3:LV7E:6THC:ERRD:5ARO:5BKF:LL5Y:TDAV
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```

Operating system and kernel: (`cat /etc/os-release`, `uname -r` preferred) Flatcar Container Linux 3510.2.3, kernel 5.15.113-flatcar

(also tested on Ubuntu 22.04)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) VM (vSphere / VirtualBox)

cluster.yml file:

```yaml
nodes:
- address: fc-test-01
  port: "22"
  internal_address: 192.168.56.10
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: core
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_ed25519
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: fc-test-02
  port: "22"
  internal_address: 192.168.56.11
  role:
  - worker
  hostname_override: ""
  user: core
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_ed25519
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_args_array: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_args_array: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_args_array: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_args_array: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    pod_security_configuration: ""
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args:
      cluster-signing-cert-file: /etc/kubernetes/ssl/kube-ca.pem
      cluster-signing-key-file: /etc/kubernetes/ssl/kube-ca-key.pem
    extra_args_array: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_args_array: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_args_array: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_args_array: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_args_array: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_args_array: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_args_array: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_args_array: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: calico
  options:
    calico_cloud_provider: none
    calico_flex_volume_plugin_dir: /var/lib/kubelet/volumeplugins
  mtu: 0
  node_selector: {}
  update_strategy: null
  tolerations: []
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/mirrored-coreos-etcd:v3.5.6
  alpine: rancher/rke-tools:v0.1.89
  nginx_proxy: rancher/rke-tools:v0.1.89
  cert_downloader: rancher/rke-tools:v0.1.89
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.89
  kubedns: rancher/mirrored-k8s-dns-kube-dns:1.22.20
  dnsmasq: rancher/mirrored-k8s-dns-dnsmasq-nanny:1.22.20
  kubedns_sidecar: rancher/mirrored-k8s-dns-sidecar:1.22.20
  kubedns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.6
  coredns: rancher/mirrored-coredns-coredns:1.9.4
  coredns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.6
  nodelocal: rancher/mirrored-k8s-dns-node-cache:1.22.20
  kubernetes: rancher/hyperkube:v1.26.4-rancher2
  flannel: rancher/mirrored-flannel-flannel:v0.21.4
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher8
  calico_node: rancher/mirrored-calico-node:v3.25.0
  calico_cni: rancher/calico-cni:v3.25.0-rancher1
  calico_controllers: rancher/mirrored-calico-kube-controllers:v3.25.0
  calico_ctl: rancher/mirrored-calico-ctl:v3.25.0
  calico_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.25.0
  canal_node: rancher/mirrored-calico-node:v3.25.0
  canal_cni: rancher/calico-cni:v3.25.0-rancher1
  canal_controllers: rancher/mirrored-calico-kube-controllers:v3.25.0
  canal_flannel: rancher/mirrored-flannel-flannel:v0.21.4
  canal_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.25.0
  weave_node: weaveworks/weave-kube:2.8.1
  weave_cni: weaveworks/weave-npc:2.8.1
  pod_infra_container: rancher/mirrored-pause:3.7
  ingress: rancher/nginx-ingress-controller:nginx-1.7.0-rancher1
  ingress_backend: rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1
  ingress_webhook: rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20230312-helm-chart-4.5.2-28-g66a760794
  metrics_server: rancher/mirrored-metrics-server:v0.6.3
  windows_pod_infra_container: rancher/mirrored-pause:3.7
  aci_cni_deploy_container: noiro/cnideploy:5.2.7.1.81c2369
  aci_host_container: noiro/aci-containers-host:5.2.7.1.81c2369
  aci_opflex_container: noiro/opflex:5.2.7.1.81c2369
  aci_mcast_container: noiro/opflex:5.2.7.1.81c2369
  aci_ovs_container: noiro/openvswitch:5.2.7.1.81c2369
  aci_controller_container: noiro/aci-containers-controller:5.2.7.1.81c2369
  aci_gbp_server_container: noiro/gbp-server:5.2.7.1.81c2369
  aci_opflex_server_container: noiro/opflex-server:5.2.7.1.81c2369
ssh_key_path: ~/.ssh/id_ed25519
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
enable_cri_dockerd: null
kubernetes_version: ""
private_registries: []
ingress:
  provider: "none"
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
  http_port: 0
  https_port: 0
  network_mode: ""
  tolerations: []
  default_backend: null
  default_http_backend_priority_class_name: ""
  nginx_ingress_controller_priority_class_name: ""
  default_ingress_class: null
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
  ignore_proxy_env_vars: false
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
  tolerations: []
  metrics_server_priority_class_name: ""
restore:
  restore: false
  snapshot_name: ""
rotate_encryption_key: false
dns: null
```

Steps to Reproduce:

1. Deploy an RKE cluster with at least one node running cgroup v2 (unified hierarchy).
2. On that node, restart the kubelet container: `docker restart kubelet`.

Results: All pods running on the same node as the kubelet are restarted. Note that this does not affect nodes using cgroup v1; only cgroup v2 nodes are impacted (i.e. nodes where docker and containerd are properly configured to use cgroup v2).
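A minimal way to trigger and observe this from the control machine (commands are illustrative; the node name is a placeholder taken from this cluster.yml, and `kubectl` plus SSH access to the node are assumed):

```sh
# List pods scheduled on the affected node and note their ages / restart counts.
kubectl get pods -A -o wide --field-selector spec.nodeName=fc-test-02

# On the node itself, restart only the kubelet container.
docker restart kubelet

# Back on the control machine: on a cgroup v2 node, the same pods now show
# fresh start times / increased restart counts, even though only kubelet
# was restarted.
kubectl get pods -A -o wide --field-selector spec.nodeName=fc-test-02
```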

What I expected: A restart of kubelet should not impact pods running in the cluster.

Observations: I've had this issue for quite some time now, but I couldn't find the time to investigate it properly. I was eventually able to pinpoint the root cause: it is this piece of code in the entrypoint.sh script of the rke-tools image used to start the kubelet. That code is 5 years old and is only relevant to cgroup v1.

For quite some time now, kubelet has been started with the cgroups-per-qos option set to true. On cgroup v2 systems, this means kubelet creates its own cgroup hierarchy under the root cgroup: you end up with a /sys/fs/cgroup/kubepods.slice directory created by kubelet, and each QoS class gets its own cgroup tree underneath it.
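To see both pieces on a node, one can check the cgroup version and the kubelet-created hierarchy directly (a quick check added here for illustration, not part of the original report):

```sh
# "cgroup2fs" means the unified cgroup v2 hierarchy; "tmpfs" indicates cgroup v1.
stat -fc %T /sys/fs/cgroup/

# With cgroups-per-qos enabled and the systemd cgroup driver, kubelet creates
# kubepods.slice under the root cgroup, with per-QoS slices underneath it.
ls -d /sys/fs/cgroup/kubepods.slice/kubepods-*.slice
```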

The problem with the entrypoint.sh script shipped with rke-tools (and mounted into the kubelet container) is that each time kubelet restarts, a directory named kubepods is created in /sys/fs/cgroup/kubepods.slice. This triggers a deletion of the whole kubepods.slice hierarchy by systemd, as seen in the system logs:

```
Jul 04 13:18:30 fc-test-01 systemd[1]: docker-38d41b2bf969dfc2fda19bd00b1db1a90f65543eae652ac8d97583d90f226c43.scope: Deactivated successfully.
Jul 04 13:18:30 fc-test-01 systemd[1]: docker-1652f8e80c2df2e790c89adfe55e55a3d0ed3e26073cb4ba467ee1392f93c979.scope: Deactivated successfully.
Jul 04 13:18:30 fc-test-01 systemd[1]: Stopped docker-1652f8e80c2df2e790c89adfe55e55a3d0ed3e26073cb4ba467ee1392f93c979.scope.
Jul 04 13:18:30 fc-test-01 systemd[1]: docker-1652f8e80c2df2e790c89adfe55e55a3d0ed3e26073cb4ba467ee1392f93c979.scope: Consumed 3.336s CPU time.
Jul 04 13:18:30 fc-test-01 systemd[1]: Removed slice kubepods-burstable-pod79ff3635_4a2f_4b2a_a2d0_6c23ceffda4d.slice.
Jul 04 13:18:30 fc-test-01 systemd[1]: kubepods-burstable-pod79ff3635_4a2f_4b2a_a2d0_6c23ceffda4d.slice: Consumed 4.226s CPU time.
Jul 04 13:18:30 fc-test-01 systemd[1]: Removed slice kubepods-besteffort.slice.
Jul 04 13:18:30 fc-test-01 systemd[1]: kubepods-besteffort.slice: Consumed 1.136s CPU time.
Jul 04 13:18:30 fc-test-01 systemd[1]: Removed slice kubepods-burstable.slice.
Jul 04 13:18:30 fc-test-01 systemd[1]: kubepods-burstable.slice: Consumed 4.226s CPU time.
Jul 04 13:18:30 fc-test-01 systemd[1]: Removed slice kubepods.slice.
Jul 04 13:18:30 fc-test-01 systemd[1]: kubepods.slice: Consumed 5.362s CPU time.
Jul 04 13:18:30 fc-test-01 systemd[1]: var-lib-docker-containers-38d41b2bf969dfc2fda19bd00b1db1a90f65543eae652ac8d97583d90f226c43-mounts-shm.mount: Deactivated successfully.
Jul 04 13:18:30 fc-test-01 systemd[1]: var-lib-docker-overlay2-ec128d653e5b33aaafe4f6a2396688c4b626963f3d6a411ff262d984b05489d4-merged.mount: Deactivated successfully.
Jul 04 13:18:30 fc-test-01 systemd[1]: var-lib-docker-overlay2-62214adaae9ab866c515ce630fa1bea8057af552f01ae5709dd57960360943bf-merged.mount: Deactivated successfully.
```
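The same sequence can be watched live on an affected node (generic commands, not taken from the original report):

```sh
# In one terminal: follow the journal for kubepods slice events.
journalctl -f | grep --line-buffered kubepods

# In another terminal on the same node: restart only the kubelet container
# and watch systemd remove the kubepods*.slice units shown above.
docker restart kubelet
```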

When kubelet comes back up, it can no longer find its cgroup hierarchy, so it creates a new one and restarts all pods on the node.

How to fix: The cgroup v1-related code in entrypoint.sh should be confined to run only on cgroup v1 systems. PR: https://github.com/rancher/rke-tools/pull/164
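The idea behind the fix, roughly (a sketch of the guard for illustration only, not the literal diff from the PR):

```sh
# Sketch: gate the legacy kubepods cgroup setup on the host's cgroup version.
if [ "$(stat -fc %T /sys/fs/cgroup/)" = "cgroup2fs" ]; then
  # cgroup v2: kubelet manages kubepods.slice itself through systemd, so the
  # entrypoint must not create a kubepods cgroup here (doing so is what makes
  # systemd tear down the existing kubepods.slice hierarchy).
  echo "cgroup v2 detected, skipping legacy kubepods cgroup setup"
else
  # cgroup v1: keep the historical behavior of pre-creating the kubepods
  # cgroup under each controller hierarchy.
  for controller in /sys/fs/cgroup/*; do
    [ -d "$controller" ] && mkdir -p "$controller/kubepods"
  done
fi
```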

EDIT: Added link to the rke-tools PR

SURE-6766

jiaqiluo commented 1 year ago

The bug is reproduced on rke v1.4.8.
Steps:

Result:

Node Info:

cat /etc/os-release:

```
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
```

docker info:

```
Client:
 Context: default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.10.4-docker)
  compose: Docker Compose (Docker Inc., v2.20.2)

Server:
 Containers: 36
  Running: 20
  Paused: 0
  Stopped: 16
 Images: 15
 Server Version: 20.10.24
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
 runc version: v1.1.8-0-g82f18fe
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-67-generic
 Operating System: Ubuntu 22.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.763GiB
 Name: jiaqi-2204-1
 ID: 7TSF:CDEC:GFX6:R2QR:HRU2:JESQ:KQBK:Q6E7:A2TT:C3QX:W2B3:ZW2M
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```
tmsdce commented 1 year ago

Hello @jiaqiluo, thanks for your feedback.

Is the fix I proposed in https://github.com/rancher/rke-tools/pull/164 suitable to be released soon? Maybe @kinarashah or @jakefhyde haven't had time to review it yet?

slickwarren commented 1 year ago

Some questions from QA to follow up on:

jiaqiluo commented 1 year ago

The fix is available as of rke-tools v0.1.92, so any k8s version that uses an rke-tools image tagged v0.1.92 or higher should have the fix.
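One way to double-check which rke-tools image a node is actually running (a generic check, not an official procedure):

```sh
# List the RKE infrastructure containers and their images; the rke-tools tag
# should be v0.1.92 or newer to include the fixed entrypoint.sh.
docker ps -a --format '{{.Names}}\t{{.Image}}' | grep rke-tools
```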

jiaqiluo commented 1 year ago

Waiting for the next RKE RC.

snasovich commented 1 year ago

This is ready to test on https://github.com/rancher/rke/releases/tag/v1.4.11-rc1

Josh-Diamond commented 1 year ago

Ticket #3280 - Test Results - ✅

Reproduced w/ RKE v1.4.8 and k8s v1.24.8-rancher1-1:

  1. Using RKE v1.4.8, spin up a single-node cluster using k8s v1.24.8-rancher1-1
  2. Once active, ssh into the node and run docker restart kubelet
  3. Reproduced - containers w/ prefix k8s_POD_ are restarted; unexpected behavior

Verified w/ RKE v1.4.11-rc1 and k8s v1.26.8-rancher1-1:

  1. Using RKE v1.4.11-rc1, spin up a single-node cluster using k8s v1.27.6-rancher1-1
  2. Once active, ssh into the node and run docker restart kubelet
  3. Verified - only kubelet is restarted; as expected
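For anyone re-running this check, comparing the pod container IDs before and after the restart makes the difference easy to see (suggested commands, not part of the original test steps):

```sh
# On the node: record the kubelet-managed (k8s_*) container IDs, restart
# kubelet, then compare.
docker ps --format '{{.ID}} {{.Names}}' | grep k8s_ | sort > /tmp/before.txt
docker restart kubelet
sleep 30
docker ps --format '{{.ID}} {{.Names}}' | grep k8s_ | sort > /tmp/after.txt
# With the fix the lists match; without it, the pod containers have been
# recreated with new IDs.
diff /tmp/before.txt /tmp/after.txt
```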