[4.11] Kubelet consumes a lot of CPU and hangs after running a lot of cronjobs

yaroslavkasatikov commented 2 years ago

[vrutkovs] See below for thread summary

Hi team, After upgrading to 4.11 we faced to new issue:

We use cronjobs with ' *'. After upgrading some pods from cronjobs stuck in "ContainerCreating" or "Init 0/1" status. In pod describe we can see: `Events: Type Reason Age From Message

Normal Scheduled 51s default-scheduler Successfully assigned 0xbet-prod/podname7673773-q6bq5 to ip-10-0-216-195.eu-central-1.compute.internal by ip-10-0-201-118 Warning FailedCreatePodContainer 9s kubelet unable to ensure pod container exists: failed to create container for [kubepods burstable pod3847723b-b7c8-4adc-a9d7-f3cdb83ae03f] : Timeout waiting for systemd to create kubepods-burstable-pod3847723b_b7c8_4adc_a9d7_f3cdb83ae03f.slice Normal AddedInterface multus Add eth0 [10.133.10.218/23] from ovn-kubernetes Normal Pulled kubelet Container image "ghcr.io/banzaicloud/vault-env:1.13.0" already present on machine Normal Created kubelet Created container copy-vault-env Normal Started kubelet Started container copy-vault-env .... `

The symptom is pod scheduling slower and slower and after some time it stuck on Normal Scheduled 51s default-scheduler Successfully assigned 0xbet-prod/podname7673773-q6bq5 to ip-10-0-216-195.eu-central-1.compute.internal by ip-10-0-201-118

While login to node , I can see this in journalctl:

`Aug 13 21:38:49 ip-10-0-216-195 hyperkube[1546]: I0813 21:38:49.823654 1546 pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods burstable podbcc994a0-6720-48fc-889b

Aug 13 21:38:49 ip-10-0-216-195 hyperkube[1546]: I0813 21:38:49.834117 1546 pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods burstable pod36584156-6776-410f-861b

Aug 13 21:38:49 ip-10-0-216-195 hyperkube[1546]: I0813 21:38:49.835958 1546 pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods burstable pod24ca3b44-6fb3-47be-b111

Aug 13 21:38:49 ip-10-0-216-195 hyperkube[1546]: I0813 21:38:49.837512 1546 pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods burstable pod23e81c45-5fd2-4e7a-975c`

Reboot helps, but not for a long time. As result pods can't schedule and stuck on whole cluster.

Short thread summary.

What we know so far:

bug is happening on any new container starts, cronjob is an easy way to trigger it, but other ways to create new pods would also show the same behaviour
kubelet seems to use a lot of CPU and RAM
switching to cgroupsv2 seems to stop RAM bleeding, but doesn't help with CPU load
- Note that fresh OKD 4.9 installs are using cgroupsv2 by default, upgraded cluster need to be switched manually

Probable cause:

latest kernel is known to leak blkios, this is mitigated by the workaround (see below)

Workaround: https://github.com/okd-project/okd/issues/1310#issuecomment-1312848841 - thanks to @framelnl there's a DaemonSet which can clean up the extra cgroups

Upstream issue refs:

https://github.com/kubernetes/kubernetes/issues/106957#issuecomment-1147441007
https://github.com/kubernetes/kubernetes/pull/113050 (1.25 cherrypick of crio fix)

kai-uwe-rommel commented 1 year ago

So we'll have to wait a little longer ... I have noticed BTW that when I see that kubelet memory climbs up a lot and I then only restart kubelet, that will cure the memory problem but the node will crash later anyway. Probably due to cgroups not being cleaned up? So I keep rebooting the nodes when the problem occurs. That then usually helps for that node for about 3 days or so.

msteenhu commented 1 year ago

We see same or similar problems in a 4.9 production cluster: the most busy (= constantly creating containers due to crons) bare metal node crashes hard after a few months of operation. We can postpone this using the workaround that cleans up cgroups (and other things) but eventually the node still crashes..

So guess the workaround is draining an rebooting the most busy node every few months until we can upgrade to a stable OKD release, hopefully soon.

kai-uwe-rommel commented 1 year ago

BTW, does anyone see this problem also on OCP 4.11?

Gilthoniel commented 1 year ago

Hi, We're running 4.11.0-0.okd-2022-07-29-154152 and can confirm that this issue is happening for us.

kai-uwe-rommel commented 1 year ago

(I was asking for OCP = OpenShift 4.11.)

Tsimon-Dorakh commented 1 year ago

Experiencing accidental memory consumption growth by kubelet after upgrading to 4.11.0-0.okd-2022-10-15-073651 and switching on "systemd.unified_cgroup_hierarchy=0". It happens randomly on different nodes

vrutkovs commented 1 year ago

~~Right, that seems to be cgroupsv2 related then~~

So, these are graphs for cgroupv1 node upgraded to latest stable and no changes performed? Is the pattern the same if you switch to cgroupsv2?

Tsimon-Dorakh commented 1 year ago

Not sure. Need to switch back to cgroupsv2 and run for a few days. Will try to do it a bit later

msteenhu commented 1 year ago

BTW, does anyone see this problem also on OCP 4.11?

Let's see for ourself ;-)

"You have 60 days left to try Red Hat® OpenShift® Container Platform."

Tsimon-Dorakh commented 1 year ago

After removing "systemd.unified_cgroup_hierarchy=0" from MachineConfigPool (switching back to default cgroup v2) no memory issues so far. @vrutkovs is there a chance it's fixed in 4.11.0-0.okd-2022-10-28-153352?

vrutkovs commented 1 year ago

Its unlikely, no kubelet/systemd bump:

Upgraded packages

``` Upgraded: NetworkManager 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36 NetworkManager-cloud-setup 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36 NetworkManager-libnm 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36 NetworkManager-ovs 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36 NetworkManager-team 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36 NetworkManager-tui 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36 amd-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36 bash 5.2.2-1.fc36 -> 5.2.2-2.fc36 btrfs-progs 5.18-1.fc36 -> 6.0-1.fc36 catatonit 0.1.7-5.fc36 -> 0.1.7-10.fc36 chrony 4.2-5.fc36 -> 4.3-1.fc36 conmon 2:2.1.4-2.fc36 -> 2:2.1.4-3.fc36 coreos-installer 0.16.0-1.fc36 -> 0.16.1-2.fc36 coreos-installer-bootinfra 0.16.0-1.fc36 -> 0.16.1-2.fc36 dbus 1:1.14.0-1.fc36 -> 1:1.14.4-1.fc36 dbus-common 1:1.14.0-1.fc36 -> 1:1.14.4-1.fc36 dbus-libs 1:1.14.0-1.fc36 -> 1:1.14.4-1.fc36 ethtool 2:5.19-1.fc36 -> 2:6.0-1.fc36 fedora-release-common 36-18 -> 36-20 fedora-release-coreos 36-18 -> 36-20 fedora-release-identity-coreos 36-18 -> 36-20 fuse-overlayfs 1.9-1.fc36 -> 1.9-6.fc36 git-core 2.37.3-1.fc36 -> 2.38.1-1.fc36 glibc 2.35-17.fc36 -> 2.35-20.fc36 glibc-common 2.35-17.fc36 -> 2.35-20.fc36 glibc-minimal-langpack 2.35-17.fc36 -> 2.35-20.fc36 gnutls 3.7.7-1.fc36 -> 3.7.8-2.fc36 grub2-common 1:2.06-53.fc36 -> 1:2.06-54.fc36 grub2-efi-x64 1:2.06-53.fc36 -> 1:2.06-54.fc36 grub2-pc 1:2.06-53.fc36 -> 1:2.06-54.fc36 grub2-pc-modules 1:2.06-53.fc36 -> 1:2.06-54.fc36 grub2-tools 1:2.06-53.fc36 -> 1:2.06-54.fc36 grub2-tools-minimal 1:2.06-53.fc36 -> 1:2.06-54.fc36 intel-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36 kernel 5.19.14-200.fc36 -> 5.19.16-200.fc36 kernel-core 5.19.14-200.fc36 -> 5.19.16-200.fc36 kernel-modules 5.19.14-200.fc36 -> 5.19.16-200.fc36 libidn2 2.3.3-1.fc36 -> 2.3.4-1.fc36 libksba 1.6.0-3.fc36 -> 1.6.2-1.fc36 libmaxminddb 1.6.0-2.fc36 -> 1.7.1-1.fc36 libsmbclient 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36 libwbclient 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36 libxml2 2.9.14-1.fc36 -> 2.10.3-1.fc36 linux-firmware 20220913-140.fc36 -> 20221012-141.fc36 linux-firmware-whence 20220913-140.fc36 -> 20221012-141.fc36 nvidia-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36 procps-ng 3.3.17-4.fc36 -> 3.3.17-4.fc36.1 qemu-guest-agent 2:6.2.0-15.fc36 -> 2:6.2.0-16.fc36 rpm-ostree 2022.13-1.fc36 -> 2022.14-1.fc36 rpm-ostree-libs 2022.13-1.fc36 -> 2022.14-1.fc36 rsync 3.2.6-1.fc36 -> 3.2.7-1.fc36 runc 2:1.1.3-1.fc36 -> 2:1.1.4-1.fc36 samba-client-libs 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36 samba-common 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36 samba-common-libs 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36 skopeo 1:1.9.2-1.fc36 -> 1:1.10.0-3.fc36 ssh-key-dir 0.1.3-2.fc36 -> 0.1.4-1.fc36 tzdata 2022d-1.fc36 -> 2022e-1.fc36 unbound-libs 1.16.3-1.fc36 -> 1.16.3-2.fc36 vim-data 2:9.0.475-1.fc36 -> 2:9.0.803-1.fc36 vim-minimal 2:9.0.475-1.fc36 -> 2:9.0.803-1.fc36 xmlsec1 1.2.33-2.fc36 -> 1.2.33-3.fc36 xmlsec1-openssl 1.2.33-2.fc36 -> 1.2.33-3.fc36 ```

My home cluster has been fairly stable last release too (with /var/run -> /run workaround), so lets wait for more reports

msteenhu commented 1 year ago

I am testing a lot of 'hello' cronjobs in my home Openshift 4.11.9 cluster. Time will tell if my little VMs will remain stable.

But the first thing that stands out for me comparing latest Openshift versus OKD is the crio version: '1.24.3-4.rhaos4.11.git0e72422.el8' versus 1.24.0.

Any reason why OKD lags behind? If I check other components (kernel, runc) OKD is usually ahead. Should study the changelogs but it seems obvious to me that a .3 will have fixed some bugs compared to the .0.

[core@master1 ~]$ crio version && runc -v
INFO[2022-11-01 04:51:22.893623628Z] Starting CRI-O, version: 1.24.3-4.rhaos4.11.git0e72422.el8, git: ()
Version:          1.24.3-4.rhaos4.11.git0e72422.el8
GoVersion:        go1.18.4
Compiler:         gc
Platform:         linux/amd64
Linkmode:         dynamic
BuildTags:        exclude_graphdriver_devicemapper, containers_image_ostree_stub, seccomp, selinux
SeccompEnabled:   true
AppArmorEnabled:  false
runc version 1.1.2
spec: 1.0.2-dev
go: go1.18
libseccomp: 2.5.2
[core@mec-okdtest-master-01 ~]$ crio version && runc -v
INFO[2022-11-01 04:51:37.165262983Z] Starting CRI-O, version: 1.24.0, git: ()
Version:          1.24.0
GoVersion:        go1.18
Compiler:         gc
Platform:         linux/amd64
Linkmode:         dynamic
BuildTags:        seccomp, selinux
SeccompEnabled:   true
AppArmorEnabled:  false
runc version 1.1.4
spec: 1.0.2-dev
go: go1.18.7
libseccomp: 2.5.3

vrutkovs commented 1 year ago

OKD uses CRI-O module from Fedora repos. There are newer builds in Koji, which were never submitted to Bodhi and never got to actual repos (@LorbusChris, should we revert this? This clearly makes maintainers do more work and now OKD lags behind, https://github.com/openshift/okd-machine-os/pull/468)

LorbusChris commented 1 year ago

cri-o 1.24.3 is in F36 now, so this should land in the next machine-os-conent.

msteenhu commented 1 year ago

It seems Openshift 4.11.9 is more stable than the latest OKD. Openshift has a smaller load but should be similar enough. I run in both the cleanup DaemonSet, also seems to remove CRIO stuff in Openshift, but it does not find left behind cgroups, which is a clear difference with OKD in my clusters.

Openshift did crash (nodes rebooted) with 10 'hello' cron jobs after some hours but that was maybe just too much for the 3 small master+worker test VMs.

Running 3 'hello' cron jobs I get these 'system slice' memory usage graphs of the node running the short lived 'hello' pods.

Openshift vs OKD system slice memory usage:

kai-uwe-rommel commented 1 year ago

I have a test/demo cluster with OCP 4.10 available and thought about upgrading it to 4.11 to see what happens. However, Red Hat also seems to not recommend upgrading OCP to 4.11?

grafik

There does not seem to be anything wrong with any MCP at least not as far as I can see.

LorbusChris commented 1 year ago

Possibly also related https://github.com/coreos/fedora-coreos-tracker/issues/1330

msteenhu commented 1 year ago

Latest 4.11 includes CRIO 1.24.3 but that does not seem to make a difference for my test OKD cluster: cgroups still seem to be left behind (maybe because it is using cgroups v2?) and memory slowly but surely fills up running 3 'hello' cron jobs.

oc get nodes -o wide | tail -n1 | awk '{print $8,$9,$10,$11,$12}'
Fedora CoreOS 36 6.0.5-200.fc36.x86_64 cri-o://1.24.3

kai-uwe-rommel commented 1 year ago

I was now able to updated the cluster mentioned above (update block vanished away) two days ago to OCP 4.11.9. I have not seen the problem on this cluster yet, but it's only been 2 days so far.

kai-uwe-rommel commented 1 year ago

The next OKD release of 4.11, 2022-11-05 is there. Same Kubernetes version, so it's not supposed to fix this problem yet, either?

msteenhu commented 1 year ago

The next OKD release of 4.11, 2022-11-05 is there. Same Kubernetes version, so it's not supposed to fix this problem yet, either?

Nope, my previous reply was talking about latest version..

tyronewilsonfh commented 1 year ago

OKD version: 4.11.0-0.okd-2022-11-05-030711

Have been testing effect on memory usage with cgroups v1 and cgroups v2 Below steps done so far:

cgroups v2 machineconfig applied at ~12:00 (https://docs.okd.io/4.11/post_installation_configuration/machine-configuration-tasks.html#nodes-nodes-cgroups-2_post-install-machine-configuration-tasks)

cronjob started at ~14:00 which scales an alpine deployment between 100 and 0 replicas every minute (targets a single node)

cgroups v2 machine config deleted at ~17:00, rebooted node anytime it was becoming unresponsive

cgroups v2 machine config re-added ~22:00

Restarted node this morning and halved cronjob test values to see how this is after a longer period of time

From all the suggestions here, switching to cgroups v2 seems to have had most impact so far for me with this specific test, will keep this test going for a week and check if I still have to restart other nodes in this cluster now they are all using cgroups v2

vrutkovs commented 1 year ago

Excellent, thanks. Could you check if a workaround from https://github.com/coreos/fedora-coreos-tracker/issues/1330#issuecomment-1299330024 (disabling blkio controller)?

llomgui commented 1 year ago

Excellent, thanks. Could you check if a workaround from coreos/fedora-coreos-tracker#1330 (comment) (disabling blkio controller)?

Should be something like that right?

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-disable-blkio
spec:
  config:
    ignition:
      version: 3.3.0
    shouldExist:
      - systemd.unified_cgroup_hierarchy=0
    systemd:
      units:
        - contents: |
            [Slice]
            DisableControllers=blkio
          name: "-.slice"

kai-uwe-rommel commented 1 year ago

The problem affects master nodes too, so the MC would need to be applied to these as well.

msteenhu commented 1 year ago

All my OKD clusters (4.9, 4.10 and 4.11) have cgroups v2. Can't seem to enable v1 on latest 4.11 test cluster with kernel 6... How is it possible others are running v1? Older machineconfigs that get dragged along with upgrades. Can't remember I did something custom that might affect this.

mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)

llomgui commented 1 year ago

All my OKD clusters (4.9, 4.10 and 4.11) have cgroups v2. Can't seem to enable v1 on latest 4.11 test cluster with kernel 6... How is it possible others are running v1? Older machineconfigs that get dragged along with upgrades. Can't remember I did something custom that might affect this.
mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)

If you have java applications running on an old version. It does not support cgroup v2. You have to force the use of cgroup v1 using this MC:


apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-openshift-machineconfig-worker-kargs
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=0
EOF

bergerst commented 1 year ago

@msteenhu We managed to keep cgroupsv1 with the solution from #1002

msteenhu commented 1 year ago

With cgroupsv1 and that 'no blkio' '-.slice' file my 4.11 test cluster also leaks memory. Steeper curve comparing with cgroups v2 indeed but leaking it does, unfortunately.

tyronewilsonfh commented 1 year ago

Excellent, thanks. Could you check if a workaround from coreos/fedora-coreos-tracker#1330 (comment) (disabling blkio controller)?

Have applied the below and tested this afternoon for few hours


kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-disable-blkio
spec:
  config:
    ignition:
      version: 3.2.0
    shouldExist:
      - systemd.unified_cgroup_hierarchy=0
    systemd:
      units:
        - contents: |
            [Slice]
            DisableControllers=blkio
          name: "-.slice"```

Still see

unable to destroy cgroup paths for cgroup [kubepods besteffort podd0de
061d-728e-4305-bb31-5f009ad1fe77] : Timed out while waiting for systemd to remove kubepods-besteffort-podd0de061d_728e_4305_bb31_5f009ad1fe77.slice"

marqsbla commented 1 year ago

I made some tests, I would like to share. Good news is that I confirm what said @tyronewilsonfh, that cgroups v2 are (at least partial) solution. I will try to be brief.

Test case: I run 10 parallel "hello" cronjobs every minute on a single node and observe what happens.

Nevertheless of version (including the latest 4.11.0-0.okd-2022-11-05), something crashes within the node, and after 1.5-2.5hours the pods stop creating (they are in container creating state) and finally kubelet eats all the memory and node is unusable. By default (without any changes) I have mounted cgroups v1 and v2.

What I did to try to overcome this problem (read from this issue):

I deployed daemonset with garbage collector, but cgroups are left behind and node crashes
I changed kubelet runtime argument /var/run -> /run, stil crashes in the same manner (including the newest okd version)
finally I switched to cgroups v2 and the node is more stable; after 20h, the cronjobs are still running, but creationg take a lot of time (up to 1m in comparison to few seconds before).

Few observations of the solution with cgroup v2:

kubelet has plenty of errors still (after this 20h)

sudo journalctl -u kubelet --since -10m | grep -i error | wc -l 
100833
sudo journalctl -u kubelet --since -10m | grep -i error | tail -1
Nov 10 10:59:30 dcw1-xyz-infra-0-jnkk8.novalocal hyperkube[366101]: W1110 10:59:30.696142  366101 container.go:589] Failed to update stats for container "/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod001b9cfe_5a2e_4d0e_9405_a208719380b4.slice/crio-c6e03c8c2faa6e96b873919e55c2089474a703d1f081b9ce457664747616edfd.scope": unable to determine device info for dir: /var/lib/containers/storage/overlay/b8f6d7526de2280ee997c5b826ce6e08888d5e72176712b4caffb8f00298cc53/diff: stat failed on /var/lib/containers/storage/overlay/b8f6d7526de2280ee997c5b826ce6e08888d5e72176712b4caffb8f00298cc53/diff with error: no such file or directory, continuing to push stats
$ find /sys/fs/cgroup/ -name kubepods-burstable-pod001b9cfe_5a2e_4d0e_9405_a208719380b4.slice
/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod001b9cfe_5a2e_4d0e_9405_a208719380b4.slice

cgroups are left behind ind "pids" directory:

$ ls -la /sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/ | wc -l 
11971

the memroy utilised is constatly rising on that node, but much less steep and doesn't crash the node

If I have time (after the weekend), I will try to improve the garbage collector to remove left over cgroups

PS. in comparison cgroup files left over, when using cgroups v1:

journalctl -u kubelet --since -10m | grep error | tail -1
Nov 10 10:44:58 dcw1-xyz-worker-1-zxy.novalocal hyperkube[2139]: W1110 10:44:58.392021    2139 container.go:589] Failed to update stats for container "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice/crio-483470c57fd83305e8850d78f250d21cab761a2b3e3b88fa243d3b8a06341d8f.scope": unable to determine device info for dir: /var/lib/containers/storage/overlay/6cada6086eec3951850fbf978ad8b72f848d4158450c58344723a4d95d65a779/diff: stat failed on /var/lib/containers/storage/overlay/6cada6086eec3951850fbf978ad8b72f848d4158450c58344723a4d95d65a779/diff with error: no such file or directory, continuing to push stats
$ find /sys/fs/cgroup/ -name kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/misc/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice
/sys/fs/cgroup/unified/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podbae74a3e_be8f_4340_ac5e_16823909ab56.slice

kai-uwe-rommel commented 1 year ago

Thanks for the detailed report and - basically - the confirmation that we do not really have any reliable workaround or even solution. So we still wait for a solution of the problem at its root (in the Kubernetes code?). Regarding the cgroup garbage collector script/daemonset - I briefly tried it on an affected cluster here but the daemonset pods do not even start, due to a rights issue. But did not have time to dig further. Do the pods start for you or perhaps this is also for you the cause why it does not help (and you did not notice yet)?

marqsbla commented 1 year ago

Thanks for the detailed report and - basically - the confirmation that we do not really have any reliable workaround or even solution. So we still wait for a solution of the problem at its root (in the Kubernetes code?). Regarding the cgroup garbage collector script/daemonset - I briefly tried it on an affected cluster here but the daemonset pods do not even start, due to a rights issue. But did not have time to dig further. Do the pods start for you or perhaps this is also for you the cause why it does not help (and you did not notice yet)?

They start. I don't remember what I did to make it so. The pods needs to run as priviledged. Maybe something is blocking it? Probably SA, or namespace restrictions?

BTW. I tried quickly to run GC, by my script, but I cannot delete cgroups. My experience is limited on this, so probably I have to read more:

journalctl -u kubelet --since -10m  | grep "Failed to update stats for container" | grep -oE "kubepods-burstable-pod[^/]{10,}slice" | sed "s/.slice//" | xargs -i rm -rf "/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/{}.slice"

aneagoe commented 1 year ago

Thanks for the detailed report and - basically - the confirmation that we do not really have any reliable workaround or even solution. So we still wait for a solution of the problem at its root (in the Kubernetes code?). Regarding the cgroup garbage collector script/daemonset - I briefly tried it on an affected cluster here but the daemonset pods do not even start, due to a rights issue. But did not have time to dig further. Do the pods start for you or perhaps this is also for you the cause why it does not help (and you did not notice yet)?

@kai-uwe-rommel did you run the entire manifest as described under https://gist.github.com/aneagoe/6e18aaff48333ec059d0c1283b06813f? There are required rolebindings and SAs there, under https://gist.github.com/aneagoe/6e18aaff48333ec059d0c1283b06813f#file-permissions-yaml. @marqsbla I'd suggest reviewing the script in the above gist... I was able to run it manually on the nodes as well but after creating the daemonset I haven't bothered re-checking the manual runs. What errors are you getting? It would be interesting to improve the daemonset; I've only tested it on OKD 4.9 and 4.10.

kai-uwe-rommel commented 1 year ago

Yes, I did run all four. Interestingly, I just did it again (had deleted them last time after getting the errors) and now it works!

framelnl commented 1 year ago

Hey, i have been silently following this thread for a bit now as i have also been having issues with nodes going down with the same type of memory signature.

My cluster(okd4.11) does not run a whole lot of cronjobs (except for the logging stack). so nodes running out of cpu/memory do not happen (that) often. What i did notice on my cluste ris a lot of "kubelet_getters.go:300] "Path does not exist" path="/var/lib/kubelet/pods/${POD}/volumes". logs. From what i could gather from the issue referred to here it is a fairly harmless log. (https://github.com/kubernetes/kubernetes/issues/112124)

After doing some analysis of my own this weekend i found out that is not the case. I think particular log is pointing to the origin of the nodes going down (at least for me).

I think what happens is the following:

A pod is terminated (because of a normal cause, like a job that has run its course to competion).
After the pod is shut down its cleaned up by the housekeeping loop. Among other things, the housekeeping loop cleans up the directories and cgroups related to the pod. Cleanup up the cgroups is done through the dbus (systemd).
For some reason (i haven't found out why yet) it fails to cleanup all cgroups (this was also already reported in this issue). (note, on loglevel 2 this fails silently).
the next housekeeping loop, it tries to clean the directies (again). because the directory is (already) gone it wil log the message: "kubelet_getters.go:300] "Path does not exist" path="/var/lib/kubelet/pods/${POD}/volumes"
it tries to delete the cgroups again, and again it silently fails.
GOTO step 4.

my hypothesis is this either

eventually leads to systemd getting hammered with requests to stop cgroups. which in turn leads to systemd being overloaded and taking more time to respond to requests. At some point systemd will start to take longer then 30 seconds. i believe this to be the point where both the number of cpu cycles and memory starts to rapidly increase until the node runs out of memory and the oomkiller starts shutting things down.
or, the housekeeping loop takes longer and longer and keeps the kublet locked up. the housekeeping loop has a timeout of 15 seconds. once this is exceeded i expect the rapid increase of cpu and memory (due to retries).

I have created a simpe function to cleanup these particular cgroups and added them alongside the garbage-collector that was listed earlier. I now see a whole lot less "path does not exist errors".

i've forked the garbage-collector and added my function to it: https://gist.github.com/framelnl/63387ac66893cb8e856ad8a94bfccea0

Also, i also use the oc debug node container in the daemonset as registry.redhat.io requires a login.

kai-uwe-rommel commented 1 year ago

@framelnl, thanks for the detailed analysis. For the two clusters where I have the problem I do also see high CPU usage not only for kubelet but also for systemd on affected nodes, which aligns with your observations. What kills the nodes is then the memory "explosion" of the kubelet. See my comment of 2 days ago, I did then also install the garbage-collector from @aneagoe but since then I did already still have one node with such a kubelet memory runaway, so it did not really help yet. I will try your fork to see if your change makes a difference.

kai-uwe-rommel commented 1 year ago

@framelnl, regarding the image used for the daemonset ... to not depend on any external image at all, I usually use one that is always already present in the cluster's internal registry: "image-registry.openshift-image-registry.svc:5000/openshift/cli". As the name says, it contains the oc CLI, otherwise seems to be the standard RHEL 8 UBI.

alexminder commented 1 year ago

With cgroup v2 (systemd.unified_cgroup_hierarchy=1) no more kubelet oom_kills, cluster (okd 4.12) is stable for week now.

msteenhu commented 1 year ago

For me the stock option cgroups v2 also seems to be "stable", I am running 3 hello cron jobs for 1,5 weeks now in the latest OKD 4.11. I also have to run the cleanup daemonset to achieve this. But I do not find "Path does not exist" in my logs. So the addition from @framelnl is not needed in my setup, weirdly.

I quoted stable because the memory usage still seems to increase on the hosts, in the system slice. Can anyone confirm the following: 'systemd-journald' seems to have a memory leak. I restarted the service which decreased memory usage of the corresponding cgroup from +2G to <100M, after which it builds up again. The system slice still reports using the memory but host memory usage builld up seems to slow down or stop until journald is back to its previous level.

What's puzzling me the most is the differences in behaviour people are reporting. How can this ever be fixed? There are definitely multiple bugs at play in crio(?), kubelet(?), systemd(?). I believe it is important to mention as much details as possible: OKD version, if you use cgroups 2 (stock) or 1 (maybe you changed it for legacy Java versions), ...

marqsbla commented 1 year ago

For me the stock option cgroups v2 also seems to be "stable", I am running 3 hello cron jobs for 1,5 weeks now in the latest OKD 4.11. I also have to run the cleanup daemonset to achieve this. But I do not find "Path does not exist" in my logs. So the addition from @framelnl is not needed in my setup, weirdly.

I quoted stable because the memory usage still seems to increase on the hosts, in the system slice. Can anyone confirm the following: 'systemd-journald' seems to have a memory leak. I restarted the service which decreased memory usage of the corresponding cgroup from +2G to <100M, after which it builds up again. The system slice still reports using the memory but host memory usage builld up seems to slow down or stop until journald is back to its previous level.

What's puzzling me the most is the differences in behaviour people are reporting. How can this ever be fixed? There are definitely multiple bugs at play in crio(?), kubelet(?), systemd(?). I believe it is important to mention as much details as possible: OKD version, if you use cgroups 2 (stock) or 1 (maybe you changed it for legacy Java versions), ...

@msteenhu thanks for info. Did you face any problem when switching to cgroup v2, for instance with masters? Logging stack? I restrained myself from switching to cgroup v2 on the whole cluster before confirming that everything works :).

Did you make any modifications to garbage collector, or you use this one? I didn't have time to debug if it could be improved. I'm wondering if it works smoothly with cgroup v2...

msteenhu commented 1 year ago

@msteenhu thanks for info. Did you face any problem when switching to cgroup v2, for instance with masters? Logging stack? I restrained myself from switching to cgroup v2 on the whole cluster before confirming that everything works :).

Did you make any modifications to garbage collector, or you use this one? I didn't have time to debug if it could be improved. I'm wondering if it works smoothly with cgroup v2...

Well, AFAIK cgroups 2 is the norm with OKD since 4.9:

[root@mec-okdtest-worker-01 ~]# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)

So you installed a MachineConfig to switch it back to cgroups v1 then? I use the mentioned garbage collector already since 4.9. I just did a small change to kill the log spamming (in our production OKD 4.9) from the DaemonSet. @aneagoe helped me out after a question in #openshift-users on the kubernetes Slack by putting his solution on Github. All that time my clusters were running cgroups v2 so my guess is that the garbage collector works best with cgroups v2(?).

A typical GC run on latest 4.11 - cgroups v2, so it definitely seems to work:

2022-11-09T15:15:21+00:00 Starting k8s garbage collector run...
  2022-11-09T15:15:25+00:00 Found POD hello2-27800112-5ttmg unknown to k8s control plane and without any PIDs, will delete it...
  Stopped sandbox 279c4b29513a225ea6cb59e34b947a8cc7940dc857e34cd97196d6bfe0581638
  Removed sandbox 279c4b29513a225ea6cb59e34b947a8cc7940dc857e34cd97196d6bfe0581638
  2022-11-09T15:15:26+00:00 Found POD hello-27800112-r9vwm unknown to k8s control plane and without any PIDs, will delete it...
  Stopped sandbox 2d349422eb994d54b568e1edf4ef4c4803ccd105732b989ae13cbf7a5791f215
  Removed sandbox 2d349422eb994d54b568e1edf4ef4c4803ccd105732b989ae13cbf7a5791f215
  2022-11-09T15:15:26+00:00 Found POD hello3-27800112-crltx unknown to k8s control plane and without any PIDs, will delete it...
  Stopped sandbox 52bdcc392df33c68546b512f866362d08c47ecf386dcafbbaf5d51221e8bc25b
  Removed sandbox 52bdcc392df33c68546b512f866362d08c47ecf386dcafbbaf5d51221e8bc25b
  2022-11-09T15:15:27+00:00 Found POD hello3-27800111-fbksd unknown to k8s control plane and without any PIDs, will delete it...
  Stopped sandbox b73e6^C1-09T15:15:29+00:00 Found POD hello2-27800110-sj6vf unknown to k8s control plane and without any PIDs, will delete it...
  Stopped sandbox 87111417a7b6f10be47305f44ada337be80d98999a5241c16a2d9484840d21ff
  Removed sandbox 87111417a7b6f10be47305f44ada337be80d98999a5241c16a2d9484840d21ff
  2022-11-09T15:15:29+00:00 Found POD hello-27800110-g6hxw unknown to k8s control plane and without any PIDs, will delete it...
  Stopped sandbox c5d265e6cf53a1f231668195551969f841b1ba42b7e2f9b99e591bad2f31e773
  Removed sandbox c5d265e6cf53a1f231668195551969f841b1ba42b7e2f9b99e591bad2f31e773
  2022-11-09T15:15:30+00:00 Found POD collect-profiles-27800070-pfxwr unknown to k8s control plane and without any PIDs, will delete it...
  Stopped sandbox 29c5ed5502c267b909bea7993ea3d42068fdc2394ed6ca4aa545dffa0269a1ce
  Removed sandbox 29c5ed5502c267b909bea7993ea3d42068fdc2394ed6ca4aa545dffa0269a1ce
  2022-11-09T15:15:39+00:00 Removing CGROUP crio-04c4ec60ed4c9995c21827b1ecd3988cf2c85a5bb51510cd637f0620cda78a10.scope and its parent...
  2022-11-09T15:15:39+00:00 Removing CGROUP crio-05839c28cb1afc376921ed0691efb783fe673d25310c39c99e1fc219097a7f4d.scope and its parent...
  2022-11-09T15:15:39+00:00 Removing CGROUP crio-0f123fe8e191c91960b50c4ba2f471cd09b03dd6198eb275d33949021d366273.scope and its parent...
  2022-11-09T15:15:39+00:00 Removing CGROUP crio-1761f14917570cff11bc3469edd753372ad7bf3e5f02cf228dc6ca706e2be683.scope and its parent...
  2022-11-09T15:15:39+00:00 Scope crio-279c4b29513a225ea6cb59e34b947a8cc7940dc857e34cd97196d6bfe0581638.scope found under running pod, skipping...
  2022-11-09T15:15:39+00:00 Removing CGROUP crio-28ab0b0619fdcd71e34a1a7c30121c1f438a6b17dc5193ff1c318088323aa007.scope and its parent...
  2022-11-09T15:15:39+00:00 Removing CGROUP crio-2c507308215ecab36d126e905aa02658d27cc46d5df8c62cf97c2ed40bf7f4fa.scope and its parent...
  2022-11-09T15:15:39+00:00 Removing CGROUP crio-2d0b85008d03bfd368113a67c9888fe2dda7a83855b37d3a2f657434be061d99.scope and its parent...
  2022-11-09T15:15:40+00:00 Removing CGROUP crio-2df59b3f717089b999dbdfe5670d8d8e667100d979498a63a174ded370d91ded.scope and its parent...

And to answer your question: We had no problems with cgroups v2. Only one Java based app (Apache Guacamole) suddenly was forking threads like crazy exhausting the max_pids limit. We increased that limit not realising it was a Java bug at that time. Everything else works fine.

framelnl commented 1 year ago

@msteenhu to clarify, i run my cluster with cgroups v1 as i upgraded it from 4.8.

marqsbla commented 1 year ago

@msteenhu I think I originally installed 4.7 and updated since then. I also have cgroups v1 in OKD 4.11 as a default.

Ok, good to know! Thanks

kai-uwe-rommel commented 1 year ago

Will the default be different if a cluster was upgraded from previous OKD versions? On my test cluster:

[root@master-03 ~]# mount|fgrep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,size=4096k,nr_inodes=1024,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)

So upgraded and fresh clusters are really different? What is your advice how to "convert" upgraded clusters?

msteenhu commented 1 year ago

Will the default be different if a cluster was upgraded from previous OKD versions? On my test cluster:

So upgraded and fresh clusters are really different? What is your advice how to "convert" upgraded clusters?

Depending on your history there might be MachineConfig(s) in place influencing the Cgroups mount. You could verify for yourself with a command like this:

oc get mc -o yaml | grep kernelArguments -A10 | grep cgroup

kai-uwe-rommel commented 1 year ago

Thanks. No such MC found, so the cluster is on defaults. And this is apparently kind of "mixed".

marqsbla commented 1 year ago

@kai-uwe-rommel It looks for me that you have cgroup v2.

In my cluster it looks like this:

mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,size=4096k,nr_inodes=1024,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,freezer)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,perf_event)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,memory)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,devices)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuset)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,pids)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,blkio)
cgroup on /sys/fs/cgroup/misc type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,misc)

kai-uwe-rommel commented 1 year ago

It is really a bit unclear ... my test cluster seems to have both. But if you says it's using cgroup v2 - it still has the kubelet problem. While msteenhu reports that with cgroup v2 his cluster is running fine.

iklcp commented 1 year ago

I also had the kubelet memory leak + oomkiller on my cluster since the upgrade from 4.10 -> 4.11. (on random nodes, both worker and control). The garbarge collector patch alone did not change anything. Enabling cgroups v2 seems to have fixed the issue for me. The cluster was initial installed as 4.7.

okd-project / okd

[4.11] Kubelet consumes a lot of CPU and hangs after running a lot of cronjobs #1310