okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

[4.11] Kubelet consumes a lot of CPU and hangs after running a lot of cronjobs #1310

Closed yaroslavkasatikov closed 1 year ago

yaroslavkasatikov commented 2 years ago

[vrutkovs] See below for thread summary

Hi team, after upgrading to 4.11 we ran into a new issue:

We use cronjobs with a `* * * * *` schedule. After upgrading, some pods created by the cronjobs get stuck in "ContainerCreating" or "Init 0/1" status. In the pod describe output we can see:

  Events:
  Type     Reason                    Age  From               Message
  ----     ------                    ---  ----               -------
  Normal   Scheduled                 51s  default-scheduler  Successfully assigned 0xbet-prod/podname7673773-q6bq5 to ip-10-0-216-195.eu-central-1.compute.internal by ip-10-0-201-118
  Warning  FailedCreatePodContainer  9s   kubelet            unable to ensure pod container exists: failed to create container for [kubepods burstable pod3847723b-b7c8-4adc-a9d7-f3cdb83ae03f] : Timeout waiting for systemd to create kubepods-burstable-pod3847723b_b7c8_4adc_a9d7_f3cdb83ae03f.slice
  Normal   AddedInterface            multus             Add eth0 [10.133.10.218/23] from ovn-kubernetes
  Normal   Pulled                    kubelet            Container image "ghcr.io/banzaicloud/vault-env:1.13.0" already present on machine
  Normal   Created                   kubelet            Created container copy-vault-env
  Normal   Started                   kubelet            Started container copy-vault-env
  ...

The symptom is that pod scheduling gets slower and slower, and after some time pods get stuck right after `Normal Scheduled 51s default-scheduler Successfully assigned 0xbet-prod/podname7673773-q6bq5 to ip-10-0-216-195.eu-central-1.compute.internal by ip-10-0-201-118`.

While logged in to the node, I can see this in journalctl:

Aug 13 21:38:49 ip-10-0-216-195 hyperkube[1546]: I0813 21:38:49.823654 1546 pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods burstable podbcc994a0-6720-48fc-889b

Aug 13 21:38:49 ip-10-0-216-195 hyperkube[1546]: I0813 21:38:49.834117 1546 pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods burstable pod36584156-6776-410f-861b

Aug 13 21:38:49 ip-10-0-216-195 hyperkube[1546]: I0813 21:38:49.835958 1546 pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods burstable pod24ca3b44-6fb3-47be-b111

Aug 13 21:38:49 ip-10-0-216-195 hyperkube[1546]: I0813 21:38:49.837512 1546 pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods burstable pod23e81c45-5fd2-4e7a-975c

Rebooting helps, but not for long. As a result, pods can't be scheduled and get stuck across the whole cluster.


Short thread summary.

What we know so far:

Probable cause:

Workaround: https://github.com/okd-project/okd/issues/1310#issuecomment-1312848841 - thanks to @framelnl there's a DaemonSet which can clean up the extra cgroups

Upstream issue refs:

yaroslavkasatikov commented 2 years ago

https://dropmefiles.com/6nn0s must-gather

vrutkovs commented 2 years ago

Also noticed this on kubelet 1.24 and the situation gets worse the more pods are running.

It's not OKD-specific; it seems to be a dupe of https://issues.redhat.com/browse/OCPBUGSM-39381

vrutkovs commented 2 years ago

This is most likely a runc regression. Could you check if switching to crun helps (make sure you apply it to all MachineConfigPools)? Example MachineConfig

Seems we'd need runc 1.1.3, which has two systemd/cgroups fixes
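Since the linked example isn't reproduced in this thread, here is a minimal sketch of what such a MachineConfig could look like (the object name, ignition version, and the assumption that crun is already installed at /usr/bin/crun on the nodes are mine, not from the original example):

```sh
# Sketch only: make crun the default CRI-O runtime on the worker pool.
CONF='[crio.runtime]
default_runtime = "crun"

[crio.runtime.runtimes.crun]
runtime_path = "/usr/bin/crun"
'
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-crun
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/crio/crio.conf.d/99-crun.conf
          mode: 420
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,$(printf '%s' "$CONF" | base64 -w0)
EOF
# Create an equivalent object with role: master (and any custom pools)
# so every MachineConfigPool picks it up.
```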

yaroslavkasatikov commented 2 years ago

This is most likely a runc regression. Could you check if switching to crun helps (make sure you apply it to all MachineConfigPools)? Example MachineConfig

Seems we'd need runc 1.1.3, which has two systemd/cgroups fixes

@vrutkovs

Hi, Vadim! I have applied your MachineConfig and recreated the cron nodes.

I will report back with results.

yaroslavkasatikov commented 2 years ago

@vrutkovs Hi Vadim,

It seems it hasn't helped. It worked fine for 3 hours, but now one node has returned to the failed state:


  Type     Reason                    Age    From               Message
  ----     ------                    ----   ----               -------
  Normal   Scheduled                 4m55s  default-scheduler  Successfully assigned bank-prod/bank-cronjob-27674577-7wjld to ip-10-0-217-73.eu-central-1.compute.internal by ip-10-0-133-210
  Warning  FailedCreatePodContainer  3m16s  kubelet            unable to ensure pod container exists: failed to create container for [kubepods burstable pod103107db-62e6-4f67-b2fd-50a066f4af36] : Timeout waiting for systemd to create kubepods-burstable-pod103107db_62e6_4f67_b2fd_50a066f4af36.slice

yaroslavkasatikov commented 2 years ago
(Screenshot, 2022-08-14 14:10)

It seems that kube-rbac-proxy began to eat memory before the incident.

vrutkovs commented 2 years ago

Okay, it looks like kubelet is leaking memory, not runc.

mirek-benes commented 2 years ago

(screenshot)

The same issue here. Process:

root 2009 1 73 Aug12 ? 3-00:33:07 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=fedora --node-ip=192.168.110.111 --minimum-container-ttl-duration=6m0s --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-provider= --hostname-override= --provider-id= --pod-infra-container-image=quay.io/openshift/okd-content@sha256:c4e32171a302b1a0d21f936b795b9505b992404b6335bb7e63d3b1bddc0b91ab --system-reserved=cpu=500m,memory=1Gi --v=2

tyronewilsonfh commented 2 years ago

Hi, also experiencing the same issue in multiple test 4.11 clusters; we have ~30 production 4.10 clusters and have not seen this issue there at all.

(screenshot)

root 1458 74.9 72.8 22674508 11924036 ? Ssl Aug02 14736:44 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=fedora --node-ip=10.68.0.64 --minimum-container-ttl-duration=6m0s --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-provider= --hostname-override= --provider-id= --pod-infra-container-image=quay.io/openshift/okd-content@sha256:c4e32171a302b1a0d21f936b795b9505b992404b6335bb7e63d3b1bddc0b91ab --system-reserved=cpu=500m,memory=1Gi --v=2

The node with the memory increase starting ~2022-08-17 07:30 has a cronjob pinned to it, created at 2022-08-16 07:26, which runs every minute to echo "hello openshift" with busybox, to test if this was related to frequently running cronjobs.
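A minimal sketch of that kind of reproducer (node name, namespace, and image are placeholders, not the exact manifest used above):

```sh
cat <<'EOF' | oc apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-openshift
  namespace: default
spec:
  schedule: "* * * * *"                    # run every minute
  jobTemplate:
    spec:
      template:
        spec:
          nodeName: worker-0.example.com   # pin to one node to watch its kubelet
          restartPolicy: Never
          containers:
            - name: hello
              image: docker.io/library/busybox
              command: ["/bin/sh", "-c", "echo hello openshift"]
EOF
```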

The same errors as in the original post are seen while kubelet memory is increasing, e.g.:

pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods besteffort podacd321bc-4db6-468e-87d8-1e1e16c85a9a] err="unable to destroy cgroup paths for cgroup [kubepods besteffort podacd321bc-4db6-468e-87d8-1e1e16c85a9a] : Timed out while waiting for systemd to remove kubepods-besteffort-podacd321bc_4db6_468e_87d8_1e1e16c85a9a.slice"

vrutkovs commented 2 years ago

Switched to crun and I'm not seeing staggering memory growth here. It's probably only visible on nodes with many containers running?

yaroslavkasatikov commented 2 years ago

Switched to crun and I'm not seeing staggering memory growth here. It's probably only visible on nodes with many containers running?

As for me, I switched to crun when you suggested it and haven't rolled it back.

I reworked the application to get rid of the k8s cronjobs (packed them into a container with crontab) and the cluster became stable. So it seems the cause is not the number of running pods, but the number of starting pods.

I also noticed that, with a lot of cronjobs, node degradation progressed this way (from the k8s side):

1) Normal state: a container is scheduled to the node and starts in 1-3s.
2) Degradation starts: a container is scheduled to the node and starts in 20-60s.
3) Degraded state: a container is scheduled to the node and gets stuck.
4) After some time the affected node changes its status to NotReady. Pod eviction starts and all pods change their state to Terminating. This state can only be fixed by hard-rebooting the node or removing the machine.

During steps 1, 2 and 3 all containers already running on the node keep working fine.

titou10titou10 commented 2 years ago

It seems there are fixes related to this problem in the Kubernetes v1.24.4 release notes.

Specifically:

Currently OKD v4.11.0-0.okd-2022-07-29-154152 uses kubernetes v1.24.0+9546431

vrutkovs commented 2 years ago

Excellent, thanks. There's a PR open for 1.24.3 - https://github.com/openshift/kubernetes/pull/1326 - hopefully it will soon be superseded by a .4 bump and merged. Once that happens we'll pick it up in an okd-machine-os build and release a new stable.

vrutkovs commented 2 years ago

https://amd64.origin.releases.ci.openshift.org/releasestream/4-stable/release/4.11.0-0.okd-2022-08-20-022919 includes runc 1.1.3, which should have some fixes relevant to this kubelet issue.

tyronewilsonfh commented 2 years ago

(screenshot)

A cronjob pinned to one node scales an alpine deployment between 50 and 0 replicas and back, every minute. This test reproduces the problem within a few hours on 4.11 clusters; no problem was found on 4.10 clusters over multiple days.

The screenshot above is from a 4.11 cluster. The crun MachineConfig was added to the workers on 2022-08-21 (confirmed with `crio-status config`), the cluster was updated to 4.11.0-0.okd-2022-08-20-022919 at ~2022-08-21 12:00, and the crun MachineConfig was then removed from the workers (the rendered config shows creation 21 Aug 2022, 20:43). Neither crun nor runc 1.1.3 seems to have had any effect on the kubelet problem.

The kubelet version in 4.11.0-0.okd-2022-08-20-022919 is now v1.24.0+4f0dd4d.

Below is a screenshot from a 4.10 cluster running the same cronjob since 2022-08-19: (screenshot)

TL;DR: the new release doesn't fix the issue.

mlorentz75 commented 2 years ago

(Screenshot, 2022-08-23 08:27) It also seems to happen with AWX 21.x starting Ansible playbooks as pods. I see this every night, as a lot of jobs start after midnight. A node reboot fixed the memory usage. Also, no problems running AWX on OKD 4.10 clusters.

vrutkovs commented 2 years ago

1.24.4 bump - https://github.com/openshift/kubernetes/pull/1352

danielchristianschroeter commented 1 year ago

By the way, kubelet also consumes more and more CPU in the long run on OKD 4.10.0-0.okd-2022-07-09-073606 (vSphere IPI). You can easily reproduce it if you use, for example, the "Red Hat OpenShift Logging" operator, because it continuously executes cron tasks. Of course, the kubelet CPU load is nowhere near as severe as with OKD 4.11.0-0.okd-2022-08-20-022919. I think kubelet's CPU usage also depends a lot on how many pods are deployed on the cluster/node. (The overhead of vSphere IPI probably plays a part in this as well, compared to a bare-metal installation.) After restarting kubelet on a node, the overall node CPU usage drops from about 3 cores to 1. For us this is, besides a node restart, a workaround for this unwanted behavior.
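For reference, one way to restart kubelet on a single node without SSH access (a sketch; the node name is a placeholder, and draining the node first is safer):

```sh
oc debug node/worker-0.example.com -- chroot /host systemctl restart kubelet
```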

msteenhu commented 1 year ago

Known issue: kubelet with CRI-O does not clean up cgroups etc. when a pod gets deleted.

https://github.com/kubernetes/kubernetes/issues/106957 Fix: https://gist.github.com/aneagoe/6e18aaff48333ec059d0c1283b06813f

danielchristianschroeter commented 1 year ago

@vrutkovs is it possible to include this fix?

msteenhu commented 1 year ago

It is rather a workaround, so probably not fit to include in a release. My bad, wrong wording. It should also be solved in the next release, I hope; that is what the '1.24.4 bump' remark is about, I guess. The upstream issue is still open.

Just wanted to point out that this bug has been around for many months. I'm not sure it has always been the same bug, but it can be worked around with that nice work from Andrei.

vrutkovs commented 1 year ago

We won't be including a workaround officially - it's a kubelet problem and we're waiting for fixes to land; no need to diverge from upstream too far.

I don't mind including this workaround as an "official" mitigation recipe if it has been verified to work.

msteenhu commented 1 year ago

I can't say for sure it still works in 4.11, but it stabilizes our 'cronjob hungry' 4.9 and 4.10 clusters. I will test it in 4.11 and report back here.

msteenhu commented 1 year ago

It most definitely still cleans up a lot of left-over cgroups. Tail of the daemon running on a 4.11 node that runs 10 short-lived pods every minute:

2022-09-29T07:29:19+00:00 Removing CGROUP crio-f330b2f8f55951bb5f5445bcd6fa811bdca11447cdfd91b044647e3b846ba4ff.scope and its parent...
2022-09-29T07:29:19+00:00 Scope crio-f6d78fac4951a3296c7e6b118f9e67a0dcb56766005b5bef6776f4ea48652173.scope found under running pod, skipping...
2022-09-29T07:29:19+00:00 Removing CGROUP crio-f86a48b76ea5a734428e11aa18e3a4f3fc9dbfc3894008dcde1d539ac5b50438.scope and its parent...
2022-09-29T07:29:19+00:00 Scope crio-f8f49053255d505d7d6bd2f018dc1985ebf3c34aee673abdd640404db16ef070.scope found under running pod, skipping...
2022-09-29T07:29:19+00:00 Removing CGROUP crio-f8feb93e6aad0dc6b0d7c3dca881a3c5bdd601eff3a60a24a54e8c622abbfd57.scope and its parent...
2022-09-29T07:29:19+00:00 Removing CGROUP crio-fab2dc0351b831ca6b1395e2721b60d49e10fe043c2d75cbbaa26c9c70ee2ef1.scope and its parent...
2022-09-29T07:29:19+00:00 Removing CGROUP crio-fbc1d4ddc0707b6e8e06d4f87dbb6e8adf7c922e07275ad61d5934396a9d41fb.scope and its parent...
2022-09-29T07:29:19+00:00 Scope crio-fbcc50052c55de371d2927d4deb8f5bba8a7b9129e69e220cbadba4e33a93c10.scope found under running pod, skipping...
2022-09-29T07:29:19+00:00 Removing CGROUP crio-fde14b0c95048cdd7c7fa91b9cef3bd3fe8e8120f2c7ae83c330a6a423a2971f.scope and its parent...
2022-09-29T07:29:19+00:00 Removing CGROUP crio-ff49445e58458cb366245cbf1b85c72ba7e8a0868561505eb3355b7ce91f9154.scope and its parent...
2022-09-29T07:29:19+00:00 Removing CGROUP crio-ff6a76f198151084777c816edaea1ad6f2ec1d0acd10ca7e04fb451d89ec841c.scope and its parent...
2022-09-29T07:29:19+00:00 Sleeping for 600 seconds...

System memory spikes and keeps increasing after disabling the workaround, while running the test:

(screenshot)

When re-enabling the workaround I see this pattern:

(screenshot)

The workaround sleeps 10 minutes between runs, which explains the sawtooth pattern I guess.

Credits to @aneagoe
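For readers who can't open the gist, the general idea is roughly the following (a simplified sketch, not the actual script by @aneagoe; it assumes cgroups v2 mounted at /sys/fs/cgroup and crictl available on the host):

```sh
#!/bin/bash
# Simplified sketch: every 10 minutes, remove crio-*.scope cgroups whose
# container CRI-O no longer knows about, plus their (then empty) pod slice.
# Only covers burstable/besteffort pod slices; guaranteed pods sit one level up.
while true; do
  running=$(crictl ps -aq)
  for scope in /sys/fs/cgroup/kubepods.slice/*/*/crio-*.scope; do
    [ -d "$scope" ] || continue
    id=$(basename "$scope" .scope)
    id=${id#crio-}; id=${id#conmon-}       # crio-<id>.scope / crio-conmon-<id>.scope
    if ! grep -q "$id" <<<"$running"; then
      echo "$(date -Is) Removing CGROUP $(basename "$scope") and its parent..."
      rmdir "$scope" "$(dirname "$scope")" 2>/dev/null   # rmdir refuses non-empty cgroups
    fi
  done
  echo "$(date -Is) Sleeping for 600 seconds..."
  sleep 600
done
```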

DavidHaltinner commented 1 year ago

I put the workaround on a test cluster on Tuesday, and the issue has not gone away there; it actually seems to increase the log spam. I did not use the DaemonSet; instead I put it into a script and am running it on the nodes directly for the moment.

It is going through and appears to be doing what it's supposed to, for example:

Sep 29 07:13:36 MYSERVER.com check.sh[1119]: 2022-09-29T12:13:36+00:00 Starting k8s garbage collector run...
Sep 29 07:13:37 MYSERVER.com check.sh[1119]: 2022-09-29T12:13:37+00:00 Found POD MYSERVERcom-debug unknown to k8s control plane and without any PIDs, will delete it...
Sep 29 07:13:37 MYSERVER.com check.sh[3239730]: Stopped sandbox e62be3eacc7e2286012ff67f2aa969a6131fb4b87ac9d24f7eb98ae08cd346b3
Sep 29 07:13:37 MYSERVER.com check.sh[3239737]: Removed sandbox e62be3eacc7e2286012ff67f2aa969a6131fb4b87ac9d24f7eb98ae08cd346b3

But the memory leaks are still there - I lost two nodes (one control plane, one worker) last night to low memory on this cluster. The CPU usage is also still high (3-4 cores normally on these nodes, which are all 4-core) on nodes that are building up.

And while I still get the `pod_container_manager_linux.go:192] "Failed to delete cgroup paths" cgroupName=[kubepods besteffort p ........` logs, as well as the `kubelet_getters.go:300] "Path does not exist" path="/var/lib/kubelet/pods/e9da40e0-99ac-40d6-802a- .........` logs, I now also get new errors from systemd about the cgroups that the workaround removed: `systemd[1]: kubepods-burstable-pod4f247343_7b97_45fe_899b_9b023d5316cf.slice: Failed to open /run/systemd/transient/kubepods-burstable-pod4f247343_7b97_45fe_899b_9b023d5316cf.slice: No such file or directory`

The specific cluster I am testing the workaround on has barely any workloads left on it and no jobs (besides the couple of built-in ones), but I have some automation that regularly launches numerous debug pods, which seems to cause the same symptoms as the cronjobs do (numerous short-lived containers).

For me, restarting kubelet helps with the memory right away, but it doesn't help the CPU and gives unpredictable results (unable to launch pods, cannot get logs, can't connect to pod terminals, etc.). So I have just been gracefully draining and rebooting nodes automatically when RAM usage gets too high, before the OOM reaper kicks in, since it won't touch kubelet and only makes the situation worse because the cluster doesn't realize apps have been killed.

Edit: forgot to mention I'm on 4.11.0-0.okd-2022-08-20-022919.
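For anyone wanting to automate the same mitigation, a hedged sketch of the drain-and-reboot sequence (the node name is a placeholder):

```sh
NODE=ip-10-0-216-195.eu-central-1.compute.internal   # placeholder
oc adm cordon "$NODE"
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force
# Reboot through a debug pod, then return the node to service once it is Ready.
oc debug node/"$NODE" -- chroot /host systemctl reboot
oc adm uncordon "$NODE"
```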

msteenhu commented 1 year ago

The exact same version here in my test cluster. To me this difference in behaviour does not make sense. I monitored the workaround for a long time some months ago to make sure it worked and didn't cause problems, and it seemed to be fine...

I will continue my test for a longer time and watch for what you are describing, and for cluster stability.

DavidHaltinner commented 1 year ago

The exact same version here in my test cluster. To me this difference in behaviour does not make sense. I monitored the workaround for a long time some months ago to make sure it worked and didn't cause problems, and it seemed to be fine...

I will continue my test for a longer time and watch for what you are describing, and for cluster stability.

I'm rolling it out on a second cluster right now to see if I get different results there. One of those nodes is actually in its slow downward spiral as I type this (it fired the SystemMemoryExceedsReservation alert 30 minutes ago, which is a good harbinger of the memory leak taking over), but it has a lot of RAM to burn through yet, so it will take a while.

msteenhu commented 1 year ago

I'm rolling it out on a second cluster right now to see if I get different results there. One of those nodes is actually in its slow downward spiral as I type this (it fired the SystemMemoryExceedsReservation alert 30 minutes ago, which is a good harbinger of the memory leak taking over), but it has a lot of RAM to burn through yet, so it will take a while.

Indeed, I was also seeing that for months before I found some help (the workaround) in the openshift-users slack channel: https://kubernetes.slack.com/archives/C6AD6JM17/p1653030671315219

msteenhu commented 1 year ago

You can clearly see the problem, and the garbage collection every 10 minutes, using a simple `find`:

# pwd
/sys/fs/cgroup
# find . -type d | wc -l
463

The cgroups (and other things) don't get cleaned up by kubelet so the folder count keeps increasing until the garbage collector drops by... The workaround does clean that up, at least in my 4.11 test cluster.
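A quick way to compare the leftover pod-level cgroups with what CRI-O actually has running, directly on a node (a rough check; the path assumes cgroups v2):

```sh
# Pod-level cgroup slices currently present on the node...
find /sys/fs/cgroup/kubepods.slice -type d -name 'kubepods-*pod*.slice' | wc -l
# ...versus pod sandboxes CRI-O still knows about.
crictl pods -q | wc -l
```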

I will let my test run for a few days, because it seems like I have a little bit of memory leaking going on in 4.11 even with the workaround in place, but I'm not sure yet.

DavidHaltinner commented 1 year ago

I'm still watching it on the second cluster I added it to as well. I saw some results about 10 or 15 minutes after starting the garbage collector, but the SystemMemoryExceedsReservation alert never went away, and now, two hours later, the memory usage is starting to climb back up again. The GC is still running and still cleaning things up once in a while.

(screenshot)

Edit: here it is 24 hours later. It seems the garbage collector does help, but memory still spikes for a while before dropping suddenly. So while this is indeed prolonging the uptime, it doesn't seem to fix it for good, and after another spike or two it will still OOM, I suspect. It's on its third upward climb as we speak. The SystemMemoryExceedsReservation alert has never gone away since usage started climbing 24 hours ago either.

(screenshot)

msteenhu commented 1 year ago

Looks like memory is filling up, even when running the workaround :-(

(screenshot)

Definitely a showstopper for us to move to 4.11 if the node eventually crashes, which it probably will. I will try to find out what is left behind this time.

This crio memory leaking has been going on since we started using OKD 4. Maybe OKD should swap crio for containerd, which upstream clearly supports better...

msteenhu commented 1 year ago

Edit: here it is 24 hours later. It seems the garbage collector does help, but memory still spikes for a while before dropping suddenly. So while this is indeed prolonging the uptime, it doesn't seem to fix it for good, and after another spike or two it will still OOM, I suspect. It's on its third upward climb as we speak. The SystemMemoryExceedsReservation alert has never gone away since usage started climbing 24 hours ago either.

On bigger clusters it is not abnormal for the default 'SystemMemory' reservation to be too low. There is a RH issue explaining how to adjust the limit to your needs. We also have one production host that is always firing the 'SysMem' alert, but that is not causing problems. If your temporary spikes are too high, you could also change the garbage collector sleep time if you haven't already.
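A hedged sketch of raising the reservation via a KubeletConfig (the object name and values are placeholders; the RH article mentioned above should take precedence):

```sh
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:        # what SystemMemoryExceedsReservation compares against
      memory: 2Gi
      cpu: 500m
EOF
```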

I am still not sure about my test cluster. The RSS of the system slice keeps going up very slowly, but the total memory usage does not... I am still hoping it is memory that eventually gets flushed by crio or kubelet. Not sure where the RSS is building up yet.

Time will tell. I'd really like to move to 4.11. Big fan of the dark mode in the console :)

msteenhu commented 1 year ago

My test kept running over the weekend and I think we can conclude that the memory usage stabilises, and thus the workaround is still doing its thing for us. So I believe these are the same bugs that have been around for many versions.

For us it's no problem to move to 4.11.

(screenshot)

The graph might be misleading: I also added workload to our testing cluster, so I guess some of the increase in 'system' slice memory can be attributed to that. Most importantly, it seems to flatline at some point.

depouill commented 1 year ago

My test kept running over the weekend and I think we can conclude that the memory usage stabilises, and thus the workaround is still doing its thing for us. So I believe these are the same bugs that have been around for many versions.

For us it's no problem to move to 4.11. (screenshot)

The graph might be misleading: I also added workload to our testing cluster, so I guess some of the increase in 'system' slice memory can be attributed to that. Most importantly, it seems to flatline at some point.

Testing the patch on OKD 4.11.0-0.okd-2022-08-20-022919, nodes still crash due to memory exhaustion after three days. Hopefully k8s 1.24.6 will be coming soon? Thank you very much for the work.

AndrewSav commented 1 year ago

Hopefully k8s 1.24.6 will be coming soon?

It does not look like it's getting backported https://github.com/kubernetes/kubernetes/commits/v1.24.7-rc.0

marqsbla commented 1 year ago

It looks like k8s 1.24.6 has been merged into openshift/kubernetes:release-4.11: https://github.com/openshift/kubernetes/pull/1381. When can we expect a new release of OKD?

vrutkovs commented 1 year ago

We're waiting for it to be promoted in OCP nightlies and bumped in machine-os-content. Hopefully that will be this weekend or the next.

vrutkovs commented 1 year ago

It doesn't look like the rebase itself fixes the leak. I've used the reproducer from https://github.com/okd-project/okd/issues/1310#issuecomment-1221965431 and I still see used memory climbing.

(Screenshot, 2022-10-13 15:10)

My prom-fu is not good enough to tell whether it climbs more slowly, though; but from the looks of it, https://github.com/kubernetes/kubernetes/issues/106957#issuecomment-1147441007 mitigates the RAM consumption quite well.

vrutkovs commented 1 year ago

Also I upgraded to 4.12 (w/ workaround) and applied https://github.com/kubernetes/kubernetes/pull/108855 (using https://github.com/vrutkovs/custom-okd-os/pkgs/container/custom-okd-os/45544090?tag=custom-kubelet) - that seems to have stopped the memory growth and the log spam

bobbypage commented 1 year ago

@vrutkovs @tyronewilsonfh can you please share the reproducer used from https://github.com/okd-project/okd/issues/1310#issuecomment-1221965431?

I've been trying to repro this as part of https://github.com/kubernetes/kubernetes/issues/112151 but haven't had much luck yet. Are all these issues on CRI-O with crun, or have they also been seen with runc/containerd?

vrutkovs commented 1 year ago

I used https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#creating-a-cron-job to trigger the memory leak.

msteenhu commented 1 year ago

Will this help to create a 4.11 without this bug (backports of the fixed components)? Or what is the way forward?

aneagoe commented 1 year ago

Will this help to create a 4.11 without this bug (backports of the fixed components)? Or what is the way forward?

+1 for this. It would be great to understand the correct procedure to get our stable 4.10 cluster upgraded safely to 4.11.

vrutkovs commented 1 year ago

We're going to need to backport https://github.com/kubernetes/kubernetes/pull/108855 all the way to 1.24 to have it fixed.

Meanwhile it seems the workaround is to switch back to cgroups v1 (add `systemd.unified_cgroup_hierarchy=0` to the kernel params) - could anyone confirm that?
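A sketch of that kernel-argument change as a MachineConfig (repeat per MachineConfigPool; applying it triggers a rolling reboot of the pool):

```sh
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-cgroups-v1
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=0
EOF
```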

bdlink commented 1 year ago

I have what seems to be this issue (in the form of slowly increasing CPU usage requiring node reboots) with kubelet and often also OLM. As the cluster dates from cgroups v1 days and has been upgraded, I am running cgroups v1 (`mount | grep group` confirms; cgroups v2 is present but not being used).
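Another quick way to check which cgroup hierarchy a node is actually using (run on the node or via `oc debug node/...`):

```sh
stat -fc %T /sys/fs/cgroup   # "cgroup2fs" means cgroups v2, "tmpfs" means v1
```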

AndrewSav commented 1 year ago

I am running cgroups v1 (`mount | grep group` confirms; cgroups v2 is present but not being used).

same here - and same problem

marqsbla commented 1 year ago

Sorry if the question is irrelevant, but does the latest stable 4.11.0-0.okd-2022-10-15 release fix the issue in any way, @vrutkovs? Over the weekend, 2 of the 3 masters in my 4.11.0-0.okd-2022-08-20-022919 production cluster crashed due to the memory leak. Luckily I checked the alerts on Saturday between the machine crashes :).

I also checked that, without any modifications, the system seems to use both cgroups v1 and v2. Should I switch to cgroups v1 as you proposed? For reasons I would rather not mention, I can't use my test cluster for that, so I would have to make the changes directly in production, which I would prefer not to do without knowing that it would help...

vrutkovs commented 1 year ago

does the latest stable 4.11.0-0.okd-2022-10-15 release fix the issue in any way

Most likely not. It includes a fix for a memory leak in the job controller, but it seems more leaks are present:

BonzTM commented 1 year ago

Sorry if the question is irrelevant, but does the latest stable 4.11.0-0.okd-2022-10-15 release fix the issue in any way, @vrutkovs? Over the weekend, 2 of the 3 masters in my 4.11.0-0.okd-2022-08-20-022919 production cluster crashed due to the memory leak. Luckily I checked the alerts on Saturday between the machine crashes :).

I also checked that, without any modifications, the system seems to use both cgroups v1 and v2. Should I switch to cgroups v1 as you proposed? For reasons I would rather not mention, I can't use my test cluster for that, so I would have to make the changes directly in production, which I would prefer not to do without knowing that it would help...

It does not seem to fix it. I upgraded last night and awoke to 2 worker nodes unresponsive.

@vrutkovs care to share your upgrade directions for OKD 4.12?

vrutkovs commented 1 year ago

care to share your upgrade directions for OKD 4.12?

`oc adm upgrade --force --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image=<pick a nightly pullspec from https://amd64.origin.releases.ci.openshift.org/#4.12.0-0.okd>`. Warning: it's still being developed and nightlies are removed after 72 hours. Even with 4.12 you still need an OS with a custom patched kubelet.