rancher / rke

Rancher Kubernetes Engine (RKE), an extremely simple, lightning fast Kubernetes distribution that runs entirely within containers.

High CPU load due to calico errors on Ubuntu 20.04 - Update to calico needed #3648

Closed tmsdce closed 2 months ago

tmsdce commented 3 months ago

This issue is related to this Calico issue: https://github.com/projectcalico/calico/issues/8856. Everything is explained in that thread, but here is a quick view of the logs we're seeing in Calico, which coincide with the high CPU usage:

libbpf: Error loading .BTF into kernel: -22.
Error: failed to open object file
, will proceed anyway.
2024-08-09 08:53:08.290 [WARNING][53] felix/int_dataplane.go 1747: failed to wipe the XDP state error=failed to load BPF program (/usr/lib/calico/bpf/filter.o): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error loading BTF: Invalid argument(22)
libbpf: magic: 0xeb9f

Kubernetes versions 1.29.x and 1.30.x deployed by RKE use Calico 3.27.3 and 3.28.0 respectively, both of which are affected by the above issue. The bug was fixed in Calico 3.27.4 and 3.28.1.
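As a quick way to check a cluster against the versions named above, here is a minimal sketch. It only knows the four Calico versions mentioned in this issue; the function names are hypothetical, and any other version is reported as unknown rather than guessed.

```python
# Affected -> fixed Calico releases, exactly as stated in this issue.
# Versions not named in the thread are deliberately reported as "unknown".
FIXED_BY = {
    "3.27.3": "3.27.4",
    "3.28.0": "3.28.1",
}


def normalize(tag: str) -> str:
    """Strip an optional leading 'v' so 'v3.28.0' and '3.28.0' compare equal."""
    return tag.strip().lstrip("v")


def calico_status(tag: str) -> str:
    """Return a short verdict for a Calico version or image tag."""
    version = normalize(tag)
    if version in FIXED_BY:
        return f"affected, upgrade to {FIXED_BY[version]}"
    if version in FIXED_BY.values():
        return "contains the fix"
    return "unknown (not mentioned in this issue)"


if __name__ == "__main__":
    for tag in ("v3.27.3", "v3.28.0", "v3.28.1", "v3.26.4"):
        print(tag, "->", calico_status(tag))
```

The tag to feed in is the one on the calico-node image actually running in the cluster.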

Can you cut a new RKE release that includes these fixed versions of Calico?

RKE version:

1.6.1

Docker version: (docker version,docker info preferred)

Client: Docker Engine - Community
 Version:           24.0.9
 API version:       1.43
 Go version:        go1.20.13
 Git commit:        2936816
 Built:             Thu Feb  1 00:48:08 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.9
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.13
  Git commit:       fca702d
  Built:            Thu Feb  1 00:48:08 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
5.4.0-192-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

vSphere

wzrdtales commented 3 months ago

bump...

tmsdce commented 3 months ago

This is causing some clusters to hang due to high CPU usage and a memory leak. We have to restart the calico daemonset regularly to free up resources, but this is not a sustainable workaround. Embedding the fixed versions of Calico should be enough to resolve the issue.

Maybe @jiaqiluo or @kinarashah can have a look at this?

jiaqiluo commented 3 months ago

Hi @tmsdce, thank you for reporting the issue. Our team will follow up.

wzrdtales commented 2 months ago

When is this going to be released? This is a production-critical bug.

jiaqiluo commented 2 months ago

Hi @wzrdtales, please follow the linked issues in the rancher/rancher repo for milestones and progress.

wzrdtales commented 2 months ago

Thanks, I just thought SUSE would be a little more sensitive to issues that are hitting their users' production systems and causing downtime.

The worst part is that there is no easy way to upgrade these Calico versions without RKE integrating them, as far as I could see.

jiaqiluo commented 2 months ago

I strongly recommend using the support channel to emphasize the urgency of fixing the bug. This will help ensure that the product management team prioritizes the issue.

nickvth commented 2 months ago

@wzrdtales as a workaround, you can change your container images to 3.28.1 in the deployment and daemonset.

rishabhmsra commented 2 months ago

Validated the calico version bump as a part of KDM August patch testing:

Fresh install: [screenshot: calico]

Upgrade checks: [screenshot: calico1]

v2.8 KDM August patch testing also passed successfully.

wzrdtales commented 2 months ago

> @wzrdtales as a workaround, you can change your container images to 3.28.1 in the deployment and daemonset.

Well, the operator resets the image tag again after any change (at least with RKE2).

tmsdce commented 2 months ago

@wzrdtales You can override the system images used by RKE in your RKE config file: https://rke.docs.rancher.com/config-options/system-images

If you adapt the Calico-related images, you should be able to fix the issue while waiting for a new RKE release.
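To illustrate the override idea, here is a minimal sketch that retags a set of Calico images and prints a `system_images:` fragment for the RKE config file. The key names (`calico_node`, `calico_cni`, and so on) and the image names are assumptions written from memory, not taken from this thread, so check them against the documentation page linked above and against the images RKE reports for your Kubernetes version before using them.

```python
# Hypothetical current values; replace them with the calico entries RKE
# reports for your Kubernetes version. Key and image names are assumptions.
CURRENT_IMAGES = {
    "calico_node": "rancher/mirrored-calico-node:v3.28.0",
    "calico_cni": "rancher/mirrored-calico-cni:v3.28.0",
    "calico_controllers": "rancher/mirrored-calico-kube-controllers:v3.28.0",
    "calico_ctl": "rancher/mirrored-calico-ctl:v3.28.0",
    "calico_flexvol": "rancher/mirrored-calico-pod2daemon-flexvol:v3.28.0",
}


def retag(image: str, new_tag: str) -> str:
    """Replace the tag of an image reference, keeping registry and repository."""
    repo, sep, old_tag = image.rpartition(":")
    # No ':' at all, or the last ':' belongs to a registry port (tag missing).
    if not sep or "/" in old_tag:
        return f"{image}:{new_tag}"
    return f"{repo}:{new_tag}"


def system_images_override(images: dict[str, str], new_tag: str) -> str:
    """Render a `system_images:` fragment with every calico image retagged."""
    lines = ["system_images:"]
    for key, image in images.items():
        if key.startswith("calico_"):
            lines.append(f"  {key}: {retag(image, new_tag)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(system_images_override(CURRENT_IMAGES, "v3.28.1"))
```

The sketch does not handle digest-pinned references (`image@sha256:...`), and it does not verify that the retagged images actually exist in the registry.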

wzrdtales commented 2 months ago

These are the docs for RKE1, not RKE2.

tmsdce commented 2 months ago

The bug was opened for RKE1, so I thought you were using RKE1. I understand the issue also applies to RKE2. My mistake.

jiaqiluo commented 2 months ago

Hi @wzrdtales, I did a quick search and found the following info:

1/ Calico 3.28.1, which contains the fix for this issue, is used in the following RKE2 versions:

2/ Those RKE2 versions will be available in the Rancher v2.8.x and v2.9.x release lines once 2.8.8 and 2.9.2 are released. Please check the links for more details:

mitulshah-suse commented 2 months ago

2.8 and 2.9 validations have been done as part of the associated issues:
https://github.com/rancher/rancher/issues/47024#issuecomment-2352694271
https://github.com/rancher/rancher/issues/47046#issuecomment-2348781406

Closing this issue.

tmsdce commented 2 months ago

Hi @jiaqiluo

I see the issue was closed with reference only to the Rancher/RKE2 fixes. Any ETA for a fix for RKE1?

mitulshah-suse commented 2 months ago

@tmsdce It will be released with 2.8.8 and 2.9.2, as you can see on the milestones of the associated tickets. The RKE1 bump was validated with the fix, which is why this ticket is closed: https://github.com/rancher/rke/issues/3648#issuecomment-2354765374

tmsdce commented 2 months ago

OK, thanks for your quick reply @mitulshah-suse.

wzrdtales commented 2 months ago

@jiaqiluo @mitulshah-suse 3.28.1 is broken as well if a Kubernetes endpoint is configured; there is a fix for that in 3.28.2.