Closed: dzacball closed this issue 4 months ago.
Just to make sure it's not specific to a combination of HEPs/GNPs/NetworkSets, I created a cluster with v3.27.3
and deleted all related resources. After that, I restarted calico-node, and the issue is still present:
$ k get networksets.crd.projectcalico.org -A
No resources found
$ k get hostendpoints.crd.projectcalico.org
No resources found
$ k get globalnetworkpolicies.crd.projectcalico.org
No resources found
$ k get networkpolicies.crd.projectcalico.org -A
No resources found
$ k get networkpolicies.networking.k8s.io -A
No resources found
$ k rollout restart daemonset -n kube-system calico-node
daemonset.apps/calico-node restarted
$ k get po -n kube-system | grep calico
calico-kube-controllers-5b784b4785-7898t 1/1 Running 0 91m
calico-node-gk7ft 1/1 Running 0 78s
calico-node-ngz97 1/1 Running 0 2m59s
calico-typha-5667cc7f45-d7lds 0/1 Pending 0 91m
calico-typha-5667cc7f45-kp9c4 1/1 Running 0 91m
calico-typha-5667cc7f45-l5pxg 1/1 Running 0 91m
$ k top po -n kube-system | grep calico
calico-kube-controllers-5b784b4785-7898t 3m 16Mi
calico-node-gk7ft 285m 125Mi <-- CPU should be much lower, about 1/10th of this
calico-node-ngz97 247m 132Mi <-- CPU should be much lower, about 1/10th of this
calico-typha-5667cc7f45-kp9c4 3m 20Mi
calico-typha-5667cc7f45-l5pxg 2m 21Mi
$ k logs -n kube-system calico-node-ngz97 --since 1s
Defaulted container "calico-node" out of: calico-node, install-cni (init)
2024-05-24 08:24:24.200 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.244 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.289 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.332 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.372 [INFO][59] felix/xdp_state.go 200: Retrying after an XDP update failure...
2024-05-24 08:24:24.428 [INFO][59] felix/int_dataplane.go 2154: Applying XDP actions did not succeed, disabling XDP error=failed to resync: failed to load BPF program (/usr/lib/calico/bpf/filter.o): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error loading BTF: Invalid argument(22)
libbpf: magic: 0xeb9f
version: 1
flags: 0x0
...
...
...
Worker env:
# cat /etc/os-release | grep VERSION
VERSION="20.04.6 LTS (Focal Fossa)"
VERSION_ID="20.04"
VERSION_CODENAME=focal
# uname -r
5.4.0-182-generic
Could you check your syslog for any networkd-dispatcher, NetworkManager, ethtool or systemd-udevd flooding? I just experienced something similar, where my master nodes didn't show any particular Kubernetes process loading up the node (at a glance in k9s), but the node itself was at 10x the usual load (top), after upgrading from 3.25 to 3.28. I don't think my problem is BPF related.
This, apparently, only affected my on-premise arch.
focal
systemd 245 (245.4-4ubuntu3.23)
kernel: 5.4.0-152-generic
k8s: v1.25.12
May 24 13:23:08 master networkd-dispatcher[688]: ERROR:Unknown interface index 1419614 seen even after reload
May 24 13:23:08 master networkd-dispatcher[688]: WARNING:Unknown index 1419616 seen, reloading interface list
May 24 13:23:08 master systemd-udevd[701481]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
[...]
May 24 13:23:08 master systemd-udevd[701481]: ethtool: could not get ethtool features for calico_tmp_B
May 24 13:23:08 master systemd-udevd[701481]: Could not set offload features of calico_tmp_B: No such device
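A quick way to gauge that kind of flooding (a sketch; assumes logs land in /var/log/syslog on Ubuntu):
# grep -cE 'networkd-dispatcher|NetworkManager|ethtool|systemd-udevd' /var/log/syslog
An unusually high and steadily growing count while calico-node churns would point in this direction.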
@spacegaucho similar stuff here:
# cat /var/log/syslog | tail -n 10
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 networkd-dispatcher[785]: WARNING:Unknown index 450967 seen, reloading interface list
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10804]: calico_tmp_B: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10802]: calico_tmp_A: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 networkd-dispatcher[785]: WARNING:Unknown index 450969 seen, reloading interface list
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10804]: calico_tmp_B: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10802]: calico_tmp_A: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-networkd[679]: calico_tmp_B: Failed to wait for the interface to be initialized: No such device
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 networkd-dispatcher[785]: WARNING:Unknown index 450971 seen, reloading interface list
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10802]: calico_tmp_A: Failed to get link config: No buffer space available
May 24 14:39:03 pres-cp83edv10o3dr35lr8g0-calidebug-default-00000228 systemd-udevd[10804]: calico_tmp_B: Failed to get link config: No buffer space available
Could you check your calico-node logs to see if there's anything like the logs below?
[INFO][60] felix/int_dataplane.go 2154: Applying XDP actions did not succeed, disabling XDP error=failed to resync: failed to load BPF program (/usr/lib/calico/bpf/filter.o): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error loading BTF: Invalid argument(22)
@dzacball looks like we are onto something here.
No, there's nothing referencing the events you mentioned:
k stern -n kube-system calico-node-* --since=48h --no-follow | grep 'XDP actions did not succeed'
+ calico-node-zlbwb › calico-node
+ calico-node-gm87s › calico-node
+ calico-node-4fp5p › calico-node
+ calico-node-vkh8q › calico-node
+ calico-node-pphxf › calico-node
+ calico-node-cdp9k › calico-node
+ calico-node-8smsj › calico-node
+ calico-node-nfww9 › calico-node
+ calico-node-lptlx › calico-node
- calico-node-cdp9k › calico-node
- calico-node-4fp5p › calico-node
- calico-node-vkh8q › calico-node
- calico-node-nfww9 › calico-node
- calico-node-zlbwb › calico-node
- calico-node-8smsj › calico-node
- calico-node-lptlx › calico-node
- calico-node-gm87s › calico-node
- calico-node-pphxf › calico-node
But the issue persists. I had to roll back to 3.25 and that "fixed" the issue. I'm browsing around for any additional information.
@spacegaucho can you possibly check if the issue is gone after upgrading bpftool?
Sure! @mazdakn how should I go about that? My calico-node pods currently run the following:
bpftool --version
bpftool v5.3.0
Should I test a v3.29.0-0.dev image? Or would I need to update the binary manually inside the pods? Thanks.
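For reference, the version can be checked across all the calico-node pods without a new image (a sketch, assuming the standard k8s-app=calico-node label on the pods):
$ for p in $(kubectl get po -n kube-system -l k8s-app=calico-node -o name); do echo "$p: $(kubectl exec -n kube-system "$p" -c calico-node -- bpftool version)"; done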
@spacegaucho Thanks for being OK to test it. I'll provide you an image with the updated bpftool.
I got a similar issue here. I'm not very sure about the root cause, but once I upgraded my kernel from 5.4.0-182-generic to 5.15.0-107-generic, the issue disappeared.
Referring to the kernel versions here: https://ubuntu.com/security/livepatch/docs/livepatch/reference/kernels. What I did was install the HWE kernel (I'm using Ubuntu 20.04 LTS).
So I'm not sure if this issue is related to a specific kernel version or actually related to the HWE kernel.
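For reference, installing the HWE kernel on Ubuntu 20.04 is roughly this (a sketch; a reboot is needed for the new kernel to take effect):
# apt-get update
# apt-get install --install-recommends linux-generic-hwe-20.04
# reboot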
@spacegaucho here is an image with the updated bpftool (v7.4). Please give it a try, and let me know if it works for you.
mazdakrn/calico-node:latest
This is based on master branch, but should be OK to test it in a v3.28 cluster.
@dzacball @spacegaucho can you also disable XDP mode in iptables by setting xdpEnabled to false in the felixconfiguration resource? This should prevent felix from trying to load the iptables XDP objects.
Ref: https://docs.tigera.io/calico/latest/reference/resources/felixconfig
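For example, with the Calico API server (or calicoctl) available, that could look roughly like this (a sketch, assuming the cluster-wide FelixConfiguration is named default):
$ kubectl patch felixconfiguration default --type merge -p '{"spec":{"xdpEnabled":false}}'
Without the API server, the same merge patch should be applicable to the felixconfigurations.crd.projectcalico.org resource instead.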
@mazdakn I can verify that if I set xdpEnabled to false, the issue is gone.
@dzacball can you also try the image I mentioned above in one of your test clusters?
@mazdakn I tested your image - it works, issue is gone. (As I already mentioned, I also did some tests a few weeks back with a self-built image, using v3.28.0 + latest bpftool, worked as well)
sorry for not replying earlier, will try it asap.
Can confirm the updated image fixed the issue for me as well:
1. Rolled back to 3.25 and checked that there were no matches for the "XDP actions did not succeed" string, as I tested in a previous post.
2. Upgraded to 3.28 (using this upgrade path) and verified that the string was present and flooding.
3. Modified my upgrade.yaml file to include the image provided by @dzacball, applied it, and verified that after the image was set in place and the pods restarted, the log flooding stopped (a rough equivalent is sketched below).
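The same image swap can also be done directly on the DaemonSet in a manifest-based install (a sketch; not applicable when the tigera-operator manages the resources, since it would reconcile the change away):
$ kubectl set image daemonset/calico-node -n kube-system calico-node=mazdakrn/calico-node:latest
$ kubectl rollout status daemonset/calico-node -n kube-system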
@spacegaucho thanks for testing and confirmation. Did you also manage to set xdpEnabled to false and see if the issue is gone?
Re xdpEnabled: no, sorry, should I test that directly in vanilla 3.28?
Yes, with vanilla 3.28 image.
I notice that #8880 is flagged for 3.29, is it possible this will also get a 3.28 backport? I was about to go down the rabbit hole of switching to the eBPF datapath and exploring XDP, but don't want to run into this issue.
@isugimpy we will definitely backport it to v3.28, and also v3.27. However, this issue does not affect the eBPF dataplane; it is about XDP as used in the iptables dataplane. The eBPF dataplane is NOT affected.
Oh, brilliant. I didn't catch that this was unique to iptables! Thank you!
@mazdakn Do you have an ETA for upcoming 3.27 and 3.28 releases that will already contain this fix PR?
@dzacball v3.27.4 is expected to be released early July (probably the first week). There is no ETA for 3.28 patch release yet.
@mazdakn Thank you.
Closing since the fixes (including back ports to v3.27 and v3.28) are merged now, and will be available in the next patch releases.
@dzacball 3.28.1 is planned to be released in the second half of July.
thanks for the heads-up
@mazdakn Do you have any update when the release will arrive? Thank you!
@mihivagyok we are in the process of releasing 3.28.1 atm. It most likely will be released early next week.
Expected Behavior
Calico using a reasonable amount of CPU. No XDP/BPF related error logs in calico-node.

Current Behavior
Calico is using about 10x its usual CPU. In the calico-node logs, I can see messages like the "Applying XDP actions did not succeed" and libbpf "Error loading BTF" errors quoted earlier.

Possible Solution
If I rebuild calico-node using an up-to-date version of bpftool (tried with v7.4 and it worked), the issue is gone.

Steps to Reproduce (for bugs)
Not sure how to reproduce. It happens in our env, and it was introduced by updates to map definitions in https://github.com/projectcalico/calico/pull/8610.

Context
We cannot use the affected versions of Calico due to the high CPU load (and of course we aren't sure what other issues could arise).

Your Environment
Calico v3.27.3 and/or v3.28.0. K8s 1.28 + Ubuntu 20.04.6 workers with the 5.4.0-177-generic kernel.

More details
I could confirm that this issue was introduced by this PR: https://github.com/projectcalico/calico/pull/8610. It is present in Calico v3.27.3 and v3.28.0. If I revert the PR and build my own calico-node image on top of v3.28.0, the issue is gone. Also, if I rebuild calico-node with an up-to-date version of bpftool (tried with v7.4 and it worked), the issue is gone.
Related slack thread: https://calicousers.slack.com/archives/CPTH1KS00/p1713552425270619