kumahq / kuma

🐻 The multi-zone service mesh for containers, Kubernetes and VMs. Built with Envoy. CNCF Sandbox Project.
https://kuma.io/install
Apache License 2.0
3.55k stars 327 forks source link

Kuma CNI blocks PODs start #10461

Open rdavyd opened 2 weeks ago

rdavyd commented 2 weeks ago

What happened?

Ran into this issue during OS upgrade. kuma-cni and several other PODs could not start and were hanging in the unknown mode after node reboot. Containerd was spamming the following message (example for the cert-exporter):

Jun 13 11:21:29 myworker(MASKED) containerd[777]: time="2024-06-13T11:21:29.717025442Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:cert-exporter-sd6ln,Uid:61df9ed3-6c50-4bf7-80ea-035be14ef1a0,Namespace:monitoring,Attempt:1,} failed, error" error="failed to setup network for sandbox \"92042607aafde23c6fd44a7901d957df089b7ff2b2980dcd133cd38b2826be6f\": plugin type=\"kuma-cni\" name=\"kuma-cni\" failed (add): more than one process '/install-cni' running on a node, this should not happen"

Didn't find the install-cni process with ps ax | grep -i cni on the node. Issue is very rare and happened only twice for about 80 nodes. Issue is mitigated by removing kuma-cni from CNI config /etc/cni/net.d/10-kuberouter.conflist, just pasting there original kube-router conf. After kuma-cni POD starts it updates the CNI config with normal values.

My guess that != should be changed to > for the following line:

https://github.com/kumahq/kuma/blob/413bddfb40f22f3c2f2a52deef78d13be3c6a3c0/app/cni/pkg/cni/main.go#L114-L116

Installation type: On prem, kubespray K8s: 1.29.5 OS: Ubuntu 22.04 Network plugin: kube-router 2.0.1

slonka commented 2 weeks ago

More so, we shouldn't be erroring out on not being able to get stderr - we could just continue without log redirection.