Closed: 8Iue closed this issue 11 months ago
Hi @8Iue, thanks for the detailed report! Could you share some more details about your setup, say, the topology and BGP peerings? Is there any chance that the address you used in the `can-reach` autodetection config could be reached by multiple NICs, and thus Calico picks up the wrong one? Could you try using the `interface=XXX` method to point at the correct interface, to check if this fixes the issue?
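For reference, a minimal sketch of what that would look like in the `calico-node` DaemonSet env (the interface name `eno1` below is just an example; substitute whichever NIC carries the node IP, and note the value is treated as a regex):

```yaml
# In the calico-node container of the calico-node DaemonSet (kube-system):
env:
  - name: IP_AUTODETECTION_METHOD
    value: "interface=eno1"  # example NIC name; matched as a regular expression
```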
thanks for the feedback @coutinhop!
Topology and BGP Peerings:

- 3 master nodes: `master00`, `master01`, and `master02`. All of these are virtual machines.
- 1 worker node: `worker00`, which is a physical server with 4 NICs.

Internal IPv4 Addresses:

- `master00`: 172.16.0.180
- `master01`: 172.16.0.181
- `master02`: 172.16.0.182
- `worker00`: 172.16.0.169

I've set the `IP_AUTODETECTION_METHOD` to target the default gateway using `can-reach=172.16.0.1`. When I run `kubectl describe pod` for the Calico pod on `worker00`, it correctly returns the IP value as 172.16.0.169:

```
Name: calico-node-j468d
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: worker00/172.16.0.169
Start Time: Wed, 23 Aug 2023 11:01:41 +0200
Labels: controller-revision-hash=76c9f59b5f
k8s-app=calico-node
pod-template-generation=1
Annotations: <none>
Status: Running
IP: 172.16.0.169
IPs:
IP: 172.16.0.169
Controlled By: DaemonSet/calico-node
```
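(Side note: the `can-reach` method essentially follows the kernel's route lookup to the given address, so a quick way to confirm which interface and source IP the node would pick is something like:)

```
worker00~$ ip route get 172.16.0.1
```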
Given the topology and the IP addresses, there's only one active NIC on `worker00` that can reach 172.16.0.1. However, I understand your point about the possibility of multiple NICs being able to reach the specified address. I've also tried setting `interface=XXX`; however, the behaviour is the same.
@coutinhop Just an update, as I've started looking into this again.
Whilst restarting the Calico resources manually (deleting and recreating them), the Calico Kube Controller attempted to start on `worker00`, and I found these log entries in `/var/log/calico/cni/cni.log` (Calico's `log_file_path` parameter):

```
2023-09-25 19:17:48.678 [INFO][4181011] k8s.go 411: Added Mac, interface name, and active container ID to endpoint ContainerID="cd442b74c97389a8bbe6c81b23cdae13e8ef1ee53e429d4fc602917322f7254d" Namespace="kube-system" Pod="calico-kube-controllers-7b7d9b66c-t2x8x" WorkloadEndpoint="worker00-k8s-calico--kube--controllers--7b7d9b66c--t2x8x-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"worker00-k8s-calico--kube--controllers--7b7d9b66c--t2x8x-eth0", GenerateName:"calico-kube-controllers-7b7d9b66c-", Namespace:"kube-system", SelfLink:"", UID:"ef780853-c94c-4d94-9c18-69db2a4e3312", ResourceVersion:"5723018", Generation:0, CreationTimestamp:time.Date(2023, time.September, 25, 19, 17, 23, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"k8s-app":"calico-kube-controllers", "pod-template-hash":"7b7d9b66c", "projectcalico.org/namespace":"kube-system", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"calico-kube-controllers"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ZZZ_DeprecatedClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"worker00", ContainerID:"cd442b74c97389a8bbe6c81b23cdae13e8ef1ee53e429d4fc602917322f7254d", Pod:"calico-kube-controllers-7b7d9b66c-t2x8x", Endpoint:"eth0", ServiceAccountName:"calico-kube-controllers", IPNetworks:[]string{"10.244.131.175/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.kube-system", "ksa.kube-system.calico-kube-controllers"}, InterfaceName:"calic82c0ab436c", MAC:"56:a7:35:b3:71:62", Ports:[]v3.WorkloadEndpointPort(nil), AllowSpoofedSourcePrefixes:[]string(nil)}}
```

The `WorkloadEndpoint` name has `eth0` appended at the end, and the `Endpoint` parameter is set to `eth0`; however, the `worker00` node does not have an `eth0` interface:

```
$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether b0:7b:25:f8:07:90 brd ff:ff:ff:ff:ff:ff
altname enp225s0f0
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether b0:7b:25:f8:07:91 brd ff:ff:ff:ff:ff:ff
altname enp225s0f1
4: enp161s0f0np0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 5c:6f:69:5d:2f:a0 brd ff:ff:ff:ff:ff:ff
5: enp161s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 5c:6f:69:5d:2f:a1 brd ff:ff:ff:ff:ff:ff
6: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
link/ether 4e:98:af:ec:91:9a brd ff:ff:ff:ff:ff:ff
42297: cali77b90bb33ab@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-305c6573-6d70-2d14-07a8-b54f568be31f
42897: cali6973da3279e@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-4bf77a94-262b-d9aa-2c19-5dcad76add85
42900: calic82c0ab436c@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-09626670-9935-81fa-4280-1d69627ffd5d
35564: cni0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 1a:0d:65:51:2a:6f brd ff:ff:ff:ff:ff:ff
42233: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
```
The `IP_AUTODETECTION_METHOD` parameter was set in the `DaemonSet` for the `calico-node` resource, not on the Calico controller, and I couldn't find a parameter for interface selection in the controller documentation: https://docs.tigera.io/calico/3.24/reference/kube-controllers/configuration
Additionally, this error happens only on pods that attempt to start on the `worker00` node, selecting the wrong `eth0` interface.
Here are some additional logs, excluding the pod names, for further info:

```
$ grep "2023-09-25" /var/log/calico/cni/cni.log | grep -vE "controllers|runner|cert--manager" | grep eth0
2023-09-25 23:57:53.148 [INFO][223069] dataplane_linux.go 524: Deleting workload's device in netns. ContainerID="a86aebb592ea77f2d2d5312ad486048a4d96f9cc3017bd37cc86e5e4a5bf34c2" iface="eth0" netns="/var/run/netns/cni-41dca964-e6c0-ce17-a4cd-31f0e92f29c9"
2023-09-25 23:57:53.148 [INFO][223069] dataplane_linux.go 535: Entered netns, deleting veth. ContainerID="a86aebb592ea77f2d2d5312ad486048a4d96f9cc3017bd37cc86e5e4a5bf34c2" iface="eth0" netns="/var/run/netns/cni-41dca964-e6c0-ce17-a4cd-31f0e92f29c9"
2023-09-25 23:57:53.209 [INFO][223069] dataplane_linux.go 569: Deleted device in netns. ContainerID="a86aebb592ea77f2d2d5312ad486048a4d96f9cc3017bd37cc86e5e4a5bf34c2" after=61.2644ms iface="eth0" netns="/var/run/netns/cni-41dca964-e6c0-ce17-a4cd-31f0e92f29c9"
2023-09-25 23:57:53.438 [INFO][223139] dataplane_linux.go 520: CleanUpNamespace called with no netns name, ignoring. ContainerID="4c6546205912a20a2e31a9b5fccbab87f993da3516eb7e3a2b813fa7cc3e6d8a" iface="eth0" netns=""
2023-09-25 23:57:54.104 [INFO][223204] dataplane_linux.go 520: CleanUpNamespace called with no netns name, ignoring. ContainerID="a86aebb592ea77f2d2d5312ad486048a4d96f9cc3017bd37cc86e5e4a5bf34c2" iface="eth0" netns=""
2023-09-25 23:57:54.248 [INFO][223270] dataplane_linux.go 520: CleanUpNamespace called with no netns name, ignoring. ContainerID="4c6546205912a20a2e31a9b5fccbab87f993da3516eb7e3a2b813fa7cc3e6d8a" iface="eth0" netns=""
```

These logs get repeated, but I believe they are only symptoms of the wrong interface selection.
> The WorkloadEndpoint name has eth0 appended at the end, and the Endpoint parameter is set to eth0; however, the worker00 node does not have an eth0 interface:
I think this is a red herring - the WorkloadEndpoint `eth0` is referring to the name of the interface inside your container, which will be `eth0`. It's not tied to the host interfaces at all.
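One way to confirm this is to list the interfaces inside one of the pod network namespaces from the host (netns name taken from your `ip link` output above); the container side should show up as `eth0`:

```
worker00~$ ip netns exec cni-09626670-9935-81fa-4280-1d69627ffd5d ip link show
```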
```
Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
```
This suggests that BIRD is failing to start on that node, which would explain why pods on that node are not functioning. I think this is what we need to diagnose.
Do you have a full `calico-node` log output (e.g., from `kubectl logs`) from the affected node that I could take a look at? That might have a clue as to why BIRD is crashing or failing to start.
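For example, something along these lines (pod name taken from your earlier `describe` output):

```
kubectl logs -n kube-system calico-node-j468d -c calico-node --timestamps
```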
@caseydavenport Thanks for looking into this, here are the logs of the affected `calico-node-<id>` pod: http://logpaste.com/vubNfLZm
Also, here are the events happening on the pod:

```
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m19s default-scheduler Successfully assigned kube-system/calico-node-b24lj to worker00
Normal Started 6m15s kubelet Started container calico-node
Normal Created 6m15s kubelet Created container calico-node
Warning Unhealthy 6m14s kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Normal Killing 6m14s kubelet Stopping container calico-node
Normal SandboxChanged 6m12s kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 6m12s (x2 over 6m19s) kubelet Container image "docker.io/calico/cni:v3.24.5" already present on machine
Normal Created 6m12s (x2 over 6m19s) kubelet Created container upgrade-ipam
Normal Started 6m12s (x2 over 6m19s) kubelet Started container upgrade-ipam
Normal Started 6m11s (x2 over 6m18s) kubelet Started container install-cni
Normal Pulled 6m11s (x2 over 6m18s) kubelet Container image "docker.io/calico/cni:v3.24.5" already present on machine
Normal Created 6m11s (x2 over 6m18s) kubelet Created container install-cni
Normal Created 6m9s (x2 over 6m16s) kubelet Created container mount-bpffs
Normal Started 6m9s (x2 over 6m16s) kubelet Started container mount-bpffs
Normal Pulled 6m9s (x2 over 6m16s) kubelet Container image "docker.io/calico/node:v3.24.5" already present on machine
Normal Pulled 6m8s (x2 over 6m15s) kubelet Container image "docker.io/calico/node:v3.24.5" already present on machine
Warning BackOff 71s (x31 over 5m55s) kubelet Back-off restarting failed container
```
`calico-node -show-status` from a working `calico-node` pod

For some more info, I've run the `calico-node -show-status` command on a `calico-node` pod that is working fine (for instance, one running on a master node). Whilst the affected `calico-node` pod on the worker node is still running, the BGP connections show as `Established`:

```
bird v4 BGP peers
+--------------+-----------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | BGPSTATE |
+--------------+-----------+-------+------------+-------------+
| 172.16.0.180 | Mesh | up | 2023-09-26 | Established |
| 172.16.0.181 | Mesh | up | 2023-09-26 | Established |
| 172.16.0.169 | Mesh | up | 09:14:00 | Established |
+--------------+-----------+-------+------------+-------------+
```
Once the affected `calico-node` gets killed and restarted into `CrashLoopBackOff`, the BGP peer then shows as `Active`:

```
bird v4 BGP peers
+--------------+-----------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | BGPSTATE |
+--------------+-----------+-------+------------+-------------+
| 172.16.0.180 | Mesh | up | 2023-09-26 | Established |
| 172.16.0.181 | Mesh | up | 2023-09-26 | Established |
| 172.16.0.169 | Mesh | start | 09:16:01 | Active |
+--------------+-----------+-------+------------+-------------+
```
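(For reference, these status dumps can be pulled from a control-plane node with something like the following, using the pod name of a healthy `calico-node`:)

```
kubectl exec -n kube-system <calico-node-pod> -c calico-node -- calico-node -show-status
```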
I've also found this open issue, which has an open PR: https://github.com/projectcalico/calico/issues/7472. I've then tried configuring this `livenessProbe` command on the `calico-node` `DaemonSet` resource and the `calico-kube-controllers` `Deployment` resource:

```yaml
livenessProbe:
  exec:
    command:
      - bash
      - -c
      - /bin/calico-node -felix-live -bird-live && /usr/bin/birdcl -s /var/run/calico/bird.ctl show status
```
But again, no luck.
It does appear from both the show-status output and the log that BIRD is actually starting up and running successfully, so it's odd that the liveness probe indicates it cannot connect to BIRD. This seems to suggest it might just be a problem with the liveness probe. It seems strange that it would happen just on one node.
Is this a fresh install? Or has it been upgraded from a previous version? Are all nodes running the same version of Calico?
One thing to try would be to remove the `livenessProbe` altogether and see if Calico stabilizes. If so, that further suggests it's just a probe issue of some sort.
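A quick way to do that without editing the whole manifest could be a JSON patch along these lines (assuming `calico-node` is the first container in the pod spec; adjust the index otherwise):

```
kubectl patch ds -n kube-system calico-node --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
]'
```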
@caseydavenport thanks for the input!
NOTE: the firewall is disabled for the time being, although iptables rules have been automatically updated with new Kubernetes rules upon reconfiguring and resetting the worker node.
> It seems strange that it would happen just on one node.
Correct, and this node is the only physical node of the two clusters that we have (staging and production).
This physical worker has multiple interfaces, as mentioned before, and the `calico-node` `DaemonSet` has been configured with the `IP_AUTODETECTION_METHOD` to overcome this possible issue.
> Is this a fresh install? Or has it been upgraded from a previous version? Are all nodes running the same version of Calico?
This is a fresh install on our production environment, and the worker node (`worker00`) has been reset (many times) with `kubeadm reset`; I've also removed all the Kubernetes and Calico CNI resources from the worker node multiple times by running this:

```
master00~$ kubectl cordon worker00
master00~$ kubectl drain worker00 --ignore-daemonsets --delete-emptydir-data
master00~$ kubectl delete node worker00
worker00~$ kubeadm reset --cri-socket "unix:///run/containerd/containerd.sock" -f
worker00~$ rm -rf /var/lib/etcd /var/lib/kubelet /var/lib/dockershim /var/run/kubernetes /var/lib/cni /etc/cni /opt/cni /etc/kubernetes
worker00~$ shutdown -r now
```
Upon rebooting the system, all routes and network configuration had been removed.
Also, all Kubernetes packages have been APT-pinned to the same versions across all nodes (staging and production).
This is the Docker image version SHA: docker.io/calico/node@sha256:5972ad2bcbdc668564d3e26960c9c513b2d7b05581c704747cf7c62ef3a405a6
`livenessProbe` + `readinessProbe`

I've been troubleshooting this for the past... hours, and I did indeed try to completely remove the `livenessProbe` and `readinessProbe` from the `calico-node` `DaemonSet`, but all I'm getting is a `SandboxChanged` event that for some reason is killing the pod:

```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 14m default-scheduler Successfully assigned kube-system/calico-kube-controllers-54fd75b97c-g48bc to worker00
Normal Created 14m (x2 over 14m) kubelet Created container calico-kube-controllers
Normal Started 14m (x2 over 14m) kubelet Started container calico-kube-controllers
Normal SandboxChanged 13m (x2 over 14m) kubelet Pod sandbox changed, it will be killed and re-created.
Warning BackOff 13m (x4 over 13m) kubelet Back-off restarting failed container
Normal Killing 8m21s (x5 over 14m) kubelet Stopping container calico-kube-controllers
Normal Pulled 3m56s (x7 over 14m) kubelet Container image "docker.io/calico/kube-controllers:v3.24.5" already present on machine
```
For extra information, this is the Docker image being used in the `calico-node` container: docker.io/calico/node@sha256:5972ad2bcbdc668564d3e26960c9c513b2d7b05581c704747cf7c62ef3a405a6
And in the `kubectl describe` output the exit code was set to `0`, but again, it seems that the main process of the container is exiting for whatever reason.
> I've been troubleshooting this for the past... hours, and I did indeed try to completely remove the livenessProbe and readinessProbe from the calico-node DaemonSet, but all I'm getting is a SandboxChanged event that for some reason is killing the pod
This is at least a clue that it's not terminating because of the probes; rather, something is actually going wrong.
If you use `kubectl logs` with the `-p` argument to retrieve the logs from the last running container, is there anything interesting at the very end? Afraid the logpaste link above has expired, and I can't recall if it included the `-p` arg.
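For example (pod name taken from the events output above; `-p`/`--previous` fetches the logs of the last terminated container):

```
kubectl logs -n kube-system calico-node-b24lj -c calico-node -p --tail=50
```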
> Correct, and this node is the only physical node of the two clusters that we have (staging and production).
This does seem like a suspicious difference between this node and the others, but everything you've described so far in terms of configuration and process seems good to me.
> If you use kubectl logs with the -p argument to retrieve the logs from the last running container, is there anything interesting at the very end? Afraid the logpaste link above has expired, and I can't recall if it included the -p arg.
Ah sorry, I didn't realise the link would expire.
The logs almost always end with the same last line, which is the following:

```
2023-09-28 09:07:03.157 [INFO][97] felix/health.go 242: Overall health status changed newStatus=&health.HealthReport{Live:true, Ready:true, Detail:"+------------------+---------+----------------+-----------------+--------+\n| COMPONENT | TIMEOUT | LIVENESS | READINESS | DETAIL |\n+------------------+---------+----------------+-----------------+--------+\n| async_calc_graph | 20s | reporting live | reporting ready | |\n| felix-startup | 0s | reporting live | reporting ready | |\n| int_dataplane | 1m30s | reporting live | reporting ready | |\n+------------------+---------+----------------+-----------------+--------+"}
```

This seems to state that `felix` is healthy and listening, but it gets killed right after that.
Here are the logs: https://pastebin.com/MQ8CZVrS
> This does seem like a suspicious difference between this node and the others, but everything you've described so far in terms of configuration and process seems good to me.
Correct; staging also has the same configuration, as we're deploying Kubernetes via Ansible, yet the outcome differs between this node and the other 7 nodes.
I've also started to test whether adding a new `worker01` node, which is simply a VM with one network interface attached, would have the same issue.

```
master00~$ kubeadm token create --print-join-command
<command_here>
worker01~$ kubeadm join <endpoint> --token <token> --discovery-token-ca-cert-hash <hash>
```

The `<endpoint>` is reachable, so the join works fine:

```
master00~$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master00 Ready control-plane,master 42d v1.23.4 172.16.0.180 <none> Ubuntu 22.04.1 LTS 5.15.0-73-generic containerd://1.7.2
master01 Ready control-plane,master 42d v1.23.4 172.16.0.181 <none> Ubuntu 22.04.1 LTS 5.15.0-73-generic containerd://1.7.2
master02 Ready control-plane,master 42d v1.23.4 172.16.0.182 <none> Ubuntu 22.04.1 LTS 5.15.0-73-generic containerd://1.7.2
worker00 Ready <none> 5d19h v1.23.4 172.16.0.169 <none> Ubuntu 22.04.3 LTS 5.15.0-84-generic containerd://1.7.2
worker01 Ready <none> 144m v1.23.4 172.16.0.168 <none> Ubuntu 22.04.3 LTS 5.15.0-84-generic containerd://1.7.2
```
However, Calico is not working on this new VM either, though it seems I'm having a different issue now:

```
2023-10-04 11:42:14.337 [FATAL][22] tunnel-ip-allocator/allocateip.go 229: failed to fetch node resource 'worker01' error=Get "https://10.96.0.1:443/api/v1/nodes/worker01": dial tcp 10.96.0.1:443: connect: connection refused
Calico node failed to start
```

This is definitely not related to the initial error, so let's skip it for now (I'll go through the troubleshooting of it separately, but I think it's firewall related).
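(As a quick sanity check, one can test from the node whether the kube-apiserver service VIP is reachable at all; an HTTP response, even 401/403, would mean connectivity is fine, while "connection refused" points at kube-proxy/IPVS programming or a firewall:)

```
worker01~$ curl -k https://10.96.0.1:443/healthz
```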
The `worker01` `calico-node` has the same issue, stuck in `CrashLoopBackOff`, with pretty much the same logs: https://pastebin.com/TZmbSQtg
Something I didn't notice initially is that in the logs from the `worker00` `calico-node` (the physical server one), I can see this line that I had overlooked:

```
2023-09-28 09:06:58.591 [INFO][97] felix/int_dataplane.go 1856: attempted to modprobe nf_conntrack_proto_sctp error=exit status 1 output=""
```

I found this open issue whose last comment has the same error line: https://github.com/projectcalico/calico/issues/5627#issuecomment-1726075725
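(For what it's worth, on recent kernels SCTP conntrack support is typically built into `nf_conntrack` rather than shipped as a separate module, so this particular `modprobe` failure is usually benign. A quick way to check on the node, assuming the kernel config is available under /boot:)

```
worker00~$ modprobe nf_conntrack_proto_sctp; echo $?
worker00~$ grep NF_CT_PROTO_SCTP /boot/config-$(uname -r)
```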
@caseydavenport After many hours of investigating why the `calico-node` pods were not running, I finally found out that, for some reason, the `kubernetes` Ansible role we use for deploying the nodes had a faulty `ansible.builtin.lineinfile` task: it was supposed to change the `SystemdCgroup` value on the `containerd` instance from `false` to `true` in the `/etc/containerd/config.toml` configuration file, as this is a requirement for both `kubelet` and `containerd`: https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd-systemd
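For anyone hitting something similar, here's a minimal sketch of what the intended task could look like, assuming a stock `config.toml` layout (the `restart containerd` handler is hypothetical):

```yaml
- name: Enable SystemdCgroup in containerd
  ansible.builtin.lineinfile:
    path: /etc/containerd/config.toml
    regexp: '^(\s*)SystemdCgroup\s*='
    line: '\1SystemdCgroup = true'
    backrefs: true   # only rewrites lines matching the regexp, preserving indentation via \1
  notify: restart containerd  # hypothetical handler restarting the containerd service
```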
This misconfiguration led to an unhealthy state of pods running on the worker nodes, with no apparent oddities in the logs, masking the underlying problem.
A manual override to set the `SystemdCgroup` value to `true`, followed by restarting the `containerd` and `kubelet` services and rolling the DaemonSet and Deployment resources on Kubernetes, resolved the issue:

```
# to run on both worker nodes
worker0[0,1]$ systemctl restart containerd
worker0[0,1]$ systemctl restart kubelet
# to run on the master node
master00$ kubectl rollout restart deployment -n kube-system calico-kube-controllers
master00$ kubectl rollout restart ds -n kube-system calico-node
```
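After the override, the relevant section of `/etc/containerd/config.toml` should look roughly like this (standard containerd CRI plugin layout):

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```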
The `calico-node` pods transitioned to a healthy state, rectifying the `CrashLoopBackOff` state and restoring pod connectivity within the cluster on the affected worker nodes.
The pods have been running for several hours now without issues:

```
kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-5b7984c9f4-j8pvs 1/1 Running 0 9h 10.244.131.129 worker00 <none> <none>
kube-system calico-node-9rtrc 1/1 Running 1 (14h ago) 14h 172.16.0.168 worker01 <none> <none>
kube-system calico-node-s2cvc 1/1 Running 0 14h 172.16.0.182 master02 <none> <none>
kube-system calico-node-tj72m 1/1 Running 0 14h 172.16.0.180 master00 <none> <none>
kube-system calico-node-wdjtg 1/1 Running 0 14h 172.16.0.181 master01 <none> <none>
kube-system calico-node-zlvdg 1/1 Running 0 14h 172.16.0.169 worker00 <none> <none>
kube-system coredns-64897985d-fpmqz 1/1 Running 0 14h 10.244.5.5 worker01 <none> <none>
kube-system coredns-64897985d-tmvnp 1/1 Running 0 14h 10.244.131.145 worker00 <none> <none>
kube-system etcd-master00 1/1 Running 159 42d 172.16.0.180 master00 <none> <none>
kube-system etcd-master01 1/1 Running 0 42d 172.16.0.181 master01 <none> <none>
kube-system etcd-master02 1/1 Running 0 42d 172.16.0.182 master02 <none> <none>
kube-system kube-apiserver-master00 1/1 Running 213 42d 172.16.0.180 master00 <none> <none>
kube-system kube-apiserver-master01 1/1 Running 4 42d 172.16.0.181 master01 <none> <none>
kube-system kube-apiserver-master02 1/1 Running 4 42d 172.16.0.182 master02 <none> <none>
kube-system kube-controller-manager-master00 1/1 Running 17 (42d ago) 42d 172.16.0.180 master00 <none> <none>
kube-system kube-controller-manager-master01 1/1 Running 4 42d 172.16.0.181 master01 <none> <none>
kube-system kube-controller-manager-master02 1/1 Running 4 42d 172.16.0.182 master02 <none> <none>
kube-system kube-proxy-jnjm5 1/1 Running 0 14h 172.16.0.169 worker00 <none> <none>
kube-system kube-proxy-l594r 1/1 Running 0 42d 172.16.0.181 master01 <none> <none>
kube-system kube-proxy-mwr2v 1/1 Running 0 42d 172.16.0.180 master00 <none> <none>
kube-system kube-proxy-rhxg4 1/1 Running 0 14h 172.16.0.168 worker01 <none> <none>
kube-system kube-proxy-wvvj9 1/1 Running 0 42d 172.16.0.182 master02 <none> <none>
kube-system kube-scheduler-master00 1/1 Running 17 (42d ago) 42d 172.16.0.180 master00 <none> <none>
kube-system kube-scheduler-master01 1/1 Running 4 42d 172.16.0.181 master01 <none> <none>
kube-system kube-scheduler-master02 1/1 Running 4 42d 172.16.0.182 master02 <none> <none>
kube-system traefik-ingress-754c7db4cb-7f748 1/1 Running 0 42d 172.16.0.180 master00 <none> <none>
```
Thank you again for your help and investigation on this; I honestly didn't even think to check this parameter, as it was already set to `true` on all the other nodes.
I'm closing this issue, as this is not related to a Calico error.
Persistent `Error querying BIRD` on worker node with multiple NICs despite using `IP_AUTODETECTION_METHOD` set to `can-reach` to the correct gateway.

Expected Behavior
Calico-node pod should run without issues on nodes with multiple NICs, and BGP sessions between nodes should be established and stable.
Current Behavior

The `calico-node` pod on the `worker00` node is continuously restarting due to a `Readiness probe failed` error. The core error message is: `Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused`. This leads to the pod being in a `CrashLoopBackOff` state.

Although the BGP status from a working Calico pod shows the BGP connection as established, the pod still gets killed. A `tcpdump` on the affected node shows incoming BGP connection attempts that are immediately denied.

Even after removing the `readinessProbe` and `livenessProbe` from the `calico-node`, the pod gets `SIGKILL`, logging this in the `kubectl describe pod`:

No outstanding logs have been found in the `kubelet` service; however, maybe this log from the `containerd` service could be interesting:

Possible Solution
Suspected issues with `calico-node`. However, the exact root cause is yet to be determined.

Steps to Reproduce (for bugs)

1. Set up a cluster where one node (`worker00`) has multiple NICs.
2. Observe the `calico-node` pod status on `worker00`.
3. Capture the BGP traffic on the affected node with `tcpdump`.

Context
This issue has caused network connectivity problems for pods running on the `worker00` node, making it a blocking issue. The goal is to have a stable networking environment for pods across all nodes.

Currently, the Calico pods running on the master nodes are healthy, whereas the `calico-node` on `worker00` isn't:

Your Environment