@tkislan could it be related to https://github.com/projectcalico/calico/issues/8726 ?
Unloading the kernel module and restarting the pods doesn't seem to have helped.
Warning Unhealthy 20s (x2 over 21s) kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Warning Unhealthy 16s kubelet Readiness probe failed: 2024-05-14 16:56:55.526 [INFO][243] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.8.0.13,10.8.0.1,10.8.0.6
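For reference, the BGP peer state reported by the probe can also be inspected directly on the node (a quick check, assuming calicoctl is available there, as in the checksystem run further down):

# Run on the affected node; queries the local BIRD instance and prints
# the state of each BGP peer session.
sudo ./calicoctl node status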
At least here it seems it's getting killed because of the health check.
# ls -l /var/run/calico
total 0
srw-rw---- 1 root root 0 May 14 19:00 bird.ctl
srw-rw---- 1 root root 0 May 14 19:00 bird6.ctl
drwx------ 2 root root 40 May 14 18:56 cgroup
-rw------- 1 root root 0 May 14 18:56 ipam.lock
but the files exist on the host
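It may also be worth verifying the socket from inside the container, since the readiness probe dials the path in the container's mount namespace (a sketch; the pod name is a placeholder, and the namespace is kube-system for manifest installs, calico-system for operator installs):

# Substitute the actual calico-node pod name for calico-node-xxxxx.
kubectl exec -n kube-system calico-node-xxxxx -c calico-node -- ls -l /var/run/calico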
# ./calicoctl node checksystem
Checking kernel version...
6.8.0-1004-raspi OK
Checking kernel modules...
nf_conntrack_netlink OK
xt_addrtype OK
xt_icmp OK
ip_set OK
ip6_tables OK
ip_tables OK
ipt_rpfilter OK
xt_mark OK
xt_multiport OK
vfio-pci OK
xt_bpf OK
ipt_REJECT OK
xt_rpfilter OK
ipt_set OK
xt_icmp6 OK
ipt_ipvs OK
xt_conntrack OK
xt_set OK
xt_u32 OK
System meets minimum system requirements to run Calico!
Let me know what more information I can provide... I'm really desperate here; I've been trying to figure this out for the past 3 days.
@tkislan could you please enable debug logging (by setting logSeverityScreen to Debug in the default FelixConfiguration), and see if that gives us more info?
kubectl patch felixconfiguration default --type merge --patch='{"spec":{"logSeverityScreen":"Debug"}}'
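To confirm the patch took effect, and to revert it afterwards (Debug logging is very verbose), something like:

# Verify the new severity is set on the default FelixConfiguration.
kubectl get felixconfiguration default -o yaml | grep logSeverityScreen
# Revert to the default severity once the debug logs have been captured.
kubectl patch felixconfiguration default --type merge --patch='{"spec":{"logSeverityScreen":"Info"}}'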
But what is very suspicious is that there are multiple log lines in calico-node with EndpointId=eth0, which doesn't make sense, because eth0 is disabled and not used.
This is referring to the endpoint name within the container, not the host's eth0, so I think this is OK and a red herring.
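To see how those names map, you can list the workload endpoints (assuming calicoctl is configured against the cluster datastore); the endpoint names carry the pod-side interface, typically eth0, while the host side is a cali* veth:

# Endpoint names end in the pod-side interface name (eth0); the
# INTERFACE column shows the host-side cali* veth for each endpoint.
calicoctl get workloadendpoints --all-namespaces -o wide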
Typically, when calico/node just stops without any indication, it's due to kubelet or something external to Calico shutting us down for some reason.
Looking at the logs, it appears that calico/node is reporting that it is "live", so it is unlikely to be due to the liveness probe.
I think you may want to look at the kubelet or container runtime logs here to see if either of those suggest they are terminating the calico/node pod.
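For example, on a systemd-based host (adjust the runtime unit name if you're not using containerd):

# Look for kubelet-initiated kills or probe failures around the restart time.
journalctl -u kubelet --since "1 hour ago" | grep -i calico
# And check the container runtime for OOM kills or unexpected exits.
journalctl -u containerd --since "1 hour ago" | grep -i calico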
Any news on this issue? Did you get a chance to look at the kubelet / runtime logs to see if either is killing Calico?
Closing as stale.
I'm using an OpenVPN network to connect edge devices to a master node running in the cloud. I have an Intel NUC device working as expected, on the same network as the problematic Raspberry Pi.
From the ip addr output: the ethernet port is not used, and the tun0 interface should be used, configured through autodetection, where wlan0 is the interface that is connected to the internet.
There are no logs indicating any kind of error; calico-node just ends up in the Completed state and is being restarted, and other pods fail DNS resolution, probably because the kube-proxy pod is crashing as well.
But what is very suspicious is that there are multiple log lines in calico-node with EndpointId=eth0, which doesn't make sense, because eth0 is disabled and not used.
logs: calico-node-describe.txt calico-node.log csi-node-driver.log kube-proxy-describe.txt kube-proxy.log
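If interface autodetection is suspected, one thing to try (a sketch, assuming a manifest-based install where the calico-node DaemonSet lives in kube-system; operator-managed clusters set this via the Installation resource instead) is pinning detection to tun0:

# Forces Calico to pick its node address from tun0 instead of autodetecting.
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=tun0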
Expected Behavior
Current Behavior
Endless CrashLoopBackOff, no pods working on the node
Possible Solution
Steps to Reproduce (for bugs)
Context
Your Environment