projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Calico node crashing without error message on Raspberry Pi 4 connected with wireless wlan0 #8819

Closed: tkislan closed this issue 2 weeks ago

tkislan commented 4 months ago
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
      - blockSize: 26
        cidr: 10.244.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
    nodeAddressAutodetectionV4:
      interface: tun0
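
For reference, the operator's autodetection can also be keyed on reachability rather than an interface name. A minimal alternative sketch, assuming 10.8.0.1 is the OpenVPN server (an assumption based on the BGP peer addresses later in this thread):

nodeAddressAutodetectionV4:
  # Sketch only: pick whichever interface has a route to the VPN
  # server, rather than naming tun0 explicitly.
  canReach: 10.8.0.1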

I'm using an OpenVPN network to connect edge devices to the master node running in the cloud. I have an Intel NUC device on the same network as the problematic Raspberry Pi, and it works as expected.

ip addr output:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether dc:a6:32:9f:c1:27 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether dc:a6:32:9f:c1:28 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.23/24 metric 600 brd 192.168.1.255 scope global dynamic wlan0
       valid_lft 11832sec preferred_lft 11832sec
4: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 500
    link/none
    inet 10.8.0.7/24 scope global tun0
       valid_lft forever preferred_lft forever
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:02:f2:d8:dc brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
9: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 10.244.210.192/32 scope global tunl0
       valid_lft forever preferred_lft forever
50: calib9ebbc1fedc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-d27bd61c-6107-d514-2f04-31a40d632e19
54: cali82b6a9674c3@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-b6010d39-9714-044e-86b7-7d308d8f310c

The Ethernet port is not used; tun0 is the interface Calico should use, configured through autodetection, while wlan0 is the interface connected to the internet.
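
A quick way to confirm which address Calico actually autodetected, as a sketch (the node name is a placeholder):

# Placeholder node name; the bgp section of the Node resource shows
# the autodetected IPv4 address.
calicoctl get node my-raspberry-pi -o yaml | grep -A3 bgp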

There are no logs indicating any kind of error; calico-node just ends up in the Completed state and gets restarted. Other pods then fail DNS resolution, probably because the kube-proxy pod is crashing as well.
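
When a container exits without logging anything, the exit code recorded by the kubelet can still hint at what terminated it. A minimal sketch, with a placeholder pod name:

# Placeholder pod name; prints the last recorded termination state
# (exit code, reason, signal) of the calico-node container.
kubectl get pod -n calico-system calico-node-xxxxx \
  -o jsonpath='{.status.containerStatuses[?(@.name=="calico-node")].lastState.terminated}'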

But what is very suspicious is that there are multiple log entries in calico-node with EndpointId=eth0, which doesn't make sense, because eth0 is down and not used.

Logs: calico-node-describe.txt, calico-node.log, csi-node-driver.log, kube-proxy-describe.txt, kube-proxy.log

Current Behavior

Endless CrashLoopBackOff, no pods working on the node

Steps to Reproduce (for bugs)

  1. Install the Calico Tigera operator
  2. kubeadm join the Raspberry Pi over its wlan0 interface (a sketch of this step follows below)
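
A minimal sketch of the join step; the endpoint, token, and hash are placeholders, and reaching the API server via the VPN address 10.8.0.1 is an assumption:

# Run on the Raspberry Pi; all values below are placeholders.
sudo kubeadm join 10.8.0.1:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>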

tomastigera commented 4 months ago

@tkislan could it be related to https://github.com/projectcalico/calico/issues/8726 ?

tkislan commented 4 months ago

Unloading the kernel module and restarting the pods doesn't seem to have helped:

  Warning  Unhealthy       20s (x2 over 21s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy       16s                kubelet            Readiness probe failed: 2024-05-14 16:56:55.526 [INFO][243] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.8.0.13,10.8.0.1,10.8.0.6

At least here it seems it's getting killed because of the health check.
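
For the BGP side specifically, calicoctl can print the state of each peering session, which would narrow down whether the VPN peers are reachable at all. A sketch, run as root on the Pi:

# Prints the BIRD BGP session table for this node.
./calicoctl node status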

# ls -l /var/run/calico
total 0
srw-rw---- 1 root root  0 May 14 19:00 bird.ctl
srw-rw---- 1 root root  0 May 14 19:00 bird6.ctl
drwx------ 2 root root 40 May 14 18:56 cgroup
-rw------- 1 root root  0 May 14 18:56 ipam.lock

But the files do exist on the host.
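
Since the readiness probe runs inside the container, it may be worth checking that the socket is visible from there as well. A sketch with a placeholder pod name:

# Placeholder pod name; compare with the host-side listing above.
kubectl exec -n calico-system calico-node-xxxxx -- ls -l /var/run/calico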

# ./calicoctl node checksystem
Checking kernel version...
        6.8.0-1004-raspi                        OK
Checking kernel modules...
        nf_conntrack_netlink                    OK
        xt_addrtype                             OK
        xt_icmp                                 OK
        ip_set                                  OK
        ip6_tables                              OK
        ip_tables                               OK
        ipt_rpfilter                            OK
        xt_mark                                 OK
        xt_multiport                            OK
        vfio-pci                                OK
        xt_bpf                                  OK
        ipt_REJECT                              OK
        xt_rpfilter                             OK
        ipt_set                                 OK
        xt_icmp6                                OK
        ipt_ipvs                                OK
        xt_conntrack                            OK
        xt_set                                  OK
        xt_u32                                  OK
System meets minimum system requirements to run Calico!

Let me know what more information I can provide. I'm really desperate here; I've been trying to figure this out for the past 3 days.

coutinhop commented 3 months ago

@tkislan could you please enable debug logging (by setting logSeverityScreen to Debug in the default FelixConfiguration), and see if that gives us more info?

kubectl patch felixconfiguration default --type merge --patch='{"spec":{"logSeverityScreen":"Debug"}}'
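
To confirm the patch took effect, and to revert it afterwards (Info is assumed to be the prior setting, as it is the usual default):

# Verify the active setting.
kubectl get felixconfiguration default -o jsonpath='{.spec.logSeverityScreen}'
# Revert when finished debugging.
kubectl patch felixconfiguration default --type merge --patch='{"spec":{"logSeverityScreen":"Info"}}'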
caseydavenport commented 3 months ago

But what is very suspicious is that there are multiple log entries in calico-node with EndpointId=eth0, which doesn't make sense, because eth0 is down and not used.

This is referring to the endpoint name within the container, not the host's eth0, so I think this is OK and a red herring.

Typically, when calico/node just stops without any indication, it's due to kubelet or something external to Calico shutting us down for some reason.

Looking at the logs, it appears that calico/node is reporting that it is "live", so it is unlikely to be due to the liveness probe.

I think you may want to look at the kubelet or container runtime logs here to see if either of those suggest they are terminating the calico/node pod.
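
A sketch of what that check might look like on the Pi, assuming a systemd-managed kubelet and containerd as the runtime:

# Kubelet's view of the pod (probe failures, kills, OOM events).
sudo journalctl -u kubelet --since "1 hour ago" | grep -i calico
# The container runtime's view, assuming containerd.
sudo journalctl -u containerd --since "1 hour ago" | grep -i calico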

caseydavenport commented 2 months ago

Any news on this issue? Did you get a chance to look at the kubelet / runtime logs to see if either is killing Calico?

fasaxc commented 2 weeks ago

Closing as stale.