projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Cluster Networking fails when joining Worker Nodes - BIRD is not ready: BGP not established #4503

Closed massih10 closed 3 years ago

massih10 commented 3 years ago

Hi Everyone, I'm running Kubernetes on a fresh installation of Ubuntu 20.04 Virtual Machines. The Host System is a Debian XEN providing the VMs with static IP Addresses. I initialize my Cluster on the master node with IP 192.168.220.4 via

sudo kubeadm init --apiserver-advertise-address=192.168.220.4 --pod-network-cidr=10.244.0.0/16

and then I install Calico, following the docs at https://docs.projectcalico.org/getting-started/kubernetes/self-managed-onprem/onpremises#install-calico-with-kubernetes-api-datastore-50-nodes-or-less

curl https://docs.projectcalico.org/manifests/calico.yaml -O
kubectl apply -f calico.yaml

While it's only the master node, networking works fine, but as soon as I join my worker node with IP 192.168.220.8, DNS fails.

kubectl describe pod calico-node-jmwf9 -n kube-system

Name:                 calico-node-jmwf9
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ubuntuxen4/192.168.220.4
Events:
Type     Reason     Age                     From     Message
Warning  Unhealthy  3m15s (x4852 over 13h)  kubelet  (combined from similar events): Readiness probe failed: 2021-04-01 05:59:23.745 [INFO][137802] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.220.8

kubectl describe pod calico-node-8pb6d -n kube-system

Name:                 calico-node-8pb6d
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ubuntuxen8/192.168.220.8
Start Time:           Wed, 31 Mar 2021 16:29:18 +0000
Labels:               controller-revision-hash=5c6dfbb5b8
                      k8s-app=calico-node
                      pod-template-generation=5
Annotations:          <none>
Status:               Running
IP:                   192.168.220.8
Events:
Type     Reason     Age                   From     Message
Warning  Unhealthy  41s (x4942 over 13h)  kubelet  (combined from similar events): Readiness probe failed: 2021-04-01 06:14:27.195 [INFO][138541] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.220.4

kubectl get pods -A

NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-69496d8b75-ngdbk   1/1     Running   0          14h
kube-system   calico-node-8pb6d                          0/1     Running   0          13h
kube-system   calico-node-jmwf9                          0/1     Running   0          13h
kube-system   coredns-74ff55c5b-mlp6v                    1/1     Running   2          15h
kube-system   coredns-74ff55c5b-pl5zq                    1/1     Running   2          15h
kube-system   etcd-ubuntuxen4                            1/1     Running   0          15h
kube-system   kube-apiserver-ubuntuxen4                  1/1     Running   0          15h
kube-system   kube-controller-manager-ubuntuxen4         1/1     Running   0          15h
kube-system   kube-proxy-9q96p                           1/1     Running   0          15h
kube-system   kube-proxy-x24pr                           1/1     Running   0          14h
kube-system   kube-scheduler-ubuntuxen4                  1/1     Running   0          15h

sudo calicoctl node status

Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+----------+---------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |  INFO   |
+---------------+-------------------+-------+----------+---------+
| 192.168.220.8 | node-to-node mesh | start | 16:29:25 | Passive |
+---------------+-------------------+-------+----------+---------+

IPv6 BGP status
No IPv6 peers found.
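To dig into why the session sits in Passive, BIRD can also be queried directly inside the calico-node container (a sketch based on Calico's troubleshooting docs; the pod name is taken from the output above, and DRY_RUN=1 only prints the command instead of running it):

```shell
# Ask BIRD itself for protocol state inside the calico-node pod.
# DRY_RUN=1 prints the command instead of executing it (no cluster needed).
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run kubectl exec -n kube-system calico-node-jmwf9 -c calico-node -- \
  birdcl -s /var/run/calico/bird.ctl show protocols all
```

In a healthy mesh the per-peer protocol (named something like Mesh_192_168_220_8) should report Established; a session stuck in Passive or Active usually points at blocked TCP 179 between the nodes.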

ip link

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:4c:9a:32 brd ff:ff:ff:ff:ff:ff
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:5b:46:72:51 brd ff:ff:ff:ff:ff:ff
16: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
30: datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether be:53:d0:d6:0e:fe brd ff:ff:ff:ff:ff:ff
34: vethwe-datapath@vethwe-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master datapath state UP mode DEFAULT group default
    link/ether d2:06:ae:c5:54:c0 brd ff:ff:ff:ff:ff:ff
35: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
    link/ether 26:bb:ec:08:48:3a brd ff:ff:ff:ff:ff:ff
36: vxlan-6784: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc noqueue master datapath state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 1a:8c:e7:b1:02:44 brd ff:ff:ff:ff:ff:ff
39: calif09669f5d93@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP mode DEFAULT group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
40: cali3725352823f@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP mode DEFAULT group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 1
41: calief922f19ced@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP mode DEFAULT group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 2

Expected Behavior

Cluster DNS should work without problems, and BGP peering between the nodes should be established automatically. Instead, the peer connection remains in the Passive state.

Current Behavior

Warning  Unhealthy  Readiness probe failed: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.220.8

Possible Solution

I have already manually set the correct network adapter via

kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth0
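For reference, there are other autodetection variants that can help in multi-NIC setups (a sketch; the can-reach and cidr methods are from Calico's node configuration docs, and the address and subnet below are just this cluster's values; DRY_RUN=1 prints the commands instead of running them):

```shell
# Alternative IP_AUTODETECTION_METHOD settings for calico-node.
# DRY_RUN=1 prints the commands instead of executing them (no cluster needed).
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# Pick whichever interface routes to a known-good address (the master, here):
run kubectl set env daemonset/calico-node -n kube-system \
  IP_AUTODETECTION_METHOD=can-reach=192.168.220.4

# Or pick the first interface with an address inside the node subnet:
run kubectl set env daemonset/calico-node -n kube-system \
  IP_AUTODETECTION_METHOD=cidr=192.168.220.0/24
```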

Steps to Reproduce (for bugs)

  1. sudo kubeadm init --apiserver-advertise-address=192.168.220.4 --pod-network-cidr=10.244.0.0/16
  2. kubectl apply -f calico.yaml
  3. kubeadm join 192.168.220.4:6443 --token d7p4qq.o2o7sqn36fz5ahj3 --discovery-token-ca-cert-hash sha256:e2a6340f911e3c835cc9c2c9b7d8ce413b1d7927f8579c71f10a7b189f06fb62

Your Environment

calicoctl version
Client Version:   v3.14.0
Git commit:       c97876ba
Cluster Version:  v3.18.1
Cluster Type:     k8s,bgp,kubeadm,kdd

kubectl version --short
Client Version: v1.20.5
Server Version: v1.20.5

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

Any help would be much appreciated.

caseydavenport commented 3 years ago

@massih10 have you ensured that your underlying network / firewall rules allow the necessary BGP traffic? Requirements are listed here: https://docs.projectcalico.org/getting-started/kubernetes/requirements#network-requirements

Assuming you have enabled TCP 179 in your firewall, could you also run calicoctl node status on the second node and post its output here?
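As a quick check, the BGP port can be probed from each node toward its peer (a sketch using bash's /dev/tcp; nc or nmap work just as well; this only tests TCP reachability, not an actual BGP session):

```shell
# Probe TCP 179 (BGP) toward a peer node; returns 0 if the port accepts
# connections within 2 seconds.
check_bgp_port() {
  timeout 2 bash -c ">/dev/tcp/$1/179" 2>/dev/null
}

check_bgp_port 192.168.220.8 && echo "179/tcp reachable" || echo "179/tcp blocked"
```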

massih10 commented 3 years ago

Sorry for not replying.

The problem was with my iptables rules. They were messed up from too many installations of Calico / Flannel / Weave.

Resetting kubeadm, flushing iptables (everything except the Docker-related rules), re-initializing the kubeadm cluster, and redeploying Calico solved it for me.
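That sequence can be sketched roughly as follows (assumed details, run as root; note this blunt version flushes everything and restarts Docker so it recreates its own chains, rather than surgically preserving the Docker rules; DRY_RUN=1 prints the commands instead of executing them):

```shell
# Sketch of the reset-and-redeploy sequence; DRY_RUN=1 only prints commands.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run kubeadm reset -f
# Flush all iptables rules and user chains, then restart Docker so it
# recreates its own rules.
run iptables -F
run iptables -t nat -F
run iptables -t mangle -F
run iptables -X
run systemctl restart docker
run kubeadm init --apiserver-advertise-address=192.168.220.4 --pod-network-cidr=10.244.0.0/16
run kubectl apply -f calico.yaml
```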

daveherrington commented 1 year ago

I had this exact problem after setting up a new k8s cluster and using the Calico operator to install Calico per the instructions. None of the nodes could see any other peers. Adding port 179/tcp and port 4789/udp to my firewall configuration resolved the issue. It might be good to include a firewall-rule check in the install steps, to make sure all of the necessary ports are open and avoid a troubleshooting exercise. I have to admit, however, that having to troubleshoot this issue taught me more about how Calico works.
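For anyone hitting the same thing, opening those two ports looks roughly like this (a firewalld sketch; I'm assuming firewalld here, so substitute the equivalent ufw or iptables rules if that's what your distro uses; DRY_RUN=1 prints the commands instead of running them):

```shell
# Open the ports Calico's BGP mesh (179/tcp) and VXLAN (4789/udp) need.
# DRY_RUN=1 prints the commands instead of executing them (no root needed).
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run firewall-cmd --permanent --add-port=179/tcp
run firewall-cmd --permanent --add-port=4789/udp
run firewall-cmd --reload
```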