projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

calico-bpf Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory #8929

Closed. dyrnq closed this issue 5 months ago.

dyrnq commented 5 months ago

Expected Behavior

All calico-node pods become Ready (1/1) after deploying the eBPF manifest.

Current Behavior

Every calico-node pod stays 0/1 Running; its readiness probe keeps failing as shown below.

kubectl describe pod/$(kubectl get po -o jsonpath={.items[0].metadata.name} -l k8s-app=calico-node -n kube-system) -n kube-system
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  24m                default-scheduler  Successfully assigned kube-system/calico-node-9vpng to master3
  Normal   Pulling    24m                kubelet            Pulling image "docker.io/calico/cni:v3.28.0"
  Normal   Pulled     24m                kubelet            Successfully pulled image "docker.io/calico/cni:v3.28.0" in 14.98s (35.738s including waiting). Image size: 94536228 bytes.
  Normal   Created    24m                kubelet            Created container upgrade-ipam
  Normal   Started    24m                kubelet            Started container upgrade-ipam
  Normal   Pulled     24m                kubelet            Container image "docker.io/calico/cni:v3.28.0" already present on machine
  Normal   Created    24m                kubelet            Created container install-cni
  Normal   Started    24m                kubelet            Started container install-cni
  Normal   Pulling    24m                kubelet            Pulling image "docker.io/calico/node:v3.28.0"
  Normal   Pulled     23m                kubelet            Successfully pulled image "docker.io/calico/node:v3.28.0" in 14.495s (14.495s including waiting). Image size: 115239232 bytes.
  Normal   Created    23m                kubelet            Created container mount-bpffs
  Normal   Started    23m                kubelet            Started container mount-bpffs
  Normal   Pulled     23m                kubelet            Container image "docker.io/calico/node:v3.28.0" already present on machine
  Normal   Created    23m                kubelet            Created container calico-node
  Normal   Started    23m                kubelet            Started container calico-node
  Warning  Unhealthy  23m                kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
  Warning  Unhealthy  23m (x2 over 23m)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  23m                kubelet            Readiness probe failed: 2024-06-19 01:57:27.880 [INFO][305] confd/health.go 202: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  23m  kubelet  Readiness probe failed: 2024-06-19 01:57:37.868 [INFO][470] confd/health.go 202: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  23m  kubelet  Readiness probe failed: 2024-06-19 01:57:47.885 [INFO][630] confd/health.go 202: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  23m  kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: Get "http://localhost:9099/liveness": dial tcp 127.0.0.1:9099: connect: connection refused
  Warning  Unhealthy  23m  kubelet  Readiness probe failed: 2024-06-19 01:57:57.880 [INFO][767] confd/health.go 202: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  22m  kubelet  Readiness probe failed: 2024-06-19 01:58:07.866 [INFO][928] confd/health.go 202: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  22m  kubelet  Readiness probe failed: 2024-06-19 01:58:17.849 [INFO][1063] confd/health.go 202: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
  Warning  Unhealthy  4m32s (x125 over 22m)  kubelet  (combined from similar events): Readiness probe failed: 2024-06-19 02:16:27.866 [INFO][17959] confd/health.go 202: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: readiness probe reporting 503

Possible Solution

kubectl get no,pod -o wide -A
NAME           STATUS   ROLES           AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node/master1   Ready    control-plane   10m     v1.30.1   192.168.55.111   <none>        Ubuntu 20.04.6 LTS   5.4.0-173-generic   containerd://1.6.33
node/master2   Ready    control-plane   9m36s   v1.30.1   192.168.55.112   <none>        Ubuntu 20.04.6 LTS   5.4.0-173-generic   containerd://1.6.33
node/master3   Ready    control-plane   9m41s   v1.30.1   192.168.55.113   <none>        Ubuntu 20.04.6 LTS   5.4.0-173-generic   containerd://1.6.33

NAMESPACE              NAME                                             READY   STATUS      RESTARTS        AGE     IP               NODE      NOMINATED NODE   READINESS GATES
ingress-nginx          pod/ingress-nginx-admission-create-k2gxl         0/1     Completed   0               9m42s   10.244.137.66    master1   <none>           <none>
ingress-nginx          pod/ingress-nginx-admission-patch-b22ph          0/1     Completed   0               9m42s   10.244.137.72    master1   <none>           <none>
ingress-nginx          pod/ingress-nginx-controller-6f848d7788-cdjg6    1/1     Running     0               9m42s   10.244.137.73    master1   <none>           <none>
ingress-nginx          pod/ingress-nginx-controller-6f848d7788-mmvns    1/1     Running     0               9m42s   10.244.137.74    master1   <none>           <none>
kube-system            pod/calico-kube-controllers-564985c589-5jrwl     1/1     Running     0               9m42s   10.244.137.67    master1   <none>           <none>
kube-system            pod/calico-node-9vpng                            0/1     Running     0               9m41s   192.168.55.113   master3   <none>           <none>
kube-system            pod/calico-node-h6z47                            0/1     Running     0               9m36s   192.168.55.112   master2   <none>           <none>
kube-system            pod/calico-node-pltbx                            0/1     Running     0               9m42s   192.168.55.111   master1   <none>           <none>
kube-system            pod/coredns-7db6d8ff4d-k2q6c                     1/1     Running     0               9m42s   10.244.137.69    master1   <none>           <none>
kube-system            pod/coredns-7db6d8ff4d-nkfrs                     1/1     Running     0               9m42s   10.244.137.68    master1   <none>           <none>
kube-system            pod/kube-apiserver-master1                       1/1     Running     0               9m59s   192.168.55.111   master1   <none>           <none>
kube-system            pod/kube-apiserver-master2                       1/1     Running     0               9m21s   192.168.55.112   master2   <none>           <none>
kube-system            pod/kube-apiserver-master3                       1/1     Running     0               9m22s   192.168.55.113   master3   <none>           <none>
kube-system            pod/kube-controller-manager-master1              1/1     Running     1 (9m59s ago)   9m59s   192.168.55.111   master1   <none>           <none>
kube-system            pod/kube-controller-manager-master2              1/1     Running     0               9m16s   192.168.55.112   master2   <none>           <none>
kube-system            pod/kube-controller-manager-master3              1/1     Running     0               9m27s   192.168.55.113   master3   <none>           <none>
kube-system            pod/kube-proxy-8mkb8                             1/1     Running     0               9m41s   192.168.55.113   master3   <none>           <none>
kube-system            pod/kube-proxy-g6nxf                             1/1     Running     0               9m36s   192.168.55.112   master2   <none>           <none>
kube-system            pod/kube-proxy-z5gj7                             1/1     Running     0               9m42s   192.168.55.111   master1   <none>           <none>
kube-system            pod/kube-scheduler-master1                       1/1     Running     0               9m59s   192.168.55.111   master1   <none>           <none>
kube-system            pod/kube-scheduler-master2                       1/1     Running     0               9m27s   192.168.55.112   master2   <none>           <none>
kube-system            pod/kube-scheduler-master3                       1/1     Running     0               9m33s   192.168.55.113   master3   <none>           <none>
kube-system            pod/kubelet-csr-approver-6df44c648f-cn5q7        1/1     Running     0               9m42s   10.244.137.65    master1   <none>           <none>
kube-system            pod/metrics-server-758fd799ff-dvtz6              1/1     Running     0               9m42s   10.244.137.64    master1   <none>           <none>
kubernetes-dashboard   pod/dashboard-metrics-scraper-795895d745-vhz8l   1/1     Running     0               9m42s   10.244.137.70    master1   <none>           <none>
kubernetes-dashboard   pod/kubernetes-dashboard-697d5b47c4-vtgf7        1/1     Running     0               9m42s   10.244.137.71    master1   <none>           <none>

Steps to Reproduce (for bugs)

  1. Install a cluster with 3 control-plane (master) nodes
  2. Deploy https://github.com/projectcalico/calico/blob/v3.28.0/manifests/calico-bpf.yaml
  3. kubectl patch felixconfiguration default --type merge --patch='{"spec":{"logSeverityScreen":"Debug"}}'

kubectl get felixconfiguration default -o yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  annotations:
    projectcalico.org/metadata: '{"creationTimestamp":"2024-06-19T01:57:00Z"}'
  creationTimestamp: "2024-06-19T01:57:00Z"
  generation: 2
  name: default
  resourceVersion: "2038"
  uid: 21df6527-8d87-442d-84b8-263b17186e37
spec:
  bpfConnectTimeLoadBalancing: TCP
  bpfHostNetworkedNATWithoutCTLB: Enabled
  bpfLogLevel: ""
  floatingIPs: Disabled
  logSeverityScreen: Debug
  reportingInterval: 0s
kubectl get -n kube-system ds calico-node -o yaml | grep -A1 -E "BPF|POOL_VXLAN|IPIP" |grep -v -E "{|Time"

--
        - name: FELIX_BPFENABLED
          value: "true"
--
        - name: CALICO_IPV4POOL_IPIP
          value: Never
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: CALICO_IPV6POOL_VXLAN
          value: Never
(
calicoctl get nodes -o wide --allow-version-mismatch
calicoctl get felixconfiguration default -o yaml --allow-version-mismatch
)

NAME      ASN       IPV4                IPV6   
master1   (64512)   192.168.55.111/24          
master2   (64512)   192.168.55.112/24          
master3   (64512)   192.168.55.113/24          

apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  creationTimestamp: "2024-06-19T01:57:00Z"
  name: default
  resourceVersion: "2038"
  uid: 84fba6e4-b1e1-44b4-841d-64dce81876ec
spec:
  bpfConnectTimeLoadBalancing: TCP
  bpfHostNetworkedNATWithoutCTLB: Enabled
  bpfLogLevel: ""
  floatingIPs: Disabled
  logSeverityScreen: Debug
  reportingInterval: 0s
ls -l /var/run/calico
total 0
srw-rw---- 1 root root  0 Jun 19 09:57 bird.ctl
srw-rw---- 1 root root  0 Jun 19 09:57 bird6.ctl
drw------- 3 root root 60 Jun 19 09:57 bpf
dr-xr-xr-x 6 root root  0 Jun 19 09:55 cgroup
-rw------- 1 root root  0 Jun 19 09:57 ipam.lock

Context

Your Environment

- Calico version: v3.28.0 (calico-bpf.yaml manifest, eBPF data plane)
- Kubernetes version: v1.30.1 (kubeadm, 3 control-plane nodes)
- OS / kernel: Ubuntu 20.04.6 LTS / 5.4.0-173-generic
- Container runtime: containerd 1.6.33

dyrnq commented 5 months ago
2024-06-19 02:12:07.863 [DEBUG][13926] felix/health.go 331: Calculated health summary: live=true ready=false
+---------------------------+---------+----------------+---------------------+-----------------+
|         COMPONENT         | TIMEOUT |    LIVENESS    |      READINESS      |     DETAIL      |
+---------------------------+---------+----------------+---------------------+-----------------+
| BPFEndpointManager        | -       | -              | reporting non-ready | Not yet synced. |
| CalculationGraph          | 30s     | reporting live | reporting ready     |                 |
| FelixStartup              | -       | reporting live | reporting ready     |                 |
| InternalDataplaneMainLoop | 1m30s   | reporting live | reporting non-ready |                 |
+---------------------------+---------+----------------+---------------------+-----------------+
2024-06-19 02:12:07.863 [INFO][13926] felix/health.go 336: Overall health status changed: live=true ready=false
+---------------------------+---------+----------------+---------------------+-----------------+
|         COMPONENT         | TIMEOUT |    LIVENESS    |      READINESS      |     DETAIL      |
+---------------------------+---------+----------------+---------------------+-----------------+
| BPFEndpointManager        | -       | -              | reporting non-ready | Not yet synced. |
| CalculationGraph          | 30s     | reporting live | reporting ready     |                 |
| FelixStartup              | -       | reporting live | reporting ready     |                 |
| InternalDataplaneMainLoop | 1m30s   | reporting live | reporting non-ready |                 |
+---------------------------+---------+----------------+---------------------+-----------------+

calico-node.log

dyrnq commented 5 months ago

calico-node-describe.log

ss -tunlp |grep bird
tcp    LISTEN  0       8                   0.0.0.0:179            0.0.0.0:*      users:(("bird",pid=14049,fd=7))                                                
curl http://localhost:9099/liveness
+---------------------------+---------+----------------+---------------------+-----------------+
|         COMPONENT         | TIMEOUT |    LIVENESS    |      READINESS      |     DETAIL      |
+---------------------------+---------+----------------+---------------------+-----------------+
| BPFEndpointManager        | -       | -              | reporting non-ready | Not yet synced. |
| CalculationGraph          | 30s     | reporting live | reporting ready     |                 |
| FelixStartup              | -       | reporting live | reporting ready     |                 |
| InternalDataplaneMainLoop | 1m30s   | reporting live | reporting non-ready |                 |
+---------------------------+---------+----------------+---------------------+-----------------+
kubectl get ippool default-ipv4-ippool -o yaml
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  annotations:
    projectcalico.org/metadata: '{"creationTimestamp":"2024-06-19T01:57:00Z"}'
  creationTimestamp: "2024-06-19T01:57:00Z"
  generation: 1
  name: default-ipv4-ippool
  resourceVersion: "989"
  uid: e6779b53-a976-43cc-999e-52eef4e3b026
spec:
  allowedUses:
  - Workload
  - Tunnel
  blockSize: 26
  cidr: 10.244.0.0/16
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Never
sridhartigera commented 5 months ago

I see that kube-proxy is running. It needs to be disabled. Can you confirm that you followed the steps in the guide? https://docs.tigera.io/calico/latest/operations/ebpf/enabling-ebpf

tomastigera commented 5 months ago

Felix in calico-node is restarting a lot because kube-proxy is running in IPVS mode.

2024-06-19 02:12:07.689 [INFO][13926] felix/int_dataplane.go 1347: kube-proxy mode changed. Restart felix. ipvsIfaceState="down" ipvsSupport=false

Please make sure that you follow all the steps when enabling eBPF, including setting the config map, etc.
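
For reference, the config-map step from that guide looks roughly like this for a manifest-based install (the host and port values below are placeholders; point them at a stable address for your API server or its load balancer):

kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: kube-system
data:
  KUBERNETES_SERVICE_HOST: "192.168.55.111"
  KUBERNETES_SERVICE_PORT: "6443"

This is what lets calico-node keep reaching the API server once kube-proxy (and the Service routing it provided) is gone.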

dyrnq commented 5 months ago

Felix in calico-node is restarting a lot because kube-proxy is running in IPVS mode.

2024-06-19 02:12:07.689 [INFO][13926] felix/int_dataplane.go 1347: kube-proxy mode changed. Restart felix. ipvsIfaceState="down" ipvsSupport=false

Please make sure that you follow all the steps when enabling eBPF, including setting the config map, etc.

@sridhartigera @tomastigera Thanks, it's working fine.

It works after the steps below:

I. kubectl patch ds -n kube-system kube-proxy -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico": "true"}}}}}'

II. AND the node must be rebooted.

At the beginning, I had only executed kubectl patch ds -n kube-system kube-proxy -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico": "true"}}}}}' but did not reboot the node.

Maybe there is a better way (one that does not need a restart)?

Working result:

NAME           STATUS   ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node/master1   Ready    control-plane   21h   v1.30.1   192.168.55.111   <none>        Ubuntu 20.04.6 LTS   5.4.0-186-generic   containerd://1.6.33
node/master2   Ready    control-plane   21h   v1.30.1   192.168.55.112   <none>        Ubuntu 20.04.6 LTS   5.4.0-186-generic   containerd://1.6.33
node/master3   Ready    control-plane   21h   v1.30.1   192.168.55.113   <none>        Ubuntu 20.04.6 LTS   5.4.0-186-generic   containerd://1.6.33

NAMESPACE              NAME                                             READY   STATUS    RESTARTS         AGE   IP               NODE      NOMINATED NODE   READINESS GATES
ingress-nginx          pod/ingress-nginx-controller-6f848d7788-dg7mp    1/1     Running   1 (9m ago)       18h   10.244.137.92    master1   <none>           <none>
ingress-nginx          pod/ingress-nginx-controller-6f848d7788-z5gz7    1/1     Running   1 (9m ago)       18h   10.244.137.90    master1   <none>           <none>
kube-system            pod/calico-kube-controllers-564985c589-krlm5     1/1     Running   18 (11m ago)     18h   10.244.180.5     master2   <none>           <none>
kube-system            pod/calico-node-9vpng                            1/1     Running   20 (8m39s ago)   21h   192.168.55.113   master3   <none>           <none>
kube-system            pod/calico-node-h6z47                            1/1     Running   26 (11m ago)     21h   192.168.55.112   master2   <none>           <none>
kube-system            pod/calico-node-pltbx                            1/1     Running   8 (9m ago)       21h   192.168.55.111   master1   <none>           <none>
kube-system            pod/coredns-7db6d8ff4d-c26pp                     1/1     Running   3 (11m ago)      18h   10.244.180.4     master2   <none>           <none>
kube-system            pod/coredns-7db6d8ff4d-zqmwt                     1/1     Running   4 (8m39s ago)    18h   10.244.136.5     master3   <none>           <none>
kube-system            pod/kube-apiserver-master1                       1/1     Running   8 (9m ago)       21h   192.168.55.111   master1   <none>           <none>
kube-system            pod/kube-apiserver-master2                       1/1     Running   17 (11m ago)     21h   192.168.55.112   master2   <none>           <none>
kube-system            pod/kube-apiserver-master3                       1/1     Running   12 (8m39s ago)   21h   192.168.55.113   master3   <none>           <none>
kube-system            pod/kube-controller-manager-master1              1/1     Running   6 (9m ago)       21h   192.168.55.111   master1   <none>           <none>
kube-system            pod/kube-controller-manager-master2              1/1     Running   30 (11m ago)     21h   192.168.55.112   master2   <none>           <none>
kube-system            pod/kube-controller-manager-master3              1/1     Running   29 (8m39s ago)   21h   192.168.55.113   master3   <none>           <none>
kube-system            pod/kube-scheduler-master1                       1/1     Running   6 (9m ago)       21h   192.168.55.111   master1   <none>           <none>
kube-system            pod/kube-scheduler-master2                       1/1     Running   28 (11m ago)     21h   192.168.55.112   master2   <none>           <none>
kube-system            pod/kube-scheduler-master3                       1/1     Running   27 (8m39s ago)   21h   192.168.55.113   master3   <none>           <none>
kube-system            pod/kubelet-csr-approver-6df44c648f-kkhn7        1/1     Running   20 (8m39s ago)   18h   10.244.136.6     master3   <none>           <none>
kube-system            pod/metrics-server-758fd799ff-p6txd              1/1     Running   2 (8m39s ago)    18h   10.244.137.91    master1   <none>           <none>
kubernetes-dashboard   pod/dashboard-metrics-scraper-795895d745-vfcgj   1/1     Running   1 (9m ago)       18h   10.244.137.93    master1   <none>           <none>
kubernetes-dashboard   pod/kubernetes-dashboard-697d5b47c4-p9hz6        1/1     Running   2 (8m14s ago)    18h   10.244.137.89    master1   <none>           <none>
curl http://localhost:9099/liveness
+---------------------------+---------+----------------+-----------------+--------+
|         COMPONENT         | TIMEOUT |    LIVENESS    |    READINESS    | DETAIL |
+---------------------------+---------+----------------+-----------------+--------+
| BPFEndpointManager        | -       | -              | reporting ready |        |
| CalculationGraph          | 30s     | reporting live | reporting ready |        |
| FelixStartup              | -       | reporting live | reporting ready |        |
| InternalDataplaneMainLoop | 1m30s   | reporting live | reporting ready |        |
+---------------------------+---------+----------------+-----------------+--------+
dyrnq commented 5 months ago

I also tried:

I. kubectl -n kube-system delete ds kube-proxy and kubectl -n kube-system delete cm kube-proxy

II. kubectl apply -f https://github.com/projectcalico/calico/blob/v3.28.0/manifests/calico-bpf.yaml

The result is the same. Maybe the cluster MUST be initialized with kubeadm init --skip-phases=addon/kube-proxy?

curl http://localhost:9099/liveness
+---------------------------+---------+----------------+---------------------+-----------------+
|         COMPONENT         | TIMEOUT |    LIVENESS    |      READINESS      |     DETAIL      |
+---------------------------+---------+----------------+---------------------+-----------------+
| BPFEndpointManager        | -       | -              | reporting non-ready | Not yet synced. |
| CalculationGraph          | 30s     | reporting live | reporting ready     |                 |
| FelixStartup              | -       | reporting live | reporting ready     |                 |
| InternalDataplaneMainLoop | 1m30s   | reporting live | reporting non-ready |                 |
+---------------------------+---------+----------------+---------------------+-----------------+

Finally I tried kubeadm init --skip-phases=addon/kube-proxy. Indeed that is the cleanest approach; everything works well.
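
For reference, a from-scratch init that skips the kube-proxy addon would look roughly like this (the pod network CIDR matches the 10.244.0.0/16 IPPool above; adjust it to your cluster):

kubeadm init --skip-phases=addon/kube-proxy --pod-network-cidr=10.244.0.0/16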

dyrnq commented 5 months ago

Finally I found the way to do it without a reboot:

ip link delete kube-ipvs0 :)
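
This presumably works because kube-proxy's IPVS mode creates the kube-ipvs0 dummy interface to hold the Service cluster IPs, and Felix watches that interface, as the "kube-proxy mode changed ... ipvsIfaceState" log above suggests. A minimal per-node cleanup sketch (assumes the iproute2 and ipvsadm tools are installed; run on every node that previously ran kube-proxy in IPVS mode):

ip link show kube-ipvs0    # confirm the stale IPVS dummy interface is still present
ipvsadm --clear            # optionally flush any leftover IPVS virtual-server rules
ip link delete kube-ipvs0  # remove the interface so Felix stops detecting IPVS mode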

sridhartigera commented 5 months ago

Were you running kube-proxy in IPVS mode before switching to eBPF? If so, we don't support going from IPVS mode to eBPF. Better to switch to iptables mode before moving to eBPF.
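
A minimal sketch of that switch, assuming a kubeadm-style cluster where the mode lives in the kube-proxy ConfigMap's config.conf key:

kubectl -n kube-system edit cm kube-proxy             # set mode: "iptables" in config.conf
kubectl -n kube-system rollout restart ds kube-proxy  # restart kube-proxy with the new mode
# leftover IPVS state may still need cleanup on each node, e.g.:
# ipvsadm --clear && ip link delete kube-ipvs0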

dyrnq commented 5 months ago

Were you running kube-proxy in IPVS mode before switching to eBPF? If so, we don't support going from IPVS mode to eBPF. Better to switch to iptables mode before moving to eBPF.

Got it, thanks for the reply

sridhartigera commented 5 months ago

Closing this issue.

blackliner commented 3 months ago

Finally I found the way to do it without a reboot:

ip link delete kube-ipvs0 :)

Thanks!!! I switched from kube-proxy with IPVS (deployed by kubekey) to Calico eBPF, and all of Calico was going nuts until I ran ip link delete kube-ipvs0 on each node! A reboot would also have done it, true.