projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Calico Node keeps restarting with CrashLoopBackOff #7951

Closed 8Iue closed 11 months ago

8Iue commented 1 year ago

Persistent "Error querying BIRD" on a worker node with multiple NICs, despite setting IP_AUTODETECTION_METHOD to can-reach with the correct gateway address.

Expected Behavior

Calico-node pod should run without issues on nodes with multiple NICs, and BGP sessions between nodes should be established and stable.

Current Behavior

The calico-node pod on the worker00 node is continuously restarting due to a Readiness probe failed error. The core error message is: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused. This leads to the pod being in a CrashLoopBackOff state.

Events:                                                                                                                                                                           
  Type     Reason     Age                    From               Message                                                                    
  ----     ------     ----                   ----               -------                                                                                                                                                                                                                                                                                                   
  Normal   Scheduled  8m55s                  default-scheduler  Successfully assigned kube-system/calico-node-g85cf to worker00                                                                
  Normal   Started    8m53s                  kubelet            Started container mount-bpffs                                                                
  Normal   Created    8m52s                  kubelet            Created container calico-node                                                                                                                                                                                                                                                                             
  Normal   Started    8m52s                  kubelet            Started container calico-node                                                                                                                                                                                                                                                                             
  Warning  Unhealthy  8m50s (x2 over 8m51s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused                                                                                                             
  Warning  Unhealthy  8m46s                  kubelet            Readiness probe failed: 2023-08-21 15:50:19.834 [INFO][362] confd/health.go 180: Number of node(s) with BGP peering established = 2                                                                                                                                                                       
calico/node is not ready: BIRD is not ready: BGP not established with 172.16.0.180                                                                                                                                                                                                                                                                                        
  Normal  Killing         8m36s                  kubelet  Stopping container calico-node                                                                                                                                                                                                                                                                                  
  Normal  Pulled          8m35s (x2 over 8m55s)  kubelet  Container image "docker.io/calico/cni:v3.24.5" already present on machine                                                                                                                                                                                                                                        
  Normal  Created         8m35s (x2 over 8m55s)  kubelet  Created container upgrade-ipam                               
  Normal  Started         8m35s (x2 over 8m55s)  kubelet  Started container upgrade-ipam                                                                                                                                                                                                                                                                                   
  Normal  SandboxChanged  8m35s                  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal  Started         8m34s (x2 over 8m55s)  kubelet  Started container install-cni                                                                                                                                                                                                                                                                                   
  Normal  Created         8m34s (x2 over 8m55s)  kubelet  Created container install-cni                                                                                                                                                                                                                                                                                    
  Normal  Pulled          8m34s (x2 over 8m55s)  kubelet  Container image "docker.io/calico/cni:v3.24.5" already present on machine        
  Normal  Pulled          8m32s (x2 over 8m53s)  kubelet  Container image "docker.io/calico/node:v3.24.5" already present on machine                                                                                                                                                                                                                                       
  Normal  Created         8m32s (x2 over 8m53s)  kubelet  Created container mount-bpffs
  Normal  Pulled          3m44s (x6 over 8m52s)  kubelet  Container image "docker.io/calico/node:v3.24.5" already present on machine

Although the BGP status from a working calico-node pod shows the BGP connection as established, the pod still gets killed. A tcpdump on the affected node shows incoming BGP connection attempts that are immediately reset.

worker00:~$ sudo tcpdump -ni eno1 port 179
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:04:42.411530 IP 172.16.0.182.41261 > 172.16.0.169.179: Flags [S], seq 3590139507, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
14:04:42.411595 IP 172.16.0.169.179 > 172.16.0.182.41261: Flags [R.], seq 0, ack 3590139508, win 0, length 0
14:04:42.417066 IP 172.16.0.180.58505 > 172.16.0.169.179: Flags [S], seq 1622203016, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
14:04:42.417117 IP 172.16.0.169.179 > 172.16.0.180.58505: Flags [R.], seq 0, ack 1622203017, win 0, length 0
14:04:43.417252 IP 172.16.0.181.45411 > 172.16.0.169.179: Flags [S], seq 1452743444, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
14:04:43.417301 IP 172.16.0.169.179 > 172.16.0.181.45411: Flags [R.], seq 0, ack 1452743445, win 0, length 0

Even after removing the readinessProbe and livenessProbe from calico-node, the pod gets SIGKILLed, logging this in kubectl describe pod:

Normal   SandboxChanged  8s                 kubelet            Pod sandbox changed, it will be killed and re-created.                                                                                                                                                                                                      

No notable logs were found in the kubelet service; however, this log line from the containerd service might be of interest:

level=info msg="shim disconnected" id=eea5299aa39969b2ba047880955c7aa41d32f3385168a17fe7316ee2b40aa087 namespace=k8s.io

Possible Solution

Suspected issues with calico-node. However, the exact root cause is yet to be determined.

Steps to Reproduce (for bugs)

  1. Deploy Calico on a Kubernetes cluster where one of the nodes (worker00) has multiple NICs.
  2. Observe the calico-node pod status on worker00.
  3. Check the BGP status from a working-calico pod.
  4. Monitor BGP traffic using tcpdump.

Context

This issue has caused network connectivity problems for pods running on the worker00 node, making it a blocking issue. The goal is to have a stable networking environment for pods across all nodes.

Currently, the Calico pods running on the master nodes are healthy, whereas the calico-node on worker00 isn't:

worker00:~$ sudo kubectl get pods -n kube-system -o wide | grep calico
calico-kube-controllers-c6fd89d74-l7jpg   0/1     CrashLoopBackOff   38 (81s ago)     3h24m   10.244.131.190   worker00   <none>           <none>
calico-node-j468d                         0/1     Running            39 (5m7s ago)    3h24m   172.16.0.169     worker00   <none>           <none>
calico-node-ltjd7                         1/1     Running            0                3h18m   172.16.0.181     master01   <none>           <none>
calico-node-rd5b7                         1/1     Running            0                3h24m   172.16.0.180     master00   <none>           <none>
calico-node-wfgtz                         1/1     Running            0                3h24m   172.16.0.182     master02   <none>           <none>

Your Environment

coutinhop commented 1 year ago

Hi @8Iue, thanks for the detailed report! Could you share some more details about your setup, say, the topology and BGP peerings? Is there any chance that the address you used in the can-reach autodetection config could be reached by multiple NICs and thus Calico picks up the wrong one? Could you try using the interface=XXX method to the correct interface to check if this fixes the issue?

8Iue commented 1 year ago

thanks for the feedback @coutinhop!

Topology and BGP Peerings:

Internal IPv4 Addresses:

I've set the IP_AUTODETECTION_METHOD to target the default gateway using can-reach=172.16.0.1. When I run kubectl describe pod for the Calico pod on worker00, it correctly returns the IP value as 172.16.0.169:

Name:                 calico-node-j468d
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 worker00/172.16.0.169
Start Time:           Wed, 23 Aug 2023 11:01:41 +0200
Labels:               controller-revision-hash=76c9f59b5f
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   172.16.0.169
IPs:
  IP:           172.16.0.169
Controlled By:  DaemonSet/calico-node

Given the topology and the IP addresses, there's only one active NIC on worker00 that can reach 172.16.0.1. However, I understand your point about the possibility of multiple NICs being able to reach the specified address. I've also tried setting interface=XXX, but the behaviour is the same.
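
For clarity, this is roughly how the two autodetection settings were applied as environment variables on the calico-node container in the DaemonSet (a sketch, not the exact manifest; eno1 is the NIC shown in the tcpdump above):

            - name: IP_AUTODETECTION_METHOD
              value: "can-reach=172.16.0.1"
            # alternatively, pinning to a specific host NIC:
            # - name: IP_AUTODETECTION_METHOD
            #   value: "interface=eno1"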

8Iue commented 12 months ago

@coutinhop Just an update as I've started again to look into this.

While restarting the Calico resources manually (deleting and recreating them), the Calico kube-controllers pod attempted to start on worker00, and I found these log entries in /var/log/calico/cni/cni.log (Calico's log_file_path parameter):

2023-09-25 19:17:48.678 [INFO][4181011] k8s.go 411: Added Mac, interface name, and active container ID to endpoint ContainerID="cd442b74c97389a8bbe6c81b23cdae13e8ef1ee53e429d4fc602917322f7254d" Namespace="kube-system" Pod="calico-kube-controllers-7b7d9b66c-t2x8x" WorkloadEndpoint="worker00-k8s-calico--kube--controllers--7b7d9b66c--t2x8x-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"worker00-k8s-calico--kube--controllers--7b7d9b66c--t2x8x-eth0", GenerateName:"calico-kube-controllers-7b7d9b66c-", Namespace:"kube-system", SelfLink:"", UID:"ef780853-c94c-4d94-9c18-69db2a4e3312", ResourceVersion:"5723018", Generation:0, CreationTimestamp:time.Date(2023, time.September, 25, 19, 17, 23, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"k8s-app":"calico-kube-controllers", "pod-template-hash":"7b7d9b66c", "projectcalico.org/namespace":"kube-system", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"calico-kube-controllers"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ZZZ_DeprecatedClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"worker00", ContainerID:"cd442b74c97389a8bbe6c81b23cdae13e8ef1ee53e429d4fc602917322f7254d", Pod:"calico-kube-controllers-7b7d9b66c-t2x8x", Endpoint:"eth0", ServiceAccountName:"calico-kube-controllers", IPNetworks:[]string{"10.244.131.175/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.kube-system", "ksa.kube-system.calico-kube-controllers"}, InterfaceName:"calic82c0ab436c", MAC:"56:a7:35:b3:71:62", Ports:[]v3.WorkloadEndpointPort(nil), AllowSpoofedSourcePrefixes:[]string(nil)}}                  

The WorkloadEndpoint name ends with eth0, and the Endpoint parameter has been set to eth0; however, the worker00 node does not have an eth0 interface:

$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether b0:7b:25:f8:07:90 brd ff:ff:ff:ff:ff:ff
    altname enp225s0f0
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether b0:7b:25:f8:07:91 brd ff:ff:ff:ff:ff:ff
    altname enp225s0f1
4: enp161s0f0np0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 5c:6f:69:5d:2f:a0 brd ff:ff:ff:ff:ff:ff
5: enp161s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 5c:6f:69:5d:2f:a1 brd ff:ff:ff:ff:ff:ff
6: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default 
    link/ether 4e:98:af:ec:91:9a brd ff:ff:ff:ff:ff:ff
42297: cali77b90bb33ab@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-305c6573-6d70-2d14-07a8-b54f568be31f
42897: cali6973da3279e@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-4bf77a94-262b-d9aa-2c19-5dcad76add85
42900: calic82c0ab436c@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-09626670-9935-81fa-4280-1d69627ffd5d
35564: cni0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:0d:65:51:2a:6f brd ff:ff:ff:ff:ff:ff
42233: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0

The IP_AUTODETECTION_METHOD parameter was set in the DaemonSet for the calico-node resource, not on the Calico kube-controllers, and I couldn't find an interface-selection parameter in the kube-controllers documentation: https://docs.tigera.io/calico/3.24/reference/kube-controllers/configuration

Additionally, this error happens only for pods attempting to start on the worker00 node, which select the wrong eth0 interface.

Here are some additional logs (pod names excluded) for further info:

$ grep "2023-09-25" /var/log/calico/cni/cni.log | grep -vE "controllers|runner|cert--manager" | grep eth0

2023-09-25 23:57:53.148 [INFO][223069] dataplane_linux.go 524: Deleting workload's device in netns. ContainerID="a86aebb592ea77f2d2d5312ad486048a4d96f9cc3017bd37cc86e5e4a5bf34c2" iface="eth0" netns="/var/run/netns/cni-41dca964-e6c0-ce17-a4cd-31f0e92f29c9"
2023-09-25 23:57:53.148 [INFO][223069] dataplane_linux.go 535: Entered netns, deleting veth. ContainerID="a86aebb592ea77f2d2d5312ad486048a4d96f9cc3017bd37cc86e5e4a5bf34c2" iface="eth0" netns="/var/run/netns/cni-41dca964-e6c0-ce17-a4cd-31f0e92f29c9"
2023-09-25 23:57:53.209 [INFO][223069] dataplane_linux.go 569: Deleted device in netns. ContainerID="a86aebb592ea77f2d2d5312ad486048a4d96f9cc3017bd37cc86e5e4a5bf34c2" after=61.2644ms iface="eth0" netns="/var/run/netns/cni-41dca964-e6c0-ce17-a4cd-31f0e92f29c9"
2023-09-25 23:57:53.438 [INFO][223139] dataplane_linux.go 520: CleanUpNamespace called with no netns name, ignoring. ContainerID="4c6546205912a20a2e31a9b5fccbab87f993da3516eb7e3a2b813fa7cc3e6d8a" iface="eth0" netns=""
2023-09-25 23:57:54.104 [INFO][223204] dataplane_linux.go 520: CleanUpNamespace called with no netns name, ignoring. ContainerID="a86aebb592ea77f2d2d5312ad486048a4d96f9cc3017bd37cc86e5e4a5bf34c2" iface="eth0" netns=""
2023-09-25 23:57:54.248 [INFO][223270] dataplane_linux.go 520: CleanUpNamespace called with no netns name, ignoring. ContainerID="4c6546205912a20a2e31a9b5fccbab87f993da3516eb7e3a2b813fa7cc3e6d8a" iface="eth0" netns=""

These logs get repeated, but I believe these are only symptoms of the wrong interface selection.

caseydavenport commented 11 months ago

The WorkloadEndpoint parameter is appending to the end eth0, and the Endpoint parameter has been set to eth0, however, the worker00 node does not have an eth0 interface:

I think this is a red herring - the WorkloadEndpoint eth0 is referring to the name of the interface inside of your container, which will be eth0. It's not tied to the host interfaces at all.

Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

This suggests that BIRD is failing to start on that node, which would explain why pods on that node are not functioning. I think this is what we need to diagnose.

Do you have a full calico-node log output (e.g., from kubectl logs) from the affected node I could take a look at? That might have a clue as to why BIRD is crashing or failing to start.
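
Something along these lines should capture it (pod name taken from your earlier kubectl get pods output; adjust if the pod has been recreated):

master00~$ kubectl logs -n kube-system calico-node-j468d -c calico-node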

8Iue commented 11 months ago

@caseydavenport Thanks for looking into this, here's the logs of the affected calico-node-<id> pod: http://logpaste.com/vubNfLZm

And also, here are the events happening on the pod:

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       6m19s                  default-scheduler  Successfully assigned kube-system/calico-node-b24lj to worker00
  Normal   Started         6m15s                  kubelet            Started container calico-node
  Normal   Created         6m15s                  kubelet            Created container calico-node
  Warning  Unhealthy       6m14s                  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Normal   Killing         6m14s                  kubelet            Stopping container calico-node
  Normal   SandboxChanged  6m12s                  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          6m12s (x2 over 6m19s)  kubelet            Container image "docker.io/calico/cni:v3.24.5" already present on machine
  Normal   Created         6m12s (x2 over 6m19s)  kubelet            Created container upgrade-ipam
  Normal   Started         6m12s (x2 over 6m19s)  kubelet            Started container upgrade-ipam
  Normal   Started         6m11s (x2 over 6m18s)  kubelet            Started container install-cni
  Normal   Pulled          6m11s (x2 over 6m18s)  kubelet            Container image "docker.io/calico/cni:v3.24.5" already present on machine
  Normal   Created         6m11s (x2 over 6m18s)  kubelet            Created container install-cni
  Normal   Created         6m9s (x2 over 6m16s)   kubelet            Created container mount-bpffs
  Normal   Started         6m9s (x2 over 6m16s)   kubelet            Started container mount-bpffs
  Normal   Pulled          6m9s (x2 over 6m16s)   kubelet            Container image "docker.io/calico/node:v3.24.5" already present on machine
  Normal   Pulled          6m8s (x2 over 6m15s)   kubelet            Container image "docker.io/calico/node:v3.24.5" already present on machine
  Warning  BackOff         71s (x31 over 5m55s)   kubelet            Back-off restarting failed container

calico-node -show-status from a working calico-node pod

For some more info, I've run the calico-node -show-status command on a calico-node pod that is working fine (for instance, one running on a master node). While the affected calico-node pod on the worker node is still running, the BGP connections show as Established:

bird v4 BGP peers
+--------------+-----------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE |   SINCE    |  BGPSTATE   |
+--------------+-----------+-------+------------+-------------+
| 172.16.0.180 | Mesh      | up    | 2023-09-26 | Established |
| 172.16.0.181 | Mesh      | up    | 2023-09-26 | Established |
| 172.16.0.169 | Mesh      | up    | 09:14:00   | Established |
+--------------+-----------+-------+------------+-------------+

Once the affected calico-node gets killed and goes back into CrashLoopBackOff, the BGP peer drops to Active:

bird v4 BGP peers
+--------------+-----------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE |   SINCE    |  BGPSTATE   |
+--------------+-----------+-------+------------+-------------+
| 172.16.0.180 | Mesh      | up    | 2023-09-26 | Established |
| 172.16.0.181 | Mesh      | up    | 2023-09-26 | Established |
| 172.16.0.169 | Mesh      | start | 09:16:01   | Active      |
+--------------+-----------+-------+------------+-------------+

Changed livenessProbe

I've also found this open issue, which has an open PR: https://github.com/projectcalico/calico/issues/7472 - I've then tried configuring this livenessProbe command on the calico-node DaemonSet and the calico-kube-controllers Deployment:

          livenessProbe:
            exec:
              command:
                - bash
                - -c
                - /bin/calico-node -felix-live -bird-live && /usr/bin/birdcl -s /var/run/calico/bird.ctl show status

But again, no luck.

caseydavenport commented 11 months ago

It does appear from both the show-status output and the logs that BIRD is actually starting up and running successfully, so it's odd that the probe reports that it cannot connect to BIRD. That suggests this might just be a problem with the probe itself, though it seems strange that it would happen on only one node.

Is this a fresh install? Or has it been upgraded from a previous version? Are all nodes running the same version of Calico?

One thing to try would be to just remove the livenessProbe altogether and see if Calico stabilizes. If so, it further suggests that it's just a probe issue of some sort.
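
A quick way to try that without hand-editing the manifest could be a JSON patch along these lines (a sketch; it assumes calico-node is the first container in the DaemonSet pod spec):

master00~$ kubectl -n kube-system patch ds calico-node --type=json -p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'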

8Iue commented 11 months ago

@caseydavenport thanks for the input!


NOTE: The firewall is disabled for the time being, although iptables has been automatically updated with new Kubernetes rules after reconfiguring and resetting the worker node.


It seems strange that it would happen just on one node.

Correct, and this node is the only physical node of the two clusters that we have (staging and production).

This physical worker has multiple interfaces as mentioned before, and the calico-node-DaemonSet has been configured to use the IP_AUTODETECTION_METHOD to overcome this possible issue.


Is this a fresh install? Or has it been upgraded from a previous version? Are all nodes running the same version of Calico?

This is a fresh install in our Production environment, and the worker node (worker00) has been reset (many times) with kubeadm reset; I've also removed all the k8s and Calico CNI resources from the worker node multiple times by running this:

master00~$ kubectl cordon worker00
master00~$ kubectl drain worker00 --ignore-daemonsets --delete-emptydir-data
master00~$ kubectl delete node worker00

worker00~$ kubeadm reset --cri-socket "unix:///run/containerd/containerd.sock" -f
worker00~$ rm -rf /var/lib/etcd /var/lib/kubelet /var/lib/dockershim /var/run/kubernetes /var/lib/cni /etc/cni /opt/cni /etc/kubernetes
worker00~$ shutdown -r now

Upon rebooting the system, all routes and network configuration had been removed.

Also, all Kubernetes packages have been APT-pinned to the same versions across all nodes (staging and production).
This is the Docker image digest of calico/node: docker.io/calico/node@sha256:5972ad2bcbdc668564d3e26960c9c513b2d7b05581c704747cf7c62ef3a405a6

Removed livenessProbe + readinessProbe

I've been troubleshooting this for the past... hours - and I did indeed try to completely remove the livenessProbe and readinessProbe from the calico-node DaemonSet, but all I'm getting is a SandboxChanged event that for some reason is killing the pod:

Events:                                                                                                                                                                                                                                                                         
  Type     Reason          Age                  From               Message                                                                                                                                                                                                      
  ----     ------          ----                 ----               -------                                                                                                                                                                                                      
  Normal   Scheduled       14m                  default-scheduler  Successfully assigned kube-system/calico-kube-controllers-54fd75b97c-g48bc to worker00                                                                                                                       
  Normal   Created         14m (x2 over 14m)    kubelet            Created container calico-kube-controllers                                                                                                                                                                    
  Normal   Started         14m (x2 over 14m)    kubelet            Started container calico-kube-controllers                                                                                                                                                                    
  Normal   SandboxChanged  13m (x2 over 14m)    kubelet            Pod sandbox changed, it will be killed and re-created.                                                                                                                                                       
  Warning  BackOff         13m (x4 over 13m)    kubelet            Back-off restarting failed container                                                                                                                                                                         
  Normal   Killing         8m21s (x5 over 14m)  kubelet            Stopping container calico-kube-controllers                                                                                                                                                                   
  Normal   Pulled          3m56s (x7 over 14m)  kubelet            Container image "docker.io/calico/kube-controllers:v3.24.5" already present on machine

For extra information, this is the Docker Image being used in the calico-node container: docker.io/calico/node@sha256:5972ad2bcbdc668564d3e26960c9c513b2d7b05581c704747cf7c62ef3a405a6

And in kubectl describe, the exit code was 0, but again, it seems that the container's main process is exiting for whatever reason.

caseydavenport commented 11 months ago

I've been troubleshooting this for the past... hours - and I did indeed try to completely remove the livenessProbe and readinessProbe from the calico-node DaemonSet, but all I'm getting is a SandboxChanged event that for some reason is killing the pod:

This is at least a clue that it's not terminating because of the probes; rather, something is actually going wrong.

If you use kubectl logs passing the -p argument to retrieve the logs from the last running container, is there anything interesting at the very end? Afraid the logpaste link above has expired and I can't recall if it included the -p arg.
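
For example, something like this (pod name taken from your earlier events output; adjust to the current pod):

master00~$ kubectl logs -n kube-system calico-node-b24lj -c calico-node -p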

Correct, and this node is the only physical node of the two clusters that we have (staging and production).

This does seem like a suspicious difference between this node and the others, but everything you've described so far in terms of configuration and process seems good to me.

8Iue commented 11 months ago

If you use kubectl logs passing the -p argument to retrieve the logs from the last running container, is there anything interesting at the very end? Afraid the logpaste link above has expired and I can't recall if it included the -p arg.

Ah sorry, the link expired and I didn't realise it would.

The logs almost always end with the same last line:

2023-09-28 09:07:03.157 [INFO][97] felix/health.go 242: Overall health status changed newStatus=&health.HealthReport{Live:true, Ready:true, Detail:"+------------------+---------+----------------+-----------------+--------+\n|    COMPONENT     | TIMEOUT |    LIVENESS    |    READINESS    | DETAIL |\n+------------------+---------+----------------+-----------------+--------+\n| async_calc_graph | 20s     | reporting live | reporting ready |        |\n| felix-startup    | 0s      | reporting live | reporting ready |        |\n| int_dataplane    | 1m30s   | reporting live | reporting ready |        |\n+------------------+---------+----------------+-----------------+--------+"}

This seems to indicate that Felix is healthy and listening, but the pod gets killed right after that.

Here are the logs: https://pastebin.com/MQ8CZVrS

This does seem like a suspicious difference between this node and the others, but everything you've described so far in terms of configuration and process seems good to me.

Correct. The staging cluster also has the same configuration, as we deploy Kubernetes via Ansible, yet the outcome differs between this node and the other 7 nodes.


I've also started testing whether adding a new worker01 node, which is simply a VM with a single network interface attached, would show the same issue.

master00~$ kubeadm token create --print-join-command
<command_here>

worker01~$  kubeadm join <endpoint> --token <token> --discovery-token-ca-cert-hash <hash>

The <endpoint> is reachable, so the join works fine:

master00~$ kubectl get nodes -o wide
NAME       STATUS   ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master00   Ready    control-plane,master   42d     v1.23.4   172.16.0.180   <none>        Ubuntu 22.04.1 LTS   5.15.0-73-generic   containerd://1.7.2
master01   Ready    control-plane,master   42d     v1.23.4   172.16.0.181   <none>        Ubuntu 22.04.1 LTS   5.15.0-73-generic   containerd://1.7.2
master02   Ready    control-plane,master   42d     v1.23.4   172.16.0.182   <none>        Ubuntu 22.04.1 LTS   5.15.0-73-generic   containerd://1.7.2
worker00   Ready    <none>                 5d19h   v1.23.4   172.16.0.169   <none>        Ubuntu 22.04.3 LTS   5.15.0-84-generic   containerd://1.7.2
worker01   Ready    <none>                 144m    v1.23.4   172.16.0.168   <none>        Ubuntu 22.04.3 LTS   5.15.0-84-generic   containerd://1.7.2

However, Calico is not working on this new VM either, though it seems I'm hitting a different issue now:

2023-10-04 11:42:14.337 [FATAL][22] tunnel-ip-allocator/allocateip.go 229: failed to fetch node resource 'worker01' error=Get "https://10.96.0.1:443/api/v1/nodes/worker01": dial tcp 10.96.0.1:443: connect: connection refused
Calico node failed to start

This is definitely not related to the initial error, so let's skip it for now (I'll troubleshoot it separately, but I think it's firewall related).

8Iue commented 11 months ago

The calico-node pod on worker01 has the same issue, stuck in CrashLoopBackOff, with pretty much the same logs: https://pastebin.com/TZmbSQtg


Something I didn't notice initially: in the logs from the calico-node pod on worker00 (the physical server), there is this line that I had overlooked:

2023-09-28 09:06:58.591 [INFO][97] felix/int_dataplane.go 1856: attempted to modprobe nf_conntrack_proto_sctp error=exit status 1 output=""

I found this open issue, whose last comment has the same error line: https://github.com/projectcalico/calico/issues/5627#issuecomment-1726075725
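
(As an aside: on recent kernels, SCTP conntrack support is usually built into nf_conntrack rather than shipped as a separate module, in which case this modprobe failure should be harmless. One way to check on the host, assuming a standard Ubuntu kernel config under /boot:)

worker00:~$ grep NF_CT_PROTO_SCTP /boot/config-$(uname -r)
# CONFIG_NF_CT_PROTO_SCTP=y would mean the support is compiled in and the standalone module simply doesn't exist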

8Iue commented 11 months ago

@caseydavenport After many hours of investigating why the calico-node pods were not running, I've finally found out that the Kubernetes Ansible role used for deploying the nodes had a faulty ansible.builtin.lineinfile task that was supposed to change the SystemdCgroup value from false to true in the /etc/containerd/config.toml configuration file; matching cgroup drivers is a requirement for both kubelet and containerd: https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd-systemd

This misconfiguration led to an unhealthy state of pods running on the worker nodes, with no apparent oddities in the logs, masking the underlying problem.
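
For reference, the relevant stanza in /etc/containerd/config.toml should end up looking roughly like this (the exact plugin path can vary between containerd versions):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true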

A manual override setting the SystemdCgroup value to true, followed by restarting the containerd and kubelet services and restarting the Calico DaemonSet and Deployment resources in Kubernetes, resolved the issue:

# to run on both worker nodes
worker0[0,1]$ systemctl restart containerd
worker0[0,1]$ systemctl restart kubelet

# to run on the master node
master00$ kubectl rollout restart deployment -n kube-system calico-kube-controllers
master00$ kubectl rollout restart ds -n kube-system calico-node

The calico-node pods transitioned to a healthy state, resolving the CrashLoopBackOff and restoring pod connectivity within the cluster on the affected worker nodes.

The pods have been running for several hours now without issues:

kubectl get pods -A -o wide
NAMESPACE     NAME                                       READY   STATUS    RESTARTS       AGE   IP               NODE       NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-5b7984c9f4-j8pvs   1/1     Running   0              9h    10.244.131.129   worker00   <none>           <none>
kube-system   calico-node-9rtrc                          1/1     Running   1 (14h ago)    14h   172.16.0.168     worker01   <none>           <none>
kube-system   calico-node-s2cvc                          1/1     Running   0              14h   172.16.0.182     master02   <none>           <none>
kube-system   calico-node-tj72m                          1/1     Running   0              14h   172.16.0.180     master00   <none>           <none>
kube-system   calico-node-wdjtg                          1/1     Running   0              14h   172.16.0.181     master01   <none>           <none>
kube-system   calico-node-zlvdg                          1/1     Running   0              14h   172.16.0.169     worker00   <none>           <none>
kube-system   coredns-64897985d-fpmqz                    1/1     Running   0              14h   10.244.5.5       worker01   <none>           <none>
kube-system   coredns-64897985d-tmvnp                    1/1     Running   0              14h   10.244.131.145   worker00   <none>           <none>
kube-system   etcd-master00                              1/1     Running   159            42d   172.16.0.180     master00   <none>           <none>
kube-system   etcd-master01                              1/1     Running   0              42d   172.16.0.181     master01   <none>           <none>
kube-system   etcd-master02                              1/1     Running   0              42d   172.16.0.182     master02   <none>           <none>
kube-system   kube-apiserver-master00                    1/1     Running   213            42d   172.16.0.180     master00   <none>           <none>
kube-system   kube-apiserver-master01                    1/1     Running   4              42d   172.16.0.181     master01   <none>           <none>
kube-system   kube-apiserver-master02                    1/1     Running   4              42d   172.16.0.182     master02   <none>           <none>
kube-system   kube-controller-manager-master00           1/1     Running   17 (42d ago)   42d   172.16.0.180     master00   <none>           <none>
kube-system   kube-controller-manager-master01           1/1     Running   4              42d   172.16.0.181     master01   <none>           <none>
kube-system   kube-controller-manager-master02           1/1     Running   4              42d   172.16.0.182     master02   <none>           <none>
kube-system   kube-proxy-jnjm5                           1/1     Running   0              14h   172.16.0.169     worker00   <none>           <none>
kube-system   kube-proxy-l594r                           1/1     Running   0              42d   172.16.0.181     master01   <none>           <none>
kube-system   kube-proxy-mwr2v                           1/1     Running   0              42d   172.16.0.180     master00   <none>           <none>
kube-system   kube-proxy-rhxg4                           1/1     Running   0              14h   172.16.0.168     worker01   <none>           <none>
kube-system   kube-proxy-wvvj9                           1/1     Running   0              42d   172.16.0.182     master02   <none>           <none>
kube-system   kube-scheduler-master00                    1/1     Running   17 (42d ago)   42d   172.16.0.180     master00   <none>           <none>
kube-system   kube-scheduler-master01                    1/1     Running   4              42d   172.16.0.181     master01   <none>           <none>
kube-system   kube-scheduler-master02                    1/1     Running   4              42d   172.16.0.182     master02   <none>           <none>
kube-system   traefik-ingress-754c7db4cb-7f748           1/1     Running   0              42d   172.16.0.180     master00   <none>           <none>

Thank you again for your help and investigation on this. I honestly didn't even think to check this parameter, as it was already set to true on all the other nodes.

I'm closing this issue, as this is not related to a Calico error.