projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Calico CNI refuses to start on a specific node #9049

Open develeap-daniel opened 2 months ago

develeap-daniel commented 2 months ago

I’m unable to initialize the Calico CNI on a bare-metal node.

The Kubernetes cluster was deployed using kubespray. All nodes are healthy and working properly except for one specific node.

[screenshot not preserved]

The calico-node pod on that node does not start; it is blocked by one of the init containers.

[screenshot not preserved]

I’ve set up busybox pods on that node and on a healthy node to check connectivity to the API server. The pod on the affected node has no intra-cluster connectivity, which is expected since the CNI is not up.

Network details:

Routes Compared:

[screenshot of the route comparison not preserved]

OS: Ubuntu 18.04 (calicoctl approved the node)
Calico version: 3.27.2

caseydavenport commented 2 months ago

Looks like the CNI installer is failing to reach the API server on that node, which should occur using the Service IP from the host network namespace (i.e., not requiring pod networking to be functioning yet).

Is kube-proxy healthy on that node? Are you otherwise able to access the API server via its Service IP (10.233.0.1) from that node?
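
The Service IP check asked about above can be sketched as a plain TCP probe run directly on the affected node (host network namespace). This is only a minimal illustration: the helper name `can_connect` and the 3-second timeout are assumptions, not part of Calico or kubespray, and a successful handshake only shows that kube-proxy's Service NAT and the route to the API server work, not that the API server accepts the request.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within `timeout` seconds."""
    try:
        # create_connection performs the full TCP three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, connection refused, and unreachable networks.
        return False


# Example use on the node (10.233.0.1:443 is the Service IP and port from this thread):
#   print(can_connect("10.233.0.1", 443))
```

If this returns False on the affected node but True on a healthy one, the problem sits in the host's Service routing or NAT path rather than in pod networking.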

develeap-daniel commented 2 months ago

@caseydavenport kube-proxy is healthy, nothing suspicious in the logs either. I really have no idea what is going on

taha-adel commented 2 months ago

Try the command below on the unhealthy node:

telnet 10.233.0.1 443

If it times out, run the command below and send me the output:

sudo conntrack -L | grep 10.233.0.1

develeap-daniel commented 2 months ago

@taha-adel

sudo conntrack -L | grep 10.233.0.1
tcp 6 71 SYN_SENT src=10.233.0.1 dst=10.233.0.1 sport=39366 dport=443 [UNREPLIED] src=10.10.10.9 dst=10.233.0.1 sport=6443 dport=39366 mark=0 use=1
tcp 6 116 SYN_SENT src=10.233.0.1 dst=10.233.0.1 sport=49314 dport=443 [UNREPLIED] src=10.10.10.9 dst=10.233.0.1 sport=6443 dport=49314 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 284 flow entries have been shown.
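
For readers unfamiliar with the format, each conntrack line holds two address tuples: the original direction first, then the expected reply direction. The sketch below splits one of the entries above into those two tuples. The function name `parse_conntrack_line` and the returned dictionary layout are illustrative assumptions, not part of conntrack-tools, and the parser only handles the fields that appear in this output.

```python
SAMPLE = (
    "tcp 6 71 SYN_SENT src=10.233.0.1 dst=10.233.0.1 sport=39366 dport=443 "
    "[UNREPLIED] src=10.10.10.9 dst=10.233.0.1 sport=6443 dport=39366 mark=0 use=1"
)


def parse_conntrack_line(line: str) -> dict:
    """Split one `conntrack -L` line into original and reply tuples."""
    tokens = line.split()
    entry = {
        "proto": tokens[0],
        "state": None,
        "unreplied": "[UNREPLIED]" in tokens,
        "original": {},
        "reply": {},
    }
    for tok in tokens[1:]:
        if "=" in tok:
            key, value = tok.split("=", 1)
            if key in ("src", "dst", "sport", "dport"):
                # The first occurrence of a key belongs to the original
                # direction, the second to the expected reply direction.
                target = entry["original"] if key not in entry["original"] else entry["reply"]
                target[key] = value
        elif tok.isupper() and tok.replace("_", "").isalpha():
            # Bare uppercase word such as SYN_SENT is the TCP state;
            # bracketed flags like [UNREPLIED] are excluded by isalpha().
            entry["state"] = tok
    return entry


entry = parse_conntrack_line(SAMPLE)
print(entry["state"], entry["unreplied"], entry["original"], entry["reply"])
```

Read this way, both entries show a SYN sent to 10.233.0.1:443 whose reply is expected from 10.10.10.9:6443, and `[UNREPLIED]` means no packet has been seen in the reply direction.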

taha-adel commented 1 month ago

@develeap-daniel, I had the same issue, and it was fixed after restarting the unhealthy node.

develeap-daniel commented 1 month ago

@taha-adel We tried restarting the machine, but it didn't help :(

caseydavenport commented 3 weeks ago

Any progress on this issue?