projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/

VXLAN interface in permanently bad state on node and does not autorecover even with calico-node restart #4468

Open bradbehle opened 3 years ago

bradbehle commented 3 years ago

Pod traffic was not successfully flowing to a node. Specifically, I was failing to reach a port that a pod was listening on via the pod IP. It is a simple HTTP server that just returns a 200 when hit. I originally saw timeouts from pods on multiple different nodes trying to contact this pod IP.

bash-4.4$ curl http://172.16.129.27:20000
curl: (7) Failed to connect to 172.16.129.27 port 20000: Connection timed out

I jumped on the node hosting the pod and noticed that the VXLAN overlay interface was permanently stuck in a DOWN state:

[root@ip-172-31-70-156 /]# ip addr
7: vxlan.calico: <BROADCAST,MULTICAST> mtu 8951 qdisc noqueue state DOWN group default 
    link/ether 66:0e:8d:3d:df:29 brd ff:ff:ff:ff:ff:ff

I also saw via tcpdump on eth0 that the encapsulated VXLAN pod traffic was arriving but being dropped because of it:

tcpdump -nn -i eth0 port 4789
IP 172.16.101.43.36584 > 172.16.129.28.20000: Flags [S], seq 2267978307, win 26733, options [mss 8911,sackOK,TS val 10837704 ecr 0,nop,wscale 9], length 0
00:23:35.523455 IP 172.31.75.107.60135 > 172.31.70.156.4789: VXLAN, flags [I] (0x08), vni 4096

I tried restarting the Calico pods (both typha and calico-node) to no avail. What I ultimately had to do was run ip link delete vxlan.calico manually on the node (running ip link set vxlan.calico up did not work) and then restart the calico-node pod. After that it spun up and began processing traffic.
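For reference, the manual recovery amounted to roughly the following (the pod label and namespace below are what a standard manifest install uses and may differ, e.g. calico-system for operator-based installs):

# On the affected node: remove the stuck VXLAN device so Felix can recreate it
ip link delete vxlan.calico

# Restart the calico-node pod scheduled on that node so it reprograms the dataplane
kubectl -n kube-system delete pod -l k8s-app=calico-node --field-selector spec.nodeName=<node-name>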

Here are the calico-node logs.

calico-container-logs-after-restart.txt

We see log lines such as:

2021-03-14 00:09:32.557 [WARNING][44] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4

indicating that calico-node knows that the interface is down or in a bad state, but it does not try to recreate it.

Expected Behavior

calico-node should be able to determine that the VXLAN interface is permanently down and recover that interface.

Current Behavior

When the VXLAN interface is in a bad state, pod traffic to and from that node is silently dropped, and it is hard to determine what the problem is. Restarting calico-node does not fix the problem.

Possible Solution

Have calico-node check the state of the interface and, if it is in a bad state, recreate the interface.

Steps to Reproduce (for bugs)

The key is to get the VXLAN interface into a bad state, but we have not figured out how to do that. Possibly just putting the interface into a DOWN state is enough to reproduce this? (A sketch of that attempt follows below.)
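In case it helps, that attempt could look something like this (a rough sketch, assuming the device name vxlan.calico): force the interface down, then watch whether Felix brings it back on its own:

# Force the VXLAN device down to simulate the failure
ip link set vxlan.calico down

# Watch whether calico-node/Felix brings the interface back up on its own
watch -n 2 ip -br link show vxlan.calico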

Context

When this happens, the affected node in our k8s cluster is isolated from the Calico pod network. We are not notified, and must take manual steps to find, troubleshoot, and then recover the affected node.

Your Environment

song-jiang commented 3 years ago

@bradbehle Thanks for the details! I can see from the log that the VXLAN tunnel device has been configured:

2021-03-14 00:12:33.539 [INFO][44] vxlan_mgr.go 497: Assigning address to VXLAN device address=172.16.129.0/32
2021-03-14 00:12:33.539 [INFO][44] vxlan_mgr.go 355: VXLAN tunnel device configured

This implies the device is up: https://github.com/projectcalico/felix/blob/release-v3.13/dataplane/linux/vxlan_mgr.go#L462

Not sure why it is stuck in a DOWN state. If you simply bring the interface DOWN, Calico should bring it up again. I think it will be hard to debug unless you can reproduce the exact same scenario.

bradbehle commented 3 years ago

@song-jiang Thanks for looking at this. I don't know how to get the node into this state again, so yes, I agree it would be hard to troubleshoot exactly why this is happening. I was just thinking that since calico-node can detect when this happens, and puts an error in the log like this:

2021-03-14 00:09:32.557 [WARNING][44] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4

that calico-node could be improved to instead fix the interface (maybe just bring it back up, or delete it and recreate it) as a way to auto-recover.

relyt0925 commented 3 years ago

We had another instance of this today after a reboot.

relyt0925 commented 3 years ago

This happened after a reboot.

relyt0925 commented 3 years ago

Is there any way we could detect that the VXLAN interface has been DOWN for X amount of time and, if so, delete it (so it gets automatically recreated) as at least a mitigation? A rough sketch of the idea is below.
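For example (just a sketch; it assumes the device is named vxlan.calico and that deleting it is safe because Felix recreates it on its next resync), a node-level watchdog along these lines could serve as a stopgap outside of Calico itself:

#!/usr/bin/env bash
# Rough watchdog: if vxlan.calico stays DOWN too long, delete it so Felix recreates it.
IFACE=vxlan.calico
THRESHOLD=300   # seconds the interface may stay DOWN before we intervene
down_since=""

while sleep 30; do
    state=$(ip -br link show "$IFACE" 2>/dev/null | awk '{print $2}')
    if [ "$state" != "DOWN" ]; then
        down_since=""            # up (or absent): reset the timer
        continue
    fi
    now=$(date +%s)
    : "${down_since:=$now}"      # remember when we first saw it DOWN
    if [ $((now - down_since)) -ge "$THRESHOLD" ]; then
        echo "$IFACE has been DOWN for over ${THRESHOLD}s, deleting it so Felix can recreate it"
        ip link delete "$IFACE"
        down_since=""
    fi
done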

mgleung commented 3 years ago

@relyt0925 that sounds like a possible solution. I think we need to investigate what we can do with the library we're using to manage the VXLAN interfaces to say for sure though.

I wanted to check in with you on steps to replicate the issue. I would have thought that rebooting the node (or manually bringing up the VXLAN interface) would fix the issue. When you mention that this occurred again after you rebooted, do you mean rebooting the calico-node pods or the node itself?

relyt0925 commented 3 years ago

@mgleung the node itself was rebooted. After the reboot, although the calico-node pod went healthy and was 1/1 Running, the VXLAN interface was permanently in a DOWN state. No command like ip link set vxlan.calico up would fix it either; however, once the interface was deleted, a new VXLAN interface was created on the next Calico reconciliation loop and everything worked.

It's a little nasty because, although everything looks healthy (liveness and readiness probes pass locally on the node), no traffic can traverse the SDN, which can cause hard-to-find DNS issues, traffic routing issues, etc.

dani-CO-CN commented 2 years ago

After hours of debugging I finally found this issue, thank you! :pray:
I am using microk8s, and on all nodes the VXLAN interfaces were down. The behavior for me was exactly like @relyt0925 described: traffic on the same node was no problem and all pods showed healthy, but there was no traffic between pods on different nodes. I deleted the VXLAN interface (ip link delete vxlan.calico), which fixed it.

relyt0925 commented 2 years ago

Happy to help @dani-CO-CN. We also saw another occurrence of this today when upgrading RHEL 8 nodes.

clowa commented 6 months ago

Had a similar problem after reinstalling microk8s. Calico constantly failed to refresh the route table. Solved it by deleting vxlan.calico and all of the calicoXXX interfaces via ip link delete <LINK>.
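For anyone who wants to script that cleanup, something along these lines should work (just a sketch; it assumes the standard cali interface-name prefix, and note that deleting the workload-side interfaces also cuts off running pods until they are recreated):

# Delete the stuck VXLAN device plus all Calico workload interfaces.
# Felix recreates vxlan.calico; pod veths come back when the pods are recreated.
ip link delete vxlan.calico
for link in $(ip -o link show | awk -F': ' '{print $2}' | cut -d'@' -f1 | grep '^cali'); do
    ip link delete "$link"
done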

UmanShahzad commented 2 months ago

Also ran into this issue - everything looks totally healthy, except routing just doesn't work. Doing ip link set up didn't work; I had to delete the interface and let the tigera operator re-create it. That worked, but this should probably be automated?