Open bradbehle opened 3 years ago
@bradbehle Thanks for the details! I can see from the log that the VXLAN tunnel device has been configured.
2021-03-14 00:12:33.539 [INFO][44] vxlan_mgr.go 497: Assigning address to VXLAN device address=172.16.129.0/32
2021-03-14 00:12:33.539 [INFO][44] vxlan_mgr.go 355: VXLAN tunnel device configured
This implies the device is up: https://github.com/projectcalico/felix/blob/release-v3.13/dataplane/linux/vxlan_mgr.go#L462
Not sure why it is stuck in the "DOWN" state. If you simply bring the interface DOWN, Calico should bring it up again. I think it will be hard to debug unless you can repro the exact same scenario.
@song-jiang Thanks for looking at this. I don't know how to get the node into this state again, so yes, I agree it would be hard to troubleshoot exactly why this is happening. I was just thinking that since calico-node can detect when this happens, and logs a warning like this:
2021-03-14 00:09:32.557 [WARNING][44] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4
that calico-node could be improved to instead fix the interface (maybe just bring it back up, or delete it and recreate it) as a way to auto-recover.
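For illustration only (this is not existing calico-node behavior, and the helper name is made up), an auto-recovery along those lines could look roughly like the sketch below, using the vishvananda/netlink library that Felix already uses for interface management: try the cheap fix of setting the device UP first, and fall back to deleting it so the reconciliation loop recreates it.

```go
package main

import (
	"fmt"
	"time"

	"github.com/vishvananda/netlink"
)

// recoverVXLANDevice is a hypothetical helper: first try setting the device
// administratively UP; if it is still operationally DOWN shortly afterwards,
// delete it so the next reconciliation recreates it from scratch.
func recoverVXLANDevice(name string) error {
	link, err := netlink.LinkByName(name)
	if err != nil {
		return fmt.Errorf("looking up %s: %w", name, err)
	}
	if err := netlink.LinkSetUp(link); err != nil {
		return fmt.Errorf("setting %s up: %w", name, err)
	}
	// Give the kernel a moment to update the operational state.
	time.Sleep(2 * time.Second)
	link, err = netlink.LinkByName(name)
	if err != nil {
		return fmt.Errorf("re-checking %s: %w", name, err)
	}
	if link.Attrs().OperState == netlink.OperDown {
		// Still DOWN after being set UP: remove it and let Felix recreate it.
		return netlink.LinkDel(link)
	}
	return nil
}

func main() {
	if err := recoverVXLANDevice("vxlan.calico"); err != nil {
		fmt.Println("recovery failed:", err)
	}
}
```

(The "set UP first" step may well be a no-op given the reports below that `ip link set ... up` did not help, in which case the delete fallback is the part that matters.)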
We had another instance of this today, after a reboot.
Is there any way we could see if the VXLAN interface has been DOWN for X amount of time and, if so, delete it (so it gets automatically recreated), at least as a mitigation?
@relyt0925 that sounds like a possible solution. I think we need to investigate what we can do with the library we're using to manage the VXLAN interfaces to say for sure though.
I wanted to check in with you on steps to replicate the issue. I would have thought that rebooting the node (or manually bringing up the VXLAN interface) would fix the issue. When you mention that this occurred again after you rebooted, do you mean rebooting the calico-node pods or the node itself?
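To make the "DOWN for X amount of time" idea concrete, here is a rough sketch (hypothetical names, arbitrary two-minute threshold, not existing Felix code) of a periodic check that deletes the device only after it has stayed operationally DOWN past a threshold, so that a momentary DOWN right after creation is not mistaken for this failure:

```go
package main

import (
	"time"

	"github.com/vishvananda/netlink"
)

const vxlanDownThreshold = 2 * time.Minute // the "X amount of time"

// vxlanDownSince records when the device was first seen DOWN; the zero value
// means it was healthy (or absent) at the previous check.
var vxlanDownSince time.Time

// checkVXLANDevice is a hypothetical periodic check, e.g. run from an
// existing sync loop or a ticker.
func checkVXLANDevice(name string) error {
	link, err := netlink.LinkByName(name)
	if err != nil {
		// Device missing: normal reconciliation will recreate it.
		vxlanDownSince = time.Time{}
		return nil
	}
	if link.Attrs().OperState != netlink.OperDown {
		// Treat anything other than an explicit DOWN as healthy here.
		vxlanDownSince = time.Time{}
		return nil
	}
	if vxlanDownSince.IsZero() {
		vxlanDownSince = time.Now()
		return nil
	}
	if time.Since(vxlanDownSince) < vxlanDownThreshold {
		return nil
	}
	// DOWN for longer than the threshold: delete the device so it is recreated.
	vxlanDownSince = time.Time{}
	return netlink.LinkDel(link)
}

func main() {
	for range time.Tick(30 * time.Second) {
		_ = checkVXLANDevice("vxlan.calico")
	}
}
```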
@mgleung the node itself was rebooted. After the reboot, although the calico-node pod went healthy and was 1/1 Running, the vxlan interface was permanently in the DOWN state. No command like `ip link set vxlan.calico up` would fix it either; however, once the interface was deleted, the next calico reconciliation loop created a new vxlan interface and everything worked.
It's a little nasty because although everything looks healthy (liveness and readiness probes pass locally on the node), no traffic can traverse the SDN (which can cause hard-to-find DNS issues, traffic routing issues, etc.).
After hours of debugging I finally found this issue, thank you! :pray:
I am using microk8s, and on all nodes the vxlan interfaces were down. The behavior for me was exactly like @relyt0925's: traffic on the same node was no problem and all pods showed healthy, but there was no traffic between pods on different nodes. I deleted the vxlan interface (`ip link delete vxlan.calico`), which fixed it.
Happy to help @dani-CO-CN. We also saw another occurrence of this today when upgrading RHEL 8 nodes.
Had a similar problem after reinstalling microk8s. Calico failed constantly to refresh the route table. Solved it by deleting the `vxlan.calico` and all `calicoXXX` interfaces via `ip link delete <LINK>`.
Also ran into this issue - everything looks totally healthy except routing just doesn't work. Doing `ip link set up` didn't work; I had to delete the interface and let the tigera operator re-create it. That worked, but this should probably be automated?
Pod traffic was not successfully flowing to a node. Specifically, I was failing to hit a port a pod was listening on via the pod IP. It is a simple HTTP server that just returns a 200 when hit. I was originally seeing timeouts from pods on multiple separate nodes trying to contact this pod IP.
I jumped on the node hosting the pod and noticed that the vxlan overlay interface was permanently in a DOWN state, and that a tcpdump of eth0 showed the encapsulated VXLAN pod traffic being dropped because of it.
I tried restarting the calico pods (both typha and node) to no avail. What I ultimately had to do to get it working again was run the following manually on the node:
`ip link delete vxlan.calico`
(running `ip link set vxlan.calico up` did not work) and then restart the calico pod. After that it spun up and began processing traffic. Here are the calico-node logs:
calico-container-logs-after-restart.txt
We see log lines such as:
2021-03-14 00:09:32.557 [WARNING][44] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4
indicating that calico-node knows that the interface is down or in a bad state, but it does not try to recreate it.
Expected Behavior
calico-node should be able to determine if the vxlan interface is down permanently and recover that interface.
Current Behavior
When the vxlan interface is in a bad state, pod traffic to/from that node is simply dropped, and it is hard to determine what the problem is. Restarting calico-node does not fix the problem.
Possible Solution
Have calico-node check the state of the interface and, if it is in a bad state, recreate it.
Steps to Reproduce (for bugs)
The key is to get the vxlan interface into a bad state, but we have not figured out how to do that. Possibly just putting the interface into a "DOWN" state might be enough to reproduce it?
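One way to at least put the interface into a DOWN state (equivalent to `ip link set vxlan.calico down`; no guarantee this reproduces the exact stuck state described above) is a one-liner with the same netlink library:

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Equivalent to: ip link set vxlan.calico down
	link, err := netlink.LinkByName("vxlan.calico")
	if err != nil {
		log.Fatalf("lookup vxlan.calico: %v", err)
	}
	if err := netlink.LinkSetDown(link); err != nil {
		log.Fatalf("set down: %v", err)
	}
}
```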
Context
When this happens, the affected node in our k8s cluster is isolated from the calico pod network, and we are not notified and must take manual steps to find, troubleshoot and then recover the affected node
Your Environment