Open youngderekm opened 2 weeks ago
Complete logs attached. logs.txt
@youngderekm thanks for raising this. From a look at the logs, nothing jumps out at me. I also wouldn't typically expect the value of the IP address to be relevant here, but I suppose it could be.
@fasaxc recently did a pretty large refactor of the VXLAN data plane manager code, so it might be worth him taking a look. It's possible that something regressed here.
Expected Behavior
ARP entries remain, allowing traffic between pods to continue.
Current Behavior
Roughly every 2-3 minutes, on some of our machines, calico-node logs that it is "Deleting ARP entry" for a VXLAN tunnel that should still exist (the nodes in the cluster remain up). Traffic then fails for a few minutes (ping from the host logs "Destination Host Unreachable") until calico-node logs that it is recreating the ARP entry, at which point traffic resumes normally. The cycle then repeats.
example error: kube-apiserver failing to communicate with a pod in the other subnet
neighbors when working:
ping works, then starts to fail:
(ARP was removed for 172.17.193.65, 172.17.100.193)
calico-node debug log showing the deletion:
later, it adds the ARP entries back:
routes:
Steps to Reproduce (for bugs)
We have a six node cluster with three control plane nodes in one subnet (10.82.0.0/21) and three workers in another subnet (10.82.10.0/24). tigera-operator config:
From a control plane node, ping the IP of a pod in a service found to log a "no route to host" error, or ping the tunnel IP. The ping will fail after some time. This only happens on some of our nodes, not all of them, even though the hardware and OS version are the same. It has happened on two separate clusters (running on similar hardware), both created with the same automation.
It seems to only happen in the cases where the VXLAN tunnel IP is one higher than the start of the IP block range:
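A quick way to check a given tunnel IP against this pattern is sketched below. It assumes Calico's default IPAM block size of /26 (the actual block size for this cluster isn't shown here), and the function name `is_first_after_block_start` is made up for illustration.

```python
import ipaddress


def is_first_after_block_start(tunnel_ip: str, block_prefix_len: int = 26) -> bool:
    """Return True if tunnel_ip is exactly one higher than the first
    address of the IPAM block that contains it.

    block_prefix_len=26 is an assumption (Calico's default block size);
    adjust it if the IP pool uses a different blockSize.
    """
    ip = ipaddress.ip_address(tunnel_ip)
    # strict=False lets us derive the enclosing block from a host address.
    block = ipaddress.ip_network(f"{tunnel_ip}/{block_prefix_len}", strict=False)
    return int(ip) == int(block.network_address) + 1


if __name__ == "__main__":
    # The two tunnel IPs whose ARP entries were reported as removed.
    for addr in ("172.17.193.65", "172.17.100.193"):
        print(addr, is_first_after_block_start(addr))
```

Under the /26 assumption, both addresses whose ARP entries were removed (172.17.193.65 and 172.17.100.193) fall one above the start of their block, which matches the observation.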
Your Environment