The root cause in our setup seems to be that the VXLAN device is not used, because Calico sets the routes incorrectly:
# ip -6 r
2001:db8::1 dev bpfin.cali metric 1024 pref medium
fd01:cafe::/64 dev eth0 proto ra metric 1024 mtu 65520 hoplimit 255 pref medium
fd80:cafe::5f:4281 dev cali3b862d89810 metric 1024 pref medium
fd80:cafe::5f:4282 dev cali9d386f561d2 metric 1024 pref medium
fd80:cafe::5f:4283 dev cali856dc622126 metric 1024 pref medium
blackhole fd80:cafe::5f:4280/122 dev lo proto 80 metric 1024 pref medium
fd80:cafe::a8:4d00/122 via fd01:cafe::f14c:9fa1:8496:5550 dev eth0 proto 80 metric 1024 onlink pref medium
fd85:cafe::a via 2001:db8::1 dev bpfin.cali src fd01:cafe::4aab:d761:d808:996 metric 1024 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev cali3b862d89810 proto kernel metric 256 pref medium
fe80::/64 dev cali9d386f561d2 proto kernel metric 256 pref medium
fe80::/64 dev bpfout.cali proto kernel metric 256 pref medium
fe80::/64 dev bpfin.cali proto kernel metric 256 pref medium
fe80::/64 dev cali856dc622126 proto kernel metric 256 pref medium
default dev eth0 proto static metric 1000 pref medium
default via fe80::ecee:eeff:feee:eeee dev eth0 proto ra metric 1024 expires 65136sec mtu 65520 hoplimit 255 pref medium
We were expecting the route
fd80:cafe::a8:4d00/122 via fd01:cafe::f14c:9fa1:8496:5550 dev eth0 proto 80 metric 1024 onlink pref medium
to go through the VXLAN interface rather than being routed directly to the second node (fd01...) over eth0.
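For anyone wanting to check their own nodes for the same symptom, here is a rough sketch that scans `ip -6 r` output for remote-block routes that bypass the VXLAN device. It rests on two assumptions that are not confirmed in this thread: that the IPv6 VXLAN device is named `vxlan-v6.calico`, and that `proto 80` marks the routes programmed by Calico (as it appears to in the output above). The function name `routes_bypassing_vxlan` is made up for illustration.

```python
# Assumption: name of Calico's IPv6 VXLAN device; adjust to match your node.
VXLAN_DEV = "vxlan-v6.calico"
# Assumption: "proto 80" identifies Calico-programmed routes, as in the output above.
CALICO_PROTO = "80"


def value_after(tokens, key):
    """Return the token following `key`, or None if `key` is absent or last."""
    if key in tokens:
        idx = tokens.index(key)
        if idx + 1 < len(tokens):
            return tokens[idx + 1]
    return None


def routes_bypassing_vxlan(ip_route_output, vxlan_dev=VXLAN_DEV, proto=CALICO_PROTO):
    """List (destination, device) for routes that have a next hop ("via"),
    carry the given proto, and do not use the VXLAN device."""
    bypassing = []
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if not tokens or "via" not in tokens:
            continue  # local, blackhole, and link-scope routes have no next hop
        if value_after(tokens, "proto") != proto:
            continue  # not programmed by Calico (under the assumption above)
        dev = value_after(tokens, "dev")
        if dev != vxlan_dev:
            bypassing.append((tokens[0], dev))
    return bypassing


# A few lines taken from the route dump above.
sample = """\
blackhole fd80:cafe::5f:4280/122 dev lo proto 80 metric 1024 pref medium
fd80:cafe::a8:4d00/122 via fd01:cafe::f14c:9fa1:8496:5550 dev eth0 proto 80 metric 1024 onlink pref medium
fd85:cafe::a via 2001:db8::1 dev bpfin.cali src fd01:cafe::4aab:d761:d808:996 metric 1024 pref medium
default via fe80::ecee:eeff:feee:eeee dev eth0 proto ra metric 1024 pref medium
"""

for dest, dev in routes_bypassing_vxlan(sample):
    print(f"{dest} is routed via {dev}, not {VXLAN_DEV}")
```

On a live node the input could come from `subprocess.run(["ip", "-6", "route"], capture_output=True, text=True).stdout` instead of the pasted sample.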
@matthewdupre Hopefully that helps. We are currently working around this issue by using Flannel, but would prefer Calico.
DNS was not working, but the underlying issue ran deeper: pod-to-pod traffic in general was not working between pods on different nodes.
Config looked as follows:
Originally posted by @trevex in https://github.com/projectcalico/calico/issues/8811#issuecomment-2421779071