projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

ARP entries are deleted/recreated periodically when using VXLANCrossSubnet, leading to no connectivity #9450

Open · youngderekm opened this issue 2 weeks ago

youngderekm commented 2 weeks ago

Expected Behavior

ARP entries remain, allowing traffic between pods to continue.

Current Behavior

Roughly every 2-3 minutes, on some of our machines, calico-node logs that it is "Deleting ARP entry" for a VXLAN tunnel endpoint that should still exist (the nodes in the cluster remain up). Traffic then fails for a few minutes (ping from the host reports "Destination Host Unreachable") until calico-node logs that it is re-adding the ARP entry and traffic resumes normally, only for the cycle to repeat.

example error: kube-apiserver failing to communicate with a pod in the other subnet:

2024-11-05T18:22:04.844704626-05:00 stderr F E1105 23:22:04.844615       1 watcher.go:567] failed to prepare current and previous objects: conversion webhook for longhorn.io/v1beta2, Kind=Volume failed: Post "https://longhorn-conversion-webhook.longhorn-system.svc:9501/v1/webhook/conversion?timeout=30s": read tcp 10.82.0.26:37966->10.44.238.49:9501: read: no route to host

neighbors when working:

$ ip neigh show dev vxlan.calico
172.17.193.65 lladdr 66:0a:f0:4a:3f:c1 PERMANENT 
172.17.193.64 lladdr 66:0a:f0:4a:3f:c1 PERMANENT 
172.17.78.0 lladdr 66:f4:cd:d2:0b:14 PERMANENT 
172.17.100.193 lladdr 66:cf:57:4a:02:e4 PERMANENT 
172.17.100.192 lladdr 66:cf:57:4a:02:e4 PERMANENT 
172.17.74.0 lladdr 66:71:ff:a7:1c:3e PERMANENT 
172.17.100.0 lladdr 66:ef:d2:b9:6d:dc PERMANENT 

ping works, then starts to fail:

64 bytes from 172.17.193.79: icmp_seq=27 ttl=63 time=0.308 ms
64 bytes from 172.17.193.79: icmp_seq=28 ttl=63 time=0.310 ms
64 bytes from 172.17.193.79: icmp_seq=29 ttl=63 time=0.300 ms
From 172.17.199.128 icmp_seq=30 Destination Host Unreachable
From 172.17.199.128 icmp_seq=31 Destination Host Unreachable
From 172.17.199.128 icmp_seq=32 Destination Host Unreachable

ARP entries were removed for 172.17.193.65 and 172.17.100.193:

$ ip neigh show dev vxlan.calico
172.17.193.65 INCOMPLETE 
172.17.193.64 lladdr 66:0a:f0:4a:3f:c1 PERMANENT 
172.17.78.0 lladdr 66:f4:cd:d2:0b:14 PERMANENT 
172.17.100.192 lladdr 66:cf:57:4a:02:e4 PERMANENT 
172.17.74.0 lladdr 66:71:ff:a7:1c:3e PERMANENT 
172.17.100.0 lladdr 66:ef:d2:b9:6d:dc PERMANENT 
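
To catch the flap as it happens, one can poll the neighbor table for vxlan.calico and report whenever an entry that was PERMANENT changes state or disappears (roughly `watch ip neigh show dev vxlan.calico`, but timestamped so it can be lined up against the calico-node log). A rough diagnostic sketch in Go, assuming the github.com/vishvananda/netlink package is available:

// neighwatch.go: poll the vxlan.calico neighbor table and report when a
// tunnel-endpoint entry stops being PERMANENT or disappears, so the
// connectivity drops can be lined up against calico-node's
// "Deleting ARP entry" log lines. Rough diagnostic sketch only.
package main

import (
	"fmt"
	"time"

	"github.com/vishvananda/netlink"
)

func main() {
	link, err := netlink.LinkByName("vxlan.calico")
	if err != nil {
		panic(err)
	}
	seen := map[string]bool{} // IPs previously seen with a PERMANENT entry
	for {
		neighs, err := netlink.NeighList(link.Attrs().Index, netlink.FAMILY_V4)
		if err != nil {
			panic(err)
		}
		current := map[string]bool{}
		for _, n := range neighs {
			if n.State&netlink.NUD_PERMANENT != 0 {
				current[n.IP.String()] = true
			}
		}
		for ip := range seen {
			if !current[ip] {
				fmt.Printf("%s %s is no longer PERMANENT\n",
					time.Now().Format(time.RFC3339), ip)
			}
		}
		seen = current
		time.Sleep(5 * time.Second)
	}
}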

calico-node debug log showing the entries being deleted:

2024-11-06 17:52:14.361 [DEBUG][268] felix/delta_tracker.go 331: Updated dataplane state. desiredUpdates=0 inDataplaneNotDesired=0 totalNumInDP=5
2024-11-06 18:20:24.128 [DEBUG][268] felix/vxlan_fdb.go 278: Deleting ARP entry. entry=vxlanfdb.ipMACMapping{IP:ip.V4Addr{0xac, 0x11, 0xc1, 0x41}, MAC:net.HardwareAddr{0x66, 0xa, 0xf0, 0x4a, 0x3f, 0xc1}}
2024-11-06 18:20:24.128 [DEBUG][268] felix/vxlan_fdb.go 278: Deleting ARP entry. entry=vxlanfdb.ipMACMapping{IP:ip.V4Addr{0xac, 0x11, 0x64, 0xc1}, MAC:net.HardwareAddr{0x66, 0xcf, 0x57, 0x4a, 0x2, 0xe4}}

later, it adds the ARP entries back:

2024-11-06 18:21:57.187 [DEBUG][268] felix/delta_tracker.go 331: Updated dataplane state. desiredUpdates=0 inDataplaneNotDesired=0 totalNumInDP=5
2024-11-06 18:21:57.187 [DEBUG][268] felix/vxlan_fdb.go 256: Adding ARP/NDP entry. entry=vxlanfdb.ipMACMapping{IP:ip.V4Addr{0xac, 0x11, 0xc1, 0x41}, MAC:net.HardwareAddr{0x66, 0xa, 0xf0, 0x4a, 0x3f, 0xc1}}
2024-11-06 18:21:57.187 [DEBUG][268] felix/vxlan_fdb.go 256: Adding ARP/NDP entry. entry=vxlanfdb.ipMACMapping{IP:ip.V4Addr{0xac, 0x11, 0x64, 0xc1}, MAC:net.HardwareAddr{0x66, 0xcf, 0x57, 0x4a, 0x2, 0xe4}}
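
For reference, the byte dumps in those vxlanfdb.ipMACMapping entries decode to the same two neighbors that drop out of the table above: 172.17.193.65 / 66:0a:f0:4a:3f:c1 and 172.17.100.193 / 66:cf:57:4a:02:e4. A throwaway Go snippet to confirm the decoding (bytes copied from the log lines):

// decode.go: decode the ip.V4Addr / net.HardwareAddr byte dumps from the
// felix debug log into the usual dotted-quad / colon notation.
// Bytes copied from the "Deleting ARP entry" / "Adding ARP/NDP entry" lines.
package main

import (
	"fmt"
	"net"
)

func main() {
	ips := [][4]byte{
		{0xac, 0x11, 0xc1, 0x41},
		{0xac, 0x11, 0x64, 0xc1},
	}
	macs := []net.HardwareAddr{
		{0x66, 0x0a, 0xf0, 0x4a, 0x3f, 0xc1},
		{0x66, 0xcf, 0x57, 0x4a, 0x02, 0xe4},
	}
	for i := range ips {
		// Prints 172.17.193.65 -> 66:0a:f0:4a:3f:c1 and
		// 172.17.100.193 -> 66:cf:57:4a:02:e4.
		fmt.Printf("%s -> %s\n", net.IP(ips[i][:]), macs[i])
	}
}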

routes:

$ ip r
default via 10.82.0.1 dev enX0 proto static metric 100 
10.82.0.0/21 dev enX0 proto kernel scope link src 10.82.0.26 metric 100 
172.17.74.0/26 via 10.82.0.27 dev enX0 proto 80 onlink 
172.17.78.0/26 via 10.82.0.28 dev enX0 proto 80 onlink 
172.17.100.0/26 via 172.17.100.0 dev vxlan.calico onlink 
172.17.100.192/26 via 172.17.100.193 dev vxlan.calico onlink 
172.17.193.64/26 via 172.17.193.65 dev vxlan.calico onlink 
blackhole 172.17.199.128/26 proto 80 
172.17.199.130 dev cali93bca0eec34 scope link 
172.17.199.131 dev calie05465e654b scope link 
172.17.199.132 dev caliafadf6b07b8 scope link 
172.17.199.133 dev cali1d44a92c20d scope link 

Steps to Reproduce (for bugs)

We have a six-node cluster with three control plane nodes in one subnet (10.82.0.0/21) and three workers in another subnet (10.82.10.0/24). tigera-operator config:

installation:
  enabled: true
  typhaMetricsPort: 9093
  logging:
    cni:
      logSeverity: Debug
  calicoNetwork:
    ipPools:
    - blockSize: 26
      cidr: "172.17.0.0/16"
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
      name: initial-ip-pool
defaultFelixConfiguration:
  enabled: true
  prometheusMetricsEnabled: true
  logSeverityScreen: Debug
  # default value normally ^((en|wl|ww|sl|ib)[opsx].*|(eth|wlan|wwan).*)
  # add "X" so enX0 matches, and not just enx0
  mtuIfacePattern: "^((en|wl|ww|sl|ib)[opsxX].*|(eth|wlan|wwan).*)"
apiServer:
  enabled: true

From a control plane node, ping the IP of a pod behind a service that has logged the "no route to host" error, or ping the tunnel IP itself. The ping starts to fail after some time. This only happens on some of our nodes, not all of them, even though the hardware and OS version are the same. It has happened on two separate clusters (running on similar hardware) created with the same automation.

It seems to happen only in cases where the VXLAN tunnel IP is one higher than the first address of the IP block (a quick check of the pattern follows the route excerpt below):

172.17.100.0/26 via 172.17.100.0 dev vxlan.calico onlink     <-- not this one
172.17.100.192/26 via 172.17.100.193 dev vxlan.calico onlink     <---- this one
172.17.193.64/26 via 172.17.193.65 dev vxlan.calico onlink     <---- this one
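
A quick way to check that pattern against the three vxlan.calico routes above (is the gateway the block's network address, or the network address plus one?), purely illustrative:

// tunnelpattern.go: for each block routed over vxlan.calico (copied from the
// `ip r` output above), report whether the gateway is the block's network
// address or the network address + 1. Purely illustrative.
package main

import (
	"fmt"
	"net/netip"
)

func main() {
	routes := []struct{ cidr, gw string }{
		{"172.17.100.0/26", "172.17.100.0"},     // the one that does not flap
		{"172.17.100.192/26", "172.17.100.193"}, // flaps
		{"172.17.193.64/26", "172.17.193.65"},   // flaps
	}
	for _, r := range routes {
		base := netip.MustParsePrefix(r.cidr).Addr()
		gw := netip.MustParseAddr(r.gw)
		fmt.Printf("%-18s gw=%-15s gw==base+1: %v\n", r.cidr, r.gw, gw == base.Next())
	}
}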

Your Environment

youngderekm commented 2 weeks ago

Complete logs attached. logs.txt

caseydavenport commented 2 weeks ago

@youngderekm thanks for raising this. From a look at the logs, nothing jumps out at me. I also wouldn't typically expect the value of the IP address to be relevant here, but I suppose it could be.

@fasaxc recently did a pretty large refactor of the VXLAN data plane manager code, so it might be worth having him take a look. It's possible that something regressed there.