projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

BGP: advertise node's PodCIDR and not Pod IPs #8374

Open defo89 opened 10 months ago

defo89 commented 10 months ago

Expected Behavior

I wasn't able to find the expected behaviour in existing issues - apologies if I missed an obvious config option.

Given that Pod IPs don't move between nodes, it would make sense for each node to advertise its pod subnet, for easier troubleshooting and smaller routing tables in upstream switching fabrics.

Current Behavior

Currently the routing table contains individual Pod IPs and no /24 PodCIDR route. All /32s are exported upstream.

server # netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.10.10.129    0.0.0.0         UG        0 0          0 if2
10.10.10.128    0.0.0.0         255.255.255.224 U         0 0          0 if2
100.64.13.2     0.0.0.0         255.255.255.255 UH        0 0          0 cali343cea7bb3a
100.64.13.3     0.0.0.0         255.255.255.255 UH        0 0          0 cali7cb9a43f1b0
100.64.13.4     0.0.0.0         255.255.255.255 UH        0 0          0 cali7d65444adde
100.64.13.5     0.0.0.0         255.255.255.255 UH        0 0          0 calie0b1fbb70f6
100.64.13.6     0.0.0.0         255.255.255.255 UH        0 0          0 cali06122732368
100.64.13.7     0.0.0.0         255.255.255.255 UH        0 0          0 cali3593906f27f
100.64.13.8     0.0.0.0         255.255.255.255 UH        0 0          0 calic95b1fbb2d3

Upstream switch/bgp peer gets:

switch # show bgp vrf net-mgmt:net-mgmt ipv4 unicast neighbors 10.10.10.136 routes

Peer 10.10.10.136 routes for address family IPv4 Unicast:
BGP table version is 1973, local router ID is 10.10.10.253
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup

   Network            Next Hop            Metric     LocPrf     Weight Path
*>i10.11.11.224/27    10.10.10.136                      100          0 i
*>i100.64.13.2/32     10.10.10.136                      100          0 i
*>i100.64.13.3/32     10.10.10.136                      100          0 i
*>i100.64.13.4/32     10.10.10.136                      100          0 i
*>i100.64.13.5/32     10.10.10.136                      100          0 i
*>i100.64.13.6/32     10.10.10.136                      100          0 i
*>i100.64.13.7/32     10.10.10.136                      100          0 i
*>i100.64.13.8/32     10.10.10.136                      100          0 i

Possible Solution

Advertise an aggregate/summary equal to the node's .spec.podCIDR.
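
To illustrate the proposed aggregation, here is a minimal Python sketch using only the standard-library `ipaddress` module. It checks that the /32 routes shown in the output above all fall inside the node's podCIDR, so a single /24 advertisement would cover them. The helper name `covered_by_pod_cidr` is hypothetical and exists only for this illustration; it is not part of Calico.

```python
import ipaddress

# Pod IPs taken from the routing table in this issue (100.64.13.2 .. 100.64.13.8).
pod_ips = [f"100.64.13.{i}" for i in range(2, 9)]
pod_cidr = ipaddress.ip_network("100.64.13.0/24")  # the node's .spec.podCIDR

def covered_by_pod_cidr(ips, cidr):
    """Hypothetical helper: True if every address in ips lies inside cidr."""
    return all(ipaddress.ip_address(ip) in cidr for ip in ips)

# The /32 host routes that are currently exported upstream.
host_routes = [ipaddress.ip_network(f"{ip}/32") for ip in pod_ips]
all_covered = covered_by_pod_cidr(pod_ips, pod_cidr)

advertised = 1 if all_covered else len(host_routes)
print(f"{len(host_routes)} host routes -> {advertised} advertised prefix(es)")
```

With the addresses from this issue, the seven host routes collapse into the single prefix 100.64.13.0/24.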

Your Environment

Native routing with BGP, no IPIP/VXLAN; Pod IPAM is host-local.

› calicoctl get nodes server -ojson | jq .status
{
  "podCIDRs": [
    "100.64.13.0/24"
  ]
}
› kubectl get nodes server -ojsonpath={".spec.podCIDR"}
100.64.13.0/24

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  allowedUses:
  - Workload
  blockSize: 26  # effectively /24 here, since IPAM is not done by Calico
  cidr: 100.64.0.0/15
  disableBGPExport: false
  ipipMode: Never
  natOutgoing: false
  nodeSelector: all()
  vxlanMode: Never
  disabled: true

apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: false
  asNumber: 65001
  listenPort: 179
  bindMode: NodeIP

defo89 commented 10 months ago

While digging through the issues I stumbled across this comment: https://github.com/projectcalico/calico/issues/2900#issuecomment-537688234

After adding USE_POD_CIDR=true to the calico/node DaemonSet and the Typha Deployment, I observe the intended behaviour: the /32 Pod IPs are no longer announced and the /24 podCIDR prefix is.

switch # show bgp vrf net-mgmt:net-mgmt ipv4 unicast neighbors 10.10.10.136 routes

Peer 10.10.10.136 routes for address family IPv4 Unicast:
BGP table version is 1973, local router ID is 10.10.10.253
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup

   Network            Next Hop            Metric     LocPrf     Weight Path
*>i10.11.11.224/27    10.10.10.136                      100          0 i
*>i100.64.13.0/24     10.10.10.136                      100          0 i
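
As a sketch of the change described above, the following Python snippet builds a strategic-merge patch body that sets `USE_POD_CIDR=true` on one named container. The container names `calico-node` and `calico-typha` are assumptions based on common Calico manifests, and the helper `env_patch` is hypothetical; check the names against your own DaemonSet and Deployment before using anything like this.

```python
import json

def env_patch(container_name, name="USE_POD_CIDR", value="true"):
    """Hypothetical helper: build a strategic-merge patch that sets one
    environment variable on the container with the given name."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,
                            "env": [{"name": name, "value": value}],
                        }
                    ]
                }
            }
        }
    }

# Assumed container names; verify against your manifests.
node_patch = json.dumps(env_patch("calico-node"))
typha_patch = json.dumps(env_patch("calico-typha"))

print(node_patch)
print(typha_patch)
```

The printed JSON could then be passed to `kubectl patch` with `--patch` against the respective DaemonSet and Deployment, in whichever namespace your installation uses.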

I could not find this option in the docs, so I hope it is the intended fix and will not be deprecated?

caseydavenport commented 10 months ago

Curious that it isn't in the docs, but yes - if you're using host-local IPAM and not using Calico IPAM, then you should set USE_POD_CIDR=true to get the behavior you're expecting. It isn't a deprecated field. You can also see that our tigera/operator code sets the env var: https://github.com/tigera/operator/blob/1ccfdba7c9ddc0eb558e5829998546a6d9aedf64/pkg/render/node.go#L1478-L1485

Sounds like a case of missing docs - perhaps after the somewhat recent docs refactor (CC @ctauchen)

For completeness, when using Calico IPAM, route aggregation happens without the need for this environment variable because Calico is in control of where IPs and blocks are allocated within the cluster and doesn't need to be told to use the node.Spec.PodCIDR alternative.
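
As a rough illustration of the block-based case, this Python sketch computes how a pool divides into allocation blocks, each of which can be advertised as one prefix instead of per-Pod /32s. It reuses the `cidr` and `blockSize` values from the IPPool posted earlier purely as sample numbers; in this issue that pool is disabled and host-local IPAM is in use, so this is an assumption-laden example and not a description of the reporter's cluster.

```python
import ipaddress

pool = ipaddress.ip_network("100.64.0.0/15")  # cidr from the IPPool in this issue
block_size = 26                               # blockSize from the same IPPool

# subnets() yields the pool's blocks lazily; take the first one as a sample.
first_block = next(pool.subnets(new_prefix=block_size))

# Each extra prefix bit doubles the number of blocks.
num_blocks = 2 ** (block_size - pool.prefixlen)

print(f"sample block: {first_block}")
print(f"addresses per block: {first_block.num_addresses}")
print(f"blocks in pool: {num_blocks}")
```

Under these sample values, each /26 block holds 64 addresses and the /15 pool contains 2048 such blocks.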

defo89 commented 10 months ago

Thank you @caseydavenport, also for confirming the non-deprecation of the field.