projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

IPIP mode does not support ECMP routing #4462

Open JimmyMa opened 3 years ago

JimmyMa commented 3 years ago

In my cluster, the ipipMode is Always for IP Pool 198.19.0.0/16, as below:

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: controlplane-services-cidr
spec:
  blockSize: 26
  cidr: 198.19.0.0/16
  disabled: true
  ipipMode: Always
  nodeSelector: all()
  vxlanMode: Never
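For reference, the pool's settings and the resulting kernel routes can be inspected as below (a sketch; it assumes calicoctl is installed and configured for the cluster):

```shell
# Confirm the pool's encapsulation settings (ipipMode, vxlanMode).
calicoctl get ippool controlplane-services-cidr -o yaml

# Inspect the kernel route(s) BIRD has programmed for the pool CIDR.
ip route show 198.19.0.0/16
```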

When there are multiple next hops, the following routes are generated, and they do not use tunl0:

198.19.0.0/16 proto bird
    nexthop via 10.240.1.57 dev ens3 weight 1
    nexthop via 10.240.1.58 dev ens3 weight 1

When there is only one next hop, the following route is generated, and it does use tunl0:

198.19.0.0/16 via 10.240.3.49 dev tunl0 proto bird onlink

Expected Behavior

When there are multiple next hops, I would expect routes using tunl0 to be generated, as below:

198.19.0.0/16 proto bird
    nexthop via 10.240.1.57 dev tunl0 weight 1
    nexthop via 10.240.1.58 dev tunl0 weight 1
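For experimentation, a route of roughly this shape can be programmed by hand with iproute2; this is only a sketch of what the expected route would look like, since BIRD owns this prefix and may replace any manually added route on its next kernel sync:

```shell
# Manually program the expected ECMP route over the IPIP device.
# 'onlink' tells the kernel to accept next hops that are not directly
# reachable on tunl0. 'proto static' is used here instead of 'proto bird'
# because this route is hand-installed, not programmed by BIRD.
ip route replace 198.19.0.0/16 proto static \
    nexthop via 10.240.1.57 dev tunl0 onlink weight 1 \
    nexthop via 10.240.1.58 dev tunl0 onlink weight 1
```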

Context

I have two Kubernetes clusters. Each cluster has a node acting as a route reflector, and the two route reflectors are peered with each other. Each cluster advertises its service CIDR to the other cluster, and I need all cross-cluster traffic to be IPIP-encapsulated.
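For readers wanting to reproduce a setup along these lines, a rough sketch in Calico resources is below; the node name, peer IP, and AS number are hypothetical examples, not values from this issue:

```shell
# Sketch only: peer this cluster's route reflector with the other
# cluster's route reflector, and advertise the service CIDR over BGP.
# All names, IPs, and AS numbers here are hypothetical examples.
calicoctl apply -f - <<'EOF'
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: other-cluster-rr
spec:
  node: my-rr-node          # this cluster's route reflector node (example)
  peerIP: 10.240.2.10       # other cluster's route reflector IP (example)
  asNumber: 64513           # other cluster's AS number (example)
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceClusterIPs:
  - cidr: 198.19.0.0/16     # service CIDR advertised to BGP peers
EOF
```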

Your Environment

caseydavenport commented 3 years ago

@JimmyMa interesting. I think this scenario is a bit outside the set of use-cases that Calico typically handles, but it might be workable.

If I understand correctly, you have two clusters with the same Service CIDR, and you want to advertise ECMP routes for that Service CIDR so that traffic to a service IP is split equally between the two clusters?

If I had to guess, I would say we probably haven't implemented IPIP route programming for ECMP routes in our BIRD code, since I don't think we expected ECMP routing to ever occur for IPIP mode.

CC @neiljerram

nelljerram commented 3 years ago

@caseydavenport We have internal tracking for this at https://tigera.atlassian.net/browse/CNX-10379. I'm afraid @JimmyMa won't be able to see that directly, but the summary is exactly as you say: we made some Calico-specific patches to BIRD to handle IP-IP routes, and unfortunately those patches don't work in the ECMP case.

JimmyMa commented 3 years ago

@caseydavenport @neiljerram thank you for the comments. I think https://github.com/projectcalico/confd/pull/379 enabled ECMP in BIRD, but the routes generated in the kernel are incorrect for ipipMode.

nelljerram commented 3 years ago

@JimmyMa Yes, that confd change enables BIRD to program ECMP routes into the kernel. But we are still missing some support in the BIRD code.

nelljerram commented 3 years ago

@JimmyMa I've created https://github.com/projectcalico/bird/issues/90 (an issue in our BIRD fork repo) to publish everything that we know about why the ECMP + IP-IP combination does not work. Please take a look and feel free to comment or to contribute towards possible solutions.

nelljerram commented 1 year ago

@JimmyMa I have been thinking about this problem again, and am wondering whether the apparent problem is in fact solved by the routing for the tunnel endpoint 10.240.3.49. In other words, if there is a single-path route for the pod block

198.19.0.0/16 via 10.240.3.49 dev tunl0 proto bird onlink

but there is an ECMP route to get to 10.240.3.49, such as

10.240.3.49 proto bird
    nexthop via 10.240.1.57 dev ens3 weight 1
    nexthop via 10.240.1.58 dev ens3 weight 1

then perhaps we would see repeated connections to a pod in 198.19.0.0/16 (with different source ports, and assuming fib_multipath_hash_policy=1) using both underlying ECMP paths, as a result of the 2-step routing resolution.
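One way to check this two-step resolution from the command line, assuming root access and a reasonably recent iproute2 (the sport/dport/ipproto selectors of `ip route get`); 198.19.0.5 is just an example destination inside the pool:

```shell
# Enable 5-tuple (L4) hashing for IPv4 multipath routing, as assumed above.
sysctl -w net.ipv4.fib_multipath_hash_policy=1

# Ask the kernel which path it would choose for a few different source
# ports; with L4 hashing, the selected next hop should vary across ports.
for sp in 30001 30002 30003 30004; do
  ip route get 198.19.0.5 ipproto tcp sport "$sp" dport 8888
done
```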

If so, doesn't that mean that the single path IPIP route here is actually fine?

nelljerram commented 1 year ago

I tested this last week - although with a VXLAN overlay instead of IPIP - and it broadly appears to work as suggested in my previous comment.

I created a test server pod running

nc -l -k 10.244.195.197 8888

and a test client pod running

for sp in 30001 30002 30003 30004 30005 30006 30007 30008 30009; do echo hello$sp | nc -N -p $sp 10.244.195.197 8888; done

and used tcpdump to observe traffic through the two NICs of the client pod's node

tcpdump -i eth0 -n -v udp port 4789 and dst 172.31.20.3
tcpdump -i eth1 -n -v udp port 4789 and dst 172.31.20.3

10.244.195.197 is the IP of the server pod and 172.31.20.3 is the stable IP of the server pod's node.

Good observations:

  1. tcpdumps showed that both NICs were being used.
  2. When I disabled eth0 on the source node and repeated the test, all the connections succeeded using eth1.

However, I expected that the connection for a given source port would reliably use the same NIC for all of its outbound packets, and that was not the case. Instead - for example - I saw the outbound SYN go through eth1, but then the outbound SYN ACK went through eth0, and the data packet for that connection also went through eth0.

More research is needed to understand why that happens, instead of seeing a reliable association from 4-tuple to NIC.
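One possible avenue, assuming the VXLAN test setup above: the kernel hashes the *outer* headers of VXLAN packets, and the outer UDP source port is derived from a hash of the inner flow, so comparing the outer source port across packets of a single inner connection might show whether the ECMP hash input is changing mid-connection. A diagnostic sketch (interface and destination IP are from the test above):

```shell
# Capture the outer VXLAN UDP header; if packets of one inner connection
# show different outer source ports, the ECMP hash input is not stable
# for that connection, which would explain the NIC flapping.
tcpdump -i any -n -e udp port 4789 and dst 172.31.20.3
```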