JimmyMa opened 3 years ago
@JimmyMa interesting. I think this scenario is a bit outside the set of use-cases that Calico typically handles, but it might be workable.
If I understand correctly, you have two clusters with the same Service CIDR, and you want to advertise ECMP routes for that Service CIDR so that traffic to a service IP is split equally between the two clusters?
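For concreteness, a sketch of what such an ECMP route for a shared Service CIDR could look like in the kernel (the CIDR and next-hop addresses here are invented for illustration):

10.96.0.0/12 proto bird
        nexthop via 192.0.2.10 dev eth0 weight 1
        nexthop via 198.51.100.10 dev eth0 weight 1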
If I had to guess, I would say we probably haven't implemented IPIP route programming for ECMP routes in our BIRD code, since I don't think we expected ECMP routing to ever occur for IPIP mode.
CC @neiljerram
@caseydavenport We have internal tracking for this at https://tigera.atlassian.net/browse/CNX-10379. I'm afraid @JimmyMa won't be able to see that directly, but the summary is exactly as you say: we made some Calico-specific patches to BIRD to handle IP-IP routes, and unfortunately those patches don't work in the ECMP case.
@caseydavenport @neiljerram thank you for the comments. I think this change https://github.com/projectcalico/confd/pull/379 enabled ECMP in BIRD, but the routes generated in the kernel are incorrect for ipipMode.
@JimmyMa Yes, that confd change enables BIRD to program ECMP routes into the kernel. But we are still missing some support in the BIRD code.
@JimmyMa I've created https://github.com/projectcalico/bird/issues/90 (an issue in our BIRD fork repo) to publish everything that we know about why the ECMP + IP-IP combination does not work. Please take a look and feel free to comment or to contribute towards possible solutions.
@JimmyMa I have been thinking about this problem again, and am wondering if the apparent problem is in fact solved by the routing for 10.240.3.49. In other words, if there is a single path route for the pod block
198.19.0.0/16 via 10.240.3.49 dev tunl0 proto bird onlink
but there is an ECMP route to get to 10.240.3.49, such as
10.240.3.49 proto bird
nexthop via 10.240.1.57 dev ens3 weight 1
nexthop via 10.240.1.58 dev ens3 weight 1
then perhaps we would see repeated connections to a pod in 198.19.0.0/16 (with different source ports, and assuming fib_multipath_hash_policy=1) using both underlying ECMP paths, as a result of the 2-step routing resolution.
If so, doesn't that mean that the single path IPIP route here is actually fine?
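(As an aside, that fib_multipath_hash_policy assumption is easy to verify on a node - a sketch using the standard Linux sysctl:)

# 0 hashes on L3 (source/destination IP) only; 1 hashes on the 5-tuple,
# which is what lets different source ports take different ECMP paths
sysctl net.ipv4.fib_multipath_hash_policy
sysctl -w net.ipv4.fib_multipath_hash_policy=1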
I tested this last week - although with a VXLAN overlay instead of IPIP - and it broadly appears to work as suggested in my previous comment.
I created a test server pod running
nc -l -k 10.244.195.197 8888
and a test client pod running
for sp in 30001 30002 30003 30004 30005 30006 30007 30008 30009; do echo hello$sp | nc -N -p $sp 10.244.195.197 8888; done
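(For readers following along, the same loop with the nc flags annotated - behavior unchanged, OpenBSD-style nc assumed:)

for sp in 30001 30002 30003 30004 30005 30006 30007 30008 30009; do
  # -N closes the connection after EOF on stdin; -p pins the source port,
  # giving nine distinct 4-tuples to the same destination IP and port
  echo hello$sp | nc -N -p $sp 10.244.195.197 8888
done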
and used tcpdump to observe traffic through the two NICs of the client pod's node
tcpdump -i eth0 -n -v udp port 4789 and dst 172.31.20.3
tcpdump -i eth1 -n -v udp port 4789 and dst 172.31.20.3
10.244.195.197 is the IP of the server pod, and 172.31.20.3 is the stable IP of the server pod's node.
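A useful companion check (a sketch, using the same addresses as above) is to confirm that the kernel really has an ECMP route toward the server node's stable IP:

# Show all routes that match the server pod's node
ip route show to match 172.31.20.3
# Resolve one concrete path (the kernel picks a single next hop)
ip route get 172.31.20.3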
Good observations:
However, I expected that the connection for a given source port would reliably use the same NIC for all of its outbound packets, and that was not the case. Instead - for example - I saw the outbound SYN go through eth1, but then the outbound SYN ACK would go through eth0, and then the data packet for that connection would also go through eth0.
More research is needed to understand why that happens, instead of seeing a reliable association from 4-tuple to NIC.
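One avenue worth checking (an assumption on my part, not something established above): with VXLAN the kernel derives the outer UDP source port from a hash of the inner flow, so a single inner connection should keep one outer source port and hence one ECMP path. Capturing the outer source ports would show whether that holds:

# If the same inner 4-tuple appears with several different outer source
# ports, the flow hash is not stable across the packets of a connection
tcpdump -i any -nn -v udp dst port 4789 and dst 172.31.20.3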
In my cluster, the ipipMode is Always for IP Pool 198.19.0.0/16, as below:
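(The YAML itself was not captured in this thread; the sketch below reconstructs it from the details above - only cidr and ipipMode come from the issue, the other fields are assumptions:)

calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 198.19.0.0/16
  ipipMode: Always
  natOutgoing: true
EOF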
When there are multiple next hops, the routes below are generated, and they do not use tunl0:
When there is only one next hop, the route below is generated, and it uses tunl0:
Expected Behavior
I hope it generates routes that use tunl0 when there are multiple next hops.
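The route dump from the original report was not captured here; purely as a hypothetical illustration (the second next hop is invented), the desired ECMP route might look like:

198.19.0.0/16 proto bird
        nexthop via 10.240.3.49 dev tunl0 weight 1 onlink
        nexthop via 10.240.3.50 dev tunl0 weight 1 onlink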
Context
I have two k8s clusters, each with a node acting as a route reflector, and the two route reflectors are peered. Each cluster advertises its service CIDR to the other cluster, and I need all traffic to go over IPIP.
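For reference, a sketch of how the service CIDR advertisement described above can be configured in Calico (the CIDR value is an assumption):

calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceClusterIPs:
  - cidr: 10.96.0.0/12
EOF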
Your Environment