Open thomseddon opened 5 years ago
@thomseddon I suspect you're encountering the same issue described here: https://github.com/projectcalico/confd/issues/253
Basically, Calico isn't doing the right thing when node-to-node mesh is enabled as well as being peered to a ToR. The fix is to turn off the node-to-node mesh, since route distribution should be done through the ToR.
Is there a reason you need node-node mesh enabled as well?
Ah, I hadn't spotted that, it does sound the same.
In this case, the peer is not the ToR and the intention is not for the peer to act as a RR for the cluster. The peer is routing between the cluster and another internal subnet. We'd just like to advertise services to the peer so they're reachable from the other subnet.
If this behavior is intended, then it's not really clear from the documentation that using explicit BGP peers is mutually exclusive with using the node-to-node mesh.
Actually, in my opinion this behavior is at best unnecessary and at worst incorrect.
I have a few ideas about way's this could be improved for all use cases which I'll add to https://github.com/projectcalico/confd/issues/253
In my view this is a bug. If you take away the node-to-node mesh then the ToR must be in the same AS as the cluster, and therefore must act as RR for the cluster, neither of which is desired (in my case at least). All cluster nodes use the same AS, so if the ToR is in a separate AS (such as in BGP confederations and/or other private AS designs) then BGP loop prevention would keep it from advertising routes from one node to the others in the cluster.
IMO there are two possibilities here:
or
In order to implement this behavior, Calico does the following.
- the service cluster IP CIDR (for example, 10.96.0.0/12) is advertised from every node in the cluster
- for each active service with externalTrafficPolicy: Local, the cluster IP for that service is advertised as a /32 route from the nodes that have endpoints for that service.
In this second case I believe the bug is that multihop
is set in the BGP template for all peers (including global ones), and that causes bird
to apply next-hop-self behavior instead of advertising the correct node as the next hop.
My preference is the first case. In my experience, indirect routing is a direct road to trouble. If the node hosting the service is unable to announce the IP itself, you probably don't want other nodes announcing on its behalf as something is likely broken.
Off the top of my head, for the first case the fix may be a filter applied to any global BGP peer which only permits static routes and not those learned via [i]BGP.
@thomseddon On review, I see that your case here also has a different AS number for the external BIRD. So the explanation is the same as I've just commented about in projectcalico/node#253 .
@cfiske Thanks. If we must support service advertisement to an eBGP peer, I agree that the possible solutions are as you have laid out, or @thomseddon 's suggestion of adding "next hop keep" to the peer config; and I think agree with you that (1) would be the best. The documentation you've mentioned under (2) is definitely what we would like to happen, so that inbound traffic to a service with externalTrafficPolicy: Local
can be SNAT-free. In general "next hop keep" would surely be wrong for eBGP peerings, so we'd need to add a BGPPeer field for that solution; and a knob that could be set wrongly is obviously worse than automatically generating the required filters.
But, how important and realistic is the eBGP case? In other words:
+----------+
_| AS 65502 |___
/ +----------+ \
.--' / \ '----.
/ / \ \
+----------+ +----------+ +----------+ +----------+
| AS 65501 |==| AS 65501 |==| AS 65501 |==| AS 65501 |
+----------+ +----------+ +----------+ +----------+
That looks quite odd to me, when there is also a full mesh between the nodes. @cfiske @thomseddon @mleklund @caseydavenport WDYT?
I can't speak for the community overall but that is exactly my use case, and one I believe to be reasonable for networks making use of BGP confederations.
In my comment in projectcalico/confd#253 I missed that the peering in question was eBGP. RR is specific to iBGP, as the behavior it provides is the default behavior of eBGP peers: announce every known route (except those whose AS_PATH contain the peer's AS).
I agree with your concerns regarding "next hop keep" and would much prefer the filtering option.
Assigning to @neiljerram for triage
I know the consensus above was that an automatic filter would be the ideal solution, but thought I'd chime in to say that there is an option now to keep the original next hop on routes advertised to eBGP peers, which might be useful here.
https://github.com/projectcalico/libcalico-go/blob/master/lib/apis/v3/bgppeer.go#L75-L78
Added here: https://github.com/projectcalico/libcalico-go/pull/1266
We'd like to make a kubernetes service reachable from outside the cluster (4 servers), on a specific internal IP address. We'd also like to preserve the client source address. As such, we have created the following kubernetes service:
Each node has a peering with a router that can route to another internal subnet, for example:
We have also added
CALICO_ADVERTISE_CLUSTER_IPS
.With this setup, calico-node correctly advertises the /26 pod cidr's to router1 and also correctly advertises the /22 service cidr.
Additionally, the node running the example service (server1), correctly advertises the /32 service IP (172.15.100.150).
However, all other nodes also seem to be advertising the /32 service IP, with their own IP as the next-hop.
The documentation reads:
:point_up: this is working correctly :+1:
:point_up: to me this indicates that only the nodes that are running the pod should advertise the service, which is currently not happening.
Looking at the bird route tables, on server1, which has the pod:
On server2, which does not have the pod:
On router1, received from server1:
On router1, routes from server2:
To me, this looks like server2 is learning the route from server1 via the mesh (correct), and it is then re-advertising the route to router1 (incorrect according to my reading of the documentation) with its own address as the next-hop
Expected Behavior
Only nodes running an endpoint for service should advertise a service, or, nodes should only advertise a next-hop of nodes running endpoints for a service.
Current Behavior
All nodes running a service advertise themselves as next-hop for service.
Possible Solution
Don't re-advertise routes learnt via the mesh?
Steps to Reproduce (for bugs)
Context
This currently means that nodes that aren't running any endpoints for a service are still receiving traffic for the service, they are correctly forwarding the traffic to the node running the service however they are NAT'ing the traffic in the process, this means that the original client source is lost and the
externalTrafficPolicy
is effectively not being correctly respected.Your Environment
calico/node version: Client Version: v3.9.0-0.dev-15-gd1cfdabf Git commit: d1cfdabf Cluster Version: v3.8.0 Cluster Type: k8s,bgp,kdd
Operating System and version: cat /etc/os-release | head -n 1 PRETTY_NAME="Debian GNU/Linux 10 (buster)" iptables --version iptables v1.6.0