externalTrafficPolicy not working as expected - calico nodes advertising services for which they don't have endpoints

thomseddon commented 5 years ago

We'd like to make a kubernetes service reachable from outside the cluster (4 servers), on a specific internal IP address. We'd also like to preserve the client source address. As such, we have created the following kubernetes service:

apiVersion: v1
kind: Service
metadata:
  name: example
  labels:
    app: example
spec:
  type: NodePort
  clusterIP: 172.15.100.150
  externalTrafficPolicy: Local
  ports:
  - name: example
    port: 1000
    protocol: UDP
  selector:
    app: example

Each node has a peering with a router that can route to another internal subnet, for example:

apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: bgppeer-router1-uk1
spec:
  asNumber: 65500
  node: server2-uk1-man
  peerIP: 1.1.1.1

We have also added CALICO_ADVERTISE_CLUSTER_IPS.

With this setup, calico-node correctly advertises the /26 pod cidr's to router1 and also correctly advertises the /22 service cidr.

Additionally, the node running the example service (server1), correctly advertises the /32 service IP (172.15.100.150).

However, all other nodes also seem to be advertising the /32 service IP, with their own IP as the next-hop.

The documentation reads:

    the service cluster IP CIDR (for example, 10.96.0.0/12) is advertised from every node in the cluster

:point_up: this is working correctly :+1:

    for each active service with externalTrafficPolicy: Local, the cluster IP for that service is advertised as a /32 route from the nodes that have endpoints for that service.

:point_up: to me this indicates that only the nodes that are running the pod should advertise the service, which is currently not happening.

Looking at the bird route tables, on server1, which has the pod:

BIRD v0.3.3+birdv1.6.3 ready.
bird> show route
...snip...
172.15.100.150/32  blackhole [static1 17:05:12] * (200)
...snip...

On server2, which does not have the pod:

BIRD v0.3.3+birdv1.6.3 ready.
bird> show route
...snip...
172.15.100.150/32  via 1.1.1.18 on bond0 [Mesh_1_1_1_18 17:05:12] * (100/0) [i]
...snip...

On router1, received from server1:

$ show ip bgp neighbors 1.1.1.18 received-routes 
BGP table version is 5422, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

    Network          Next Hop            Metric    LocPrf       Weight Path
*>  172.15.7.64/26   1.1.1.18                             0       65502 i
*>  172.15.8.192/26  1.1.1.18                             0       65502 i
*>  172.15.14.192/26 1.1.1.18                             0       65502 i
*>  172.15.100.0/22  1.1.1.18                             0       65502 i
*>  172.15.100.150/32
                    1.1.1.18                             0       65502 i

On router1, routes from server2:

$ show ip bgp neighbors 1.1.1.19 received-routes 
BGP table version is 5422, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

    Network          Next Hop            Metric    LocPrf       Weight Path
*>  172.15.7.64/26   1.1.1.19                             0       65502 i
*>  172.15.8.192/26  1.1.1.19                             0       65502 i
*>  172.15.14.192/26 1.1.1.19                             0       65502 i
*>  172.15.100.0/22  1.1.1.19                             0       65502 i
*>  172.15.100.150/32
                    1.1.1.19                             0       65502 i

To me, this looks like server2 is learning the route from server1 via the mesh (correct), and it is then re-advertising the route to router1 (incorrect according to my reading of the documentation) with its own address as the next-hop

Expected Behavior

Only nodes running an endpoint for service should advertise a service, or, nodes should only advertise a next-hop of nodes running endpoints for a service.

Current Behavior

All nodes running a service advertise themselves as next-hop for service.

Possible Solution

Don't re-advertise routes learnt via the mesh?

Steps to Reproduce (for bugs)

Setup:
- Create cluster of 2 or more nodes
- Add BGP peer for each node
- Create service as shown above (+ deployment)
Look what routes are advertised from each node

Context

This currently means that nodes that aren't running any endpoints for a service are still receiving traffic for the service, they are correctly forwarding the traffic to the node running the service however they are NAT'ing the traffic in the process, this means that the original client source is lost and the externalTrafficPolicy is effectively not being correctly respected.

Your Environment

calico/node version: Client Version: v3.9.0-0.dev-15-gd1cfdabf Git commit: d1cfdabf Cluster Version: v3.8.0 Cluster Type: k8s,bgp,kdd
Operating System and version: cat /etc/os-release | head -n 1 PRETTY_NAME="Debian GNU/Linux 10 (buster)" iptables --version iptables v1.6.0

caseydavenport commented 5 years ago

@thomseddon I suspect you're encountering the same issue described here: https://github.com/projectcalico/confd/issues/253

Basically, Calico isn't doing the right thing when node-to-node mesh is enabled as well as being peered to a ToR. The fix is to turn off the node-to-node mesh, since route distribution should be done through the ToR.

Is there a reason you need node-node mesh enabled as well?

thomseddon commented 5 years ago

Ah, I hadn't spotted that, it does sound the same.

In this case, the peer is not the ToR and the intention is not for the peer to act as a RR for the cluster. The peer is routing between the cluster and another internal subnet. We'd just like to advertise services to the peer so they're reachable from the other subnet.

If this behavior is intended, then it's not really clear from the documentation that using explicit BGP peers is mutually exclusive with using the node-to-node mesh.

Actually, in my opinion this behavior is at best unnecessary and at worst incorrect.

I have a few ideas about way's this could be improved for all use cases which I'll add to https://github.com/projectcalico/confd/issues/253

cfiske commented 5 years ago

In my view this is a bug. If you take away the node-to-node mesh then the ToR must be in the same AS as the cluster, and therefore must act as RR for the cluster, neither of which is desired (in my case at least). All cluster nodes use the same AS, so if the ToR is in a separate AS (such as in BGP confederations and/or other private AS designs) then BGP loop prevention would keep it from advertising routes from one node to the others in the cluster.

IMO there are two possibilities here:

Each node learns service IP routes from other nodes via iBGP and advertises them out to eBGP neighbors. This is the default behavior of BGP unless a prefix list/filter is applied to either the mesh peers or the global peers. So the bug would be the lack of such filter, resulting in every node advertising the service IP routes instead of just those which have endpoints for them.

or

Individual service IPs are meant be announced from all nodes but the documentation is wrong in stating:

In order to implement this behavior, Calico does the following.
- the service cluster IP CIDR (for example, 10.96.0.0/12) is advertised from every node in the cluster
- for each active service with externalTrafficPolicy: Local, the cluster IP for that service is advertised as a /32 route from the nodes that have endpoints for that service.

In this second case I believe the bug is that multihop is set in the BGP template for all peers (including global ones), and that causes bird to apply next-hop-self behavior instead of advertising the correct node as the next hop.

My preference is the first case. In my experience, indirect routing is a direct road to trouble. If the node hosting the service is unable to announce the IP itself, you probably don't want other nodes announcing on its behalf as something is likely broken.

Off the top of my head, for the first case the fix may be a filter applied to any global BGP peer which only permits static routes and not those learned via [i]BGP.

nelljerram commented 4 years ago

@thomseddon On review, I see that your case here also has a different AS number for the external BIRD. So the explanation is the same as I've just commented about in projectcalico/node#253 .

@cfiske Thanks. If we must support service advertisement to an eBGP peer, I agree that the possible solutions are as you have laid out, or @thomseddon 's suggestion of adding "next hop keep" to the peer config; and I think agree with you that (1) would be the best. The documentation you've mentioned under (2) is definitely what we would like to happen, so that inbound traffic to a service with externalTrafficPolicy: Local can be SNAT-free. In general "next hop keep" would surely be wrong for eBGP peerings, so we'd need to add a BGPPeer field for that solution; and a knob that could be set wrongly is obviously worse than automatically generating the required filters.

But, how important and realistic is the eBGP case? In other words:

                   +----------+
                  _| AS 65502 |___
                 / +----------+   \
             .--'     /     \      '----.
            /        /       \           \
+----------+  +----------+  +----------+  +----------+ 
| AS 65501 |==| AS 65501 |==| AS 65501 |==| AS 65501 | 
+----------+  +----------+  +----------+  +----------+

That looks quite odd to me, when there is also a full mesh between the nodes. @cfiske @thomseddon @mleklund @caseydavenport WDYT?

cfiske commented 4 years ago

I can't speak for the community overall but that is exactly my use case, and one I believe to be reasonable for networks making use of BGP confederations.

In my comment in projectcalico/confd#253 I missed that the peering in question was eBGP. RR is specific to iBGP, as the behavior it provides is the default behavior of eBGP peers: announce every known route (except those whose AS_PATH contain the peer's AS).

I agree with your concerns regarding "next hop keep" and would much prefer the filtering option.

rafaelvanoni commented 4 years ago

Assigning to @neiljerram for triage

caseydavenport commented 3 years ago

I know the consensus above was that an automatic filter would be the ideal solution, but thought I'd chime in to say that there is an option now to keep the original next hop on routes advertised to eBGP peers, which might be useful here.

https://github.com/projectcalico/libcalico-go/blob/master/lib/apis/v3/bgppeer.go#L75-L78

Added here: https://github.com/projectcalico/libcalico-go/pull/1266

projectcalico / calico