samcday / home-cluster

BGP multi-path #629

Closed samcday closed 2 months ago

samcday commented 2 months ago

Right now quagga bgpd is only picking one "best" path for each address advertised from the cluster.
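For reference, I believe the relevant knob in both quagga and FRR is `maximum-paths` under the address-family; something roughly like this (AS number is just a placeholder) is what "multi-path" means here:

```
router bgp 64512
 address-family ipv4 unicast
  maximum-paths 8
 exit-address-family
```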

samcday commented 2 months ago

I didn't realize FRR was trivially available on OpenWrt, so I've switched to using it. The docs are infinitely better. This whole software suite is still kinda challenging to work with, though.

Anyway, I've managed to muddle my way through a basic FRR setup that configures the router as a "Route Server". AFAICT when in that mode you can only use "route maps". I feel like I understand the basics of these in principle, but I couldn't seem to properly filter the incoming advertisements from the cluster down to just 10.0.2.0/24, so I figured "fuggit" and for now any advertisement from the cluster is accepted.
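For my own notes, here's a rough, untested sketch of the filter I think I was going for: a prefix-list matched from a route-map applied inbound on the cluster neighbors. My guess is the `le 32` is what I was missing, since a bare `permit 10.0.2.0/24` only matches the exact /24 and not the more-specific /32 service routes the cluster advertises. Neighbor address and AS numbers below are illustrative:

```
ip prefix-list CLUSTER-SVCS seq 10 permit 10.0.2.0/24 le 32
!
route-map FROM-CLUSTER permit 10
 match ip address prefix-list CLUSTER-SVCS
!
router bgp 64512
 neighbor 10.0.1.11 remote-as 64513
 address-family ipv4 unicast
  neighbor 10.0.1.11 route-server-client
  neighbor 10.0.1.11 route-map FROM-CLUSTER in
  maximum-paths 8
 exit-address-family
```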

The initial basic config I came up with seemed to do the ECMP thing I expected straight out of the box:

```
root@home-cluster-router:~# ip route
default via 192.168.178.1 dev wan proto static src 192.168.178.21
10.0.1.0/24 dev br-lan proto kernel scope link src 10.0.1.1
10.0.2.1 nhid 113 proto bgp metric 20
    nexthop via 10.0.1.11 dev br-lan weight 1
    nexthop via 10.0.1.12 dev br-lan weight 1
    nexthop via 10.0.1.13 dev br-lan weight 1
10.0.2.2 nhid 114 proto bgp metric 20
    nexthop via 10.0.1.11 dev br-lan weight 1
    nexthop via 10.0.1.12 dev br-lan weight 1
10.0.2.3 nhid 114 proto bgp metric 20
    nexthop via 10.0.1.11 dev br-lan weight 1
    nexthop via 10.0.1.12 dev br-lan weight 1
10.0.2.4 nhid 94 via 10.0.1.10 dev br-lan proto bgp metric 20
10.0.2.5 nhid 115 proto bgp metric 20
    nexthop via 10.0.1.11 dev br-lan weight 1
    nexthop via 10.0.1.12 dev br-lan weight 1
    nexthop via 10.0.1.13 dev br-lan weight 1
    nexthop via 10.0.1.10 dev br-lan weight 1
    nexthop via 10.0.1.14 dev br-lan weight 1
192.168.178.0/24 dev wan proto kernel scope link src 192.168.178.21
```

This is becoming quite the rabbit-hole though, because when testing with `for n in $(seq 0 10); do curl https://alertmanager.samcday.com; done`, all traffic still appears to be hitting a single ingress-nginx pod. Hrmph.

samcday commented 2 months ago

Ah, well, the fact that all traffic in that test flowed through one node is to be expected - that's literally how ECMP works, AFAIU. Maglev is actually a similar concept: basically the source+destination address/port (the "flow") is hashed to pick one specific path out of all the available paths.

I've proven this to myself simply by doing a traceroute to 10.0.2.1 from the router, and from my desktop - they hit different (consistent) paths.

I wonder how I would go about completely distributing the traffic at the packet level. Because the Cilium L4LB is in Maglev mode, in theory this should work (that is, packets in a single flow should be able to traverse any/random path and end up at the same destination).

It seems like that's not a very popular/common thing to do, presumably because without something like Maglev on the upstream LB tier you'd have all sorts of weird fragmentation issues.

samcday commented 2 months ago

Alright, so I kept pushing on this (weirdly niche?) topic and made a little more progress. This Super User answer crystallized it. Setting `net.ipv4.fib_multipath_hash_policy=1` instructs the kernel to hash the full 5-tuple (L4 ports as well as addresses) when choosing an ECMP path, rather than the default L3-only hash.
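On OpenWrt that's just a sysctl. Something like this should apply it now and keep it across reboots (I think /etc/sysctl.conf is still honoured on OpenWrt), and if the full iproute2 `ip` is installed (not the busybox one), `ip route get` can show which nexthop a given flow would pick:

```
# apply immediately
sysctl -w net.ipv4.fib_multipath_hash_policy=1

# persist across reboots
echo "net.ipv4.fib_multipath_hash_policy=1" >> /etc/sysctl.conf

# with L4 hashing enabled, varying the source port can land on different nexthops
ip route get 10.0.2.1 ipproto tcp sport 10000 dport 443
ip route get 10.0.2.1 ipproto tcp sport 10001 dport 443
```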

This means that my contrived `for ...; do curl a.bgp.service; done` example does indeed spread traffic across all available ECMP paths much better than before, because the hash now includes the L4 source+dst ports and not just the L3 addresses. A single TCP connection is still pinned to a single path from router -> cluster, but I'm okay with that for now.

I might be closer than I thought to being able to run the cluster router as an HA pair, which could be kinda fun/neat.