github / glb-director

GitHub Load Balancer Director and supporting tooling.

glb-director failing host + ecmp/ibgp #115

Closed linecolumn closed 3 years ago

linecolumn commented 3 years ago

Trying to learn as much before trying it out.

How does glb-director announce that one of its hosts is down? Is there any automation with BIRD or other BGP software? And how is ECMP done?

Thanks!

theojulienne commented 3 years ago

How does glb-director announce that one of its hosts is down? Is there any automation with BIRD or other BGP software? And how is ECMP done?

At GitHub we run ExaBGP and we have a script that is run as an ExaBGP process that monitors the health of the glb-director process and withdraws routes if it becomes unhealthy. If the node fails entirely, that obviously also cuts off the BGP session and withdraws the routes. Multiple director hosts announce the same IPs, so all switches within the datacenter use ECMP to balance requests between the directors.
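To make the announce/withdraw behaviour concrete, here is a minimal sketch of a script that ExaBGP could run as a process. It is not GitHub's script (which is not public). The prefix `192.0.2.10/32`, the `systemctl is-active` check, and the function names are assumptions for illustration. ExaBGP reads text API commands such as `announce route ... next-hop self` from the process's standard output.

```python
import subprocess
import sys
import time

PREFIX = "192.0.2.10/32"  # hypothetical service IP (documentation range)
# Assumed health check: exit status 0 means the unit is active.
CHECK_CMD = ["systemctl", "is-active", "--quiet", "glb-director"]


def is_healthy(cmd=CHECK_CMD):
    """Return True when the check command exits with status 0."""
    try:
        return subprocess.run(cmd, check=False).returncode == 0
    except OSError:
        # The check binary is missing or not executable: treat as unhealthy.
        return False


def next_command(healthy, announced, prefix=PREFIX):
    """Return the ExaBGP API line to emit, or None if the state is unchanged."""
    if healthy and not announced:
        return f"announce route {prefix} next-hop self"
    if not healthy and announced:
        return f"withdraw route {prefix} next-hop self"
    return None


def run(check=is_healthy, interval=1.0, out=sys.stdout):
    """Poll the health check forever and emit a command on each state change."""
    announced = False
    while True:
        command = next_command(check(), announced)
        if command is not None:
            out.write(command + "\n")
            out.flush()  # ExaBGP reads commands from this process's stdout
            announced = command.startswith("announce")
        time.sleep(interval)
```

Calling `run()` at the end of the script starts the loop. If the whole host dies, no withdraw line is needed: the BGP session drops and the peer removes the routes on its own.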

Hope that answers your questions!

linecolumn commented 3 years ago

That makes sense!

But the network diagram on the glb-director home page shows "network routers" as responsible for ECMP to the glb-director(s), not switches. Basically, the glb-directors "talk" via BGP to the switches, and the switches send traffic back to the glb-directors via ECMP? So the network routers are not involved in the process (except for having a static route to the interface connecting to the switch(es))?

Script checking health of glb-director + ExaBGP, is it open sourced as well?

Thanks!

theojulienne commented 3 years ago

Basically, the glb-directors "talk" via BGP to the switches, and the switches send traffic back to the glb-directors via ECMP? So the network routers are not involved in the process (except for having a static route to the interface connecting to the switch(es))?

Although glb-director doesn't enforce any standards for how you route things, in our datacenters the routes are announced via BGP from the glb-director to a leaf switch, which also performs routing. The leaf re-announces these routes to other devices in the network (in our case a spine switch), and uses ECMP to send any packets it receives for those routes to one of its directly connected glb-director machines. Most of the time we spread directors between racks, so the bulk of the ECMP work happens at the spine level (across different leafs).
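As a rough illustration of what ECMP does with equal-cost routes, the sketch below hashes a flow's 5-tuple and uses the result to pick one next hop, so every packet of a flow takes the same path. Real switches use vendor-specific hardware hash functions. SHA-256 and the name `pick_next_hop` are stand-ins chosen for this example, and the addresses are made up.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Pick one next hop for a flow.

    flow is (src_ip, dst_ip, protocol, src_port, dst_port); next_hops is the
    list of equal-cost next hops currently announcing the destination.
    """
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the digest onto the available next hops.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# A spine choosing between two leafs, then a leaf choosing between directors.
flow = ("203.0.113.7", "192.0.2.10", "tcp", 51514, 443)
leaf = pick_next_hop(flow, ["leaf-a", "leaf-b"])
director = pick_next_hop(flow, [f"{leaf}-director-1", f"{leaf}-director-2"])
```

When a director withdraws its route, it simply disappears from `next_hops` on the switch it was peering with, and flows are spread over the remaining entries.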

Script checking health of glb-director + ExaBGP, is it open sourced as well?

It isn't since it's a bit specific to our environment, but ExaBGP does include a healthcheck script that can be used to do simple healthchecking. The simplest example might be to run it with systemctl status glb-director as the check script, but obviously it could be extended to check other things as well.
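One way to extend a simple check like this is to require several consecutive results before changing state, so a single flaky check does not flap the route. The sketch below shows that idea. It is not ExaBGP's healthcheck implementation, and `command_ok`, `Debouncer`, and the `rise`/`fall` thresholds are names chosen for this example. `systemctl status` exits non-zero when the unit is not active, which is what makes it usable as a check command.

```python
import subprocess

# Assumed check command, taken from the suggestion above.
CHECK_CMD = ["systemctl", "status", "glb-director"]


def command_ok(cmd):
    """Run cmd quietly and report whether it exited with status 0."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        return False
    return result.returncode == 0


class Debouncer:
    """Flip state only after `rise` successes or `fall` failures in a row."""

    def __init__(self, rise=3, fall=2, up=False):
        self.rise = rise
        self.fall = fall
        self.up = up
        self.streak = 0  # consecutive results that disagree with self.up

    def update(self, ok):
        """Feed one check result and return the (possibly updated) state."""
        if ok == self.up:
            self.streak = 0
        else:
            self.streak += 1
            needed = self.rise if ok else self.fall
            if self.streak >= needed:
                self.up = ok
                self.streak = 0
        return self.up
```

A polling loop would call `state.update(command_ok(CHECK_CMD))` on each tick and announce or withdraw the route whenever the returned state changes.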

EriGWorld commented 3 years ago

Closing this issue for now! If needed, it can be re-opened to continue the discussion.