oxidecomputer / maghemite

A routing stack written in Rust.
Mozilla Public License 2.0

Need mechanism to avoid blackholing when DDM path between Transit Router and Server Router is lost #368

Open taspelund opened 1 month ago

taspelund commented 1 month ago

Server Routers act solely as stub routers and cannot be used as a transit node for traffic originating from another DDM router. In the current Oxide topology (2 sidecars/Transit Routers + N gimlets/Server Routers), a single backplane link failure can result in blackholing of traffic. i.e. If the link between Sled 0 and Switch 0 goes down, traffic destined for Sled 0 that arrives at Switch 0 will be lost.

 ┌──────────┐   ┌──────────┐
 │          │   │          │
 │ Switch 0 │   │ Switch 1 │
 └─┬─────┬──┘   └──────┬──┬┘
   │     │             │  │ 
   │     └──────────┐  │  │ 
   │                │  │  │ 
   x     ┌──────────┼──┘  │ 
   │     │          │     │ 
  ┌┴─────┴─┐      ┌─┴─────┴┐
  │ Sled 0 │      │ Sled 1 │
  └────────┘      └────────┘

This happens as a result of multiple factors coinciding.

  1. The physical topology inside the Oxide rack is a 3-stage Clos that uses the "spine" layer as the exit of the fabric. This means there are no alternative paths from an exit to a leaf node that don't cross additional links (e.g. spine0 -> leaf0 = 1 link, vs spine0 -> leaf1 -> spine1 -> leaf0 = 3 links).
  2. DDM does not allow Server Routers to be used for transit, which preserves the "valley free" property of the network by disallowing the 3-link routing path mentioned in the previous point (see the sketch after this list). We also likely wouldn't want to use Server Routers for transit because of the implications for gimlet CPU load, network bandwidth, etc.
  3. The exit nodes effectively perform Northbound aggregation of individual External IPs via BGP advertisements. The External IPs are assigned to instances via omicron and can/will be dispersed across the Overlay, effectively partitioning the External IP subnets used in the rack. Without exposing the dis-aggregation of these External IPs (advertising a /32 or /128 route for each External IP in use) Northbound via BGP, there is no way for the Northbound network to see or react to this failure.
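
To make factors 1 and 2 concrete, here is a minimal sketch of the transit constraint (hypothetical types, not ddmd's actual data model) showing why the 3-link detour from factor 1 is rejected:

```rust
#[derive(Clone, Copy, PartialEq)]
enum RouterKind {
    Transit, // sidecar / switch
    Server,  // gimlet sled, stub only
}

/// A path is usable here only if every *intermediate* hop is a
/// Transit Router; the endpoints may be Server Routers.
fn is_valley_free(path: &[RouterKind]) -> bool {
    path.iter()
        .skip(1)                             // first hop may be anything
        .take(path.len().saturating_sub(2))  // last hop may be anything
        .all(|k| *k == RouterKind::Transit)
}

fn main() {
    use RouterKind::*;
    // spine0 -> leaf0: fine, 1 link, no intermediate hops.
    assert!(is_valley_free(&[Transit, Server]));
    // spine0 -> leaf1 -> spine1 -> leaf0: rejected, because leaf1 (a
    // Server Router) would have to transit traffic it didn't originate.
    assert!(!is_valley_free(&[Transit, Server, Transit, Server]));
}
```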

We need a mechanism to properly handle this failure case.

taspelund commented 1 month ago

Some ideas on mechanisms:

  1. Add BGP support for dynamically learning and exporting host routes covering active External IPs (leaking dis-aggregated info northbound)
  2. Enable the use of DDM over front-panel ports. Today this could be used to connect Switch0 and Switch1 in the same rack (changing the topology to add a less-preferred path used only in failure conditions, sketched below), but it could become a generalized mechanism eventually reused for multi-rack.
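
For (2), a minimal sketch (hypothetical types, not ddmd internals) of the behavior the inter-switch link buys us, assuming an ordinary shortest-path preference keeps the new path failure-only:

```rust
struct DdmPath {
    next_hop: &'static str,
    hop_count: u8,
}

/// Pick the surviving candidate path with the fewest hops; `up` stands
/// in for whatever liveness signal marks a path as usable.
fn best_path<'a>(
    candidates: &'a [DdmPath],
    up: impl Fn(&DdmPath) -> bool,
) -> Option<&'a DdmPath> {
    candidates.iter().filter(|p| up(p)).min_by_key(|p| p.hop_count)
}

fn main() {
    // Paths from Switch 0 to Sled 0's underlay prefix.
    let paths = [
        DdmPath { next_hop: "sled0-rear", hop_count: 1 },    // direct backplane link
        DdmPath { next_hop: "switch1-front", hop_count: 2 }, // via inter-switch link
    ];

    // Healthy: the direct backplane link wins.
    assert_eq!(best_path(&paths, |_| true).unwrap().next_hop, "sled0-rear");

    // Backplane link to Sled 0 down: traffic shifts through Switch 1
    // instead of being blackholed.
    let fallback = best_path(&paths, |p| p.next_hop != "sled0-rear").unwrap();
    assert_eq!(fallback.next_hop, "switch1-front");
}
```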

For (1) I imagine we could enable this either without, or in addition to, an announce-set. If we enable dynamic host advertisements in addition to an announce-set, we could consider dynamically adding the NO_EXPORT (or possibly NOPEER) community to these host routes to limit how far they propagate into the wider external network. However, if there is no announce-set covering the External IPs, we would want the host routes to propagate through the greater external network and likely would not want to stamp them with NO_EXPORT or NOPEER.
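
A rough sketch of that export policy (hypothetical types and functions, not maghemite's actual BGP code), using the well-known community values from RFC 1997 and RFC 3765:

```rust
use std::net::Ipv4Addr;

// Well-known community values: NO_EXPORT from RFC 1997, NOPEER from RFC 3765.
const NO_EXPORT: u32 = 0xFFFF_FF01;
#[allow(dead_code)]
const NOPEER: u32 = 0xFFFF_FF04;

struct HostRoute {
    prefix: Ipv4Addr, // advertised as a /32
    communities: Vec<u32>,
}

/// `covered_by_announce_set` stands in for a lookup against the
/// configured announce-set prefixes.
fn export_host_route(external_ip: Ipv4Addr, covered_by_announce_set: bool) -> HostRoute {
    let mut communities = Vec::new();
    if covered_by_announce_set {
        // An aggregate already carries reachability upstream, so scope
        // the /32 to adjacent peers. NOPEER would be the looser option.
        communities.push(NO_EXPORT);
    }
    // Otherwise leave the route unstamped so it propagates through the
    // greater external network.
    HostRoute { prefix: external_ip, communities }
}

fn main() {
    // Covered by an announce-set aggregate: scope the host route.
    let scoped = export_host_route(Ipv4Addr::new(192, 0, 2, 10), true);
    assert_eq!(scoped.communities, vec![NO_EXPORT]);

    // No covering aggregate: let it propagate freely.
    let open = export_host_route(Ipv4Addr::new(192, 0, 2, 11), false);
    assert!(open.communities.is_empty());
    println!("{}/32 -> communities {:?}", open.prefix, open.communities);
}
```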