Closed: rcgoodfellow closed this issue 9 months ago.
I gave this a bit more thought this evening. We may not need to track multiple boundary services addresses in OPTE. If we use an anycast address for boundary services, then we get route availability tracking for free from `ddmd`. If an upstream sidecar goes away, so does the ddm peer, and the local `ddmd` instance running on the sled will remove the routes going through that peer when it expires. Because `xde` is now choosing nexthops based on host routing tables, this will all just work automatically.
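To make the expiry behavior concrete, here is a toy model of that mechanism. This is not the actual `ddmd` code; all names and fields are illustrative. The idea is simply that routes learned from a peer are withdrawn when that peer's liveness lapses:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative sketch of ddmd-style peer expiry (hypothetical types,
/// not the real ddmd API): routes are tagged with the peer that
/// advertised them, and dropped when that peer goes silent.
#[derive(Debug)]
struct Route {
    dest: String,    // e.g. the boundary services anycast prefix
    nexthop: String, // the peer (upstream sidecar) that advertised it
}

struct RouteTable {
    routes: Vec<Route>,
    last_seen: HashMap<String, Instant>, // peer -> last keepalive
    expiry: Duration,
}

impl RouteTable {
    /// Record that we just heard from a peer.
    fn heartbeat(&mut self, peer: &str) {
        self.last_seen.insert(peer.to_string(), Instant::now());
    }

    /// Drop every route whose advertising peer has expired.
    fn expire(&mut self, now: Instant) {
        let expiry = self.expiry;
        let last_seen = &self.last_seen;
        self.routes.retain(|r| {
            last_seen
                .get(&r.nexthop)
                .map(|t| now.duration_since(*t) < expiry)
                .unwrap_or(false)
        });
    }
}
```

With an anycast destination, each sidecar advertises the same prefix; when one peer expires, only its copy of the route disappears, and forwarding falls through to the surviving nexthops.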
We still have the issue of overlay/underlay path affinity to deal with in OPTE. But with an anycast-based approach, this becomes one of route/nexthop affinity, and not boundary services tunnel endpoint address affinity, which should be a much simpler overall mechanism to introduce.
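One plausible shape for that route/nexthop affinity mechanism is to pin each flow to a nexthop by hashing its 5-tuple, which is stable as long as the set of usable nexthops does not change. A minimal sketch follows; the names are illustrative and this is not the OPTE API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical flow key: the 5-tuple we would hash for affinity.
#[derive(Hash)]
struct FlowId<'a> {
    src: &'a str,
    dst: &'a str,
    proto: u8,
    sport: u16,
    dport: u16,
}

/// Pick a nexthop for a flow deterministically: the same flow always
/// maps to the same nexthop while the nexthop set is unchanged.
fn pick_nexthop<'a>(flow: &FlowId, nexthops: &'a [&'a str]) -> Option<&'a str> {
    if nexthops.is_empty() {
        return None;
    }
    let mut h = DefaultHasher::new();
    flow.hash(&mut h);
    Some(nexthops[(h.finish() as usize) % nexthops.len()])
}
```

A simple modulo over the nexthop list reshuffles many flows when the set changes; something like consistent hashing would reduce that churn, but the basic affinity property is the same.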
Currently a `VpcCfg` has a single `BoundaryServices` member, which in turn has a single target IP address as a tunnel endpoint (TEP) for sending packets to upstream networks. This works for a single-switch environment. However, the rack is multi-switch, and OPTE needs to be explicitly aware of that. Unlike sled-to-sled communications, which are constrained to the rack underlay network, we do not control the physical paths packets will take beyond the boundary services TEP. This means that we need to maintain overlay path affinity on the underlying physical network (e.g. there is no end-to-end ddm protocol ensuring packet-level balancing properties that preserve ordering across flows), and the only component currently in a position to do that is OPTE.
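For concreteness, the shape of the change might look something like the following, with `BoundaryServices` carrying a set of TEPs instead of a single address. These fields are hypothetical and do not reflect the actual `VpcCfg` definition in the source:

```rust
use std::net::Ipv6Addr;

/// Hypothetical: multiple tunnel endpoints instead of one.
struct BoundaryServices {
    // Before: a single `ip: Ipv6Addr` TEP.
    teps: Vec<Ipv6Addr>,
}

struct VpcCfg {
    boundary_services: BoundaryServices,
    // ...other members elided
}

impl VpcCfg {
    /// Choose a TEP for a flow; with a single TEP this degenerates to
    /// the current single-switch behavior.
    fn tep_for_flow(&self, flow_hash: u64) -> Option<Ipv6Addr> {
        let teps = &self.boundary_services.teps;
        if teps.is_empty() {
            None
        } else {
            Some(teps[(flow_hash as usize) % teps.len()])
        }
    }
}
```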
OPTE must be aware that there are multiple boundary services TEPs and must assign each flow to a particular TEP. OPTE must also respond to signals about TEP availability. For example, if a sidecar goes down due to a failure or a maintenance event, OPTE must migrate flows off that TEP to another one that is available. Presumably these signals will come from the control plane; I'll post a corresponding issue in Omicron about that.
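A rough sketch of that assignment-and-migration behavior, assuming the control-plane signal simply removes a TEP from the usable set. All identifiers here are hypothetical, not OPTE internals:

```rust
use std::collections::HashMap;

/// Illustrative flow-to-TEP table: flows get a sticky assignment, and
/// a TEP-down signal re-pins its flows onto surviving TEPs.
struct TepTable {
    available: Vec<String>,            // usable TEP addresses
    assignments: HashMap<u64, String>, // flow id -> assigned TEP
}

impl TepTable {
    /// Return the flow's TEP, assigning one if it has none yet.
    fn assign(&mut self, flow: u64) -> Option<String> {
        if let Some(tep) = self.assignments.get(&flow) {
            return Some(tep.clone());
        }
        let idx = (flow as usize) % self.available.len().max(1);
        let tep = self.available.get(idx)?.clone();
        self.assignments.insert(flow, tep.clone());
        Some(tep)
    }

    /// Control-plane signal: a TEP went away; migrate its flows.
    fn tep_down(&mut self, tep: &str) {
        self.available.retain(|t| t != tep);
        let orphaned: Vec<u64> = self
            .assignments
            .iter()
            .filter(|(_, t)| t.as_str() == tep)
            .map(|(f, _)| *f)
            .collect();
        for f in orphaned {
            self.assignments.remove(&f);
            self.assign(f); // re-pin to a surviving TEP
        }
    }
}
```

Migrating a flow mid-stream does reorder it once; the point of the affinity mechanism is that this only happens on an explicit availability event, not on every packet.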