bnaecker opened 2 years ago
One other thought here. In https://github.com/oxidecomputer/omicron/issues/823, we've been tracking some way of knowing whether any given bootstrap agent is on a Scrimlet. In the product, we'll ultimately know this only when (and whether) a Tofino PCIe device comes up. In the meantime, using the presence of config-rss.toml has been a stand-in. @smklein Any objections to making that more formal as part of this?
One of the details of this is that we'll need to add an is_scrimlet column to the sled table. Nexus will have to find both of the agents on the Scrimlets in the same rack as the instance being launched, since it needs to forward the new external IP address mappings to both Sidecars.
Related to that: the dpd code currently assumes that the sidecar is already up and running when the daemon is launched. If it doesn't find the /dev/tofino device, the daemon shuts down. This works as long as sled-agent doesn't launch the dendrite service on a scrimlet unless/until that device comes up. If sled-agent launches the service immediately, then the daemon will have to block rather than exiting if it doesn't find the device.
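For illustration, here's a minimal sketch of the "block rather than exit" behavior (this is not dpd's actual startup code; the device path is the one mentioned above, and the polling interval is arbitrary):

```rust
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Hypothetical startup helper: rather than exiting when the Tofino device
/// node is missing, poll until it appears. The 5-second interval is an
/// arbitrary choice for illustration.
fn wait_for_tofino(device: &Path) {
    while !device.exists() {
        eprintln!("{} not present; waiting...", device.display());
        thread::sleep(Duration::from_secs(5));
    }
    eprintln!("{} found; continuing startup", device.display());
}

fn main() {
    wait_for_tofino(Path::new("/dev/tofino"));
    // ... remainder of daemon initialization would happen here ...
}
```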
@Nieuwejaar Thanks, that's helpful. We can block starting the dpd zone on the existence of the /dev/tofino device. But what happens if that device goes away at some point? Does dpd abort? Does it continue but fail requests on its HTTP server?
I'd like to have Nexus be fully independent of whatever mechanism sled agent chooses - so if sled agent uses config-rss.toml as the knob, fine, but we should make the mechanism for notifying Nexus "independent" of how we make the "scrimlet/gimlet" call.
For example, when Sled Agent makes the call to Nexus (to upsert a row to the sled table): this could have some auxiliary info identifying is_scrimlet: bool.
Then, if we shift away from config-rss.toml towards a different approach, we can just change the source of how sled agent populates this value.
Yeah, for sure. I was assuming sled-agent would have some is_scrimlet() method it calls, which would currently return true iff there's a config-rss.toml file. But that bool would be forwarded to Nexus in exactly the call you've linked, as a separate variable.
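Roughly, something like this sketch (the names and types here are hypothetical, not the real sled-agent or Nexus definitions):

```rust
use std::net::SocketAddrV6;
use std::path::Path;

/// Hypothetical stand-in for the "are we a Scrimlet?" check. Today it could
/// simply test for the presence of config-rss.toml (the caller supplies the
/// path); later it could be swapped for detection of the Tofino PCIe device
/// without changing anything on the Nexus side.
fn is_scrimlet(config_rss_path: &Path) -> bool {
    config_rss_path.exists()
}

/// Hypothetical shape of the sled upsert request to Nexus; the real types
/// live in omicron. The point is only that the flag travels as its own
/// field, decoupled from how sled-agent derived it.
#[allow(dead_code)]
struct SledAgentStartupInfo {
    sled_address: SocketAddrV6,
    is_scrimlet: bool,
}
```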
If the device goes away, then dpd will (should?) crash. Currently, I believe that SMF will attempt to restart dpd, dpd will then crash immediately, and SMF will leave it in "maintenance" mode. Not ideal.
If the device goes away, then we should prevent dpd from restarting until it comes back. Continuing to respond sanely to API requests would be easy, but would also perpetuate the illusion that something was actually working. We kinda touched on this in the meeting today, when Luqman raised the issue of migration mechanics. The failure/resurrection of a sidecar would be the trigger to kick off the migration events.
It's much more helpful to dependent services to have an API server that is up and returns a 500 when it can't process a request than one that is down, cascading that failure. The service will still be listed in DNS, and trying to design a system to correctly enable/disable it seems like quite a lot of extra work given the SMF maintenance-mode behavior.
When we have other entities in a distributed system whose dependents come and go, they just handle it and communicate upstack what's down, rather than relying on something to always be there. I realize this is a bit different: the daemon initialization and state tracking are very different, and we may have to figure out a way to get dpd into a callback notification when the instance wants to close so we can clean up refs. Still, it seems worth considering. On the other hand, if it's going to be up to sled agent to insert the device into the zone every time, maybe we'll have an easier place to enable and disable the daemon, and that'll be OK. But I think it'll still be weird, since we'll have things that want to create TCP connections and make requests against this, and it seems healthier and easier to diagnose what's going on if we can get a semantic 500 saying there's no Tofino here, versus TCP connection timeouts.
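As a sketch of what that could look like at the handler level (framework details elided; this is not dpd's actual code, and the status code choice is illustrative):

```rust
/// Hypothetical handler-level behavior: if the Tofino device has gone away,
/// answer API requests with an explicit error rather than letting clients
/// hit TCP connection timeouts. Whether that's a 500 or a 503 is a detail;
/// the point is the server stays up and says *why* it can't act.
fn handle_nat_entry_add(device_present: bool) -> Result<(), (u16, String)> {
    if !device_present {
        return Err((503, String::from("no Tofino ASIC present on this switch")));
    }
    // ... program the NAT ingress table entry here ...
    Ok(())
}

fn main() {
    match handle_nat_entry_add(false) {
        Ok(()) => println!("entry added"),
        Err((status, msg)) => println!("request failed: {status} {msg}"),
    }
}
```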
it seems healthier and easier to diagnose what's going on if we can get a semantic 500 saying there's no Tofino here, versus TCP connection timeouts.
I hadn't thought of that, but it seems obvious now that you've said it. I'm sold.
I was just rereading the issue description and noticed "the rack IPv6 network". In a single-rack environment, we need some way to know which of the two sidecars owns the port with the guest's external IP. In a multi-rack environment, we'll need some way to identify the rack on which its external IP exists.
Based on the proposed API in RFD 267, like this example, a Sidecar port (identified by router and device in the API path) is given an address in a subnet, along with the IP of the gateway, when creating routes that are reachable from that port.
Later in that example, an IP pool is created, and an address range within that subnet is provided. It's not really clear to me how to implement this efficiently or store it all in the control plane database, but technically we have all the information we need there to describe the relationship between a guest external IP address, the pool it's derived from, and the Sidecar port(s) through which traffic destined for that guest must arrive.
While multi-rack is something I've not thought much about in this context, the rack ID is also part of that API path, so we can link up the above pieces with a rack as well.
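For illustration only, here is one way those relationships could be written down; these are hypothetical, simplified types, not the actual control plane schema:

```rust
use std::net::Ipv4Addr;

/// Simplified, hypothetical records (not the actual omicron schema) showing
/// the relationship described above: a Sidecar port is addressed out of some
/// subnet, an IP pool carves a range out of that subnet, and each guest
/// external IP is drawn from a pool.
#[allow(dead_code)]
struct SidecarPort {
    rack_id: String,            // "rack" in the RFD 267 API path
    switch: String,             // "router"/"device" in the API path
    subnet: (Ipv4Addr, u8),     // address/prefix configured on the port
    gateway: Ipv4Addr,          // gateway used for routes reachable via this port
}

#[allow(dead_code)]
struct IpPool {
    name: String,
    range: (Ipv4Addr, Ipv4Addr), // range within a port's subnet
}

#[allow(dead_code)]
struct GuestExternalIp {
    ip: Ipv4Addr,
    pool: String,               // pool the address was allocated from
}
```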
The mapping we are talking about here is
(external-ip, l4-port-begin, l4-port-end) -> (sled-underlay-ip, vpc-vni, guest-mac)
This mapping is realized concretely as the NAT ingress table in the P4 code.
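Spelled out as types, the entry looks roughly like this (a sketch only; the actual dpd API and P4 table definitions are authoritative):

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Key for the NAT ingress mapping sketched above: an external IP plus an
/// L4 port range. (A v6 external address would be analogous.)
#[allow(dead_code)]
struct NatIngressKey {
    external_ip: Ipv4Addr,
    l4_port_begin: u16,
    l4_port_end: u16,
}

/// Target of the mapping: enough to re-encapsulate an inbound packet and
/// deliver it to the right guest on the right sled.
#[allow(dead_code)]
struct NatIngressTarget {
    sled_underlay_ip: Ipv6Addr,
    vpc_vni: u32,       // Geneve VNI (24 bits in practice)
    guest_mac: [u8; 6],
}
```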
We cannot tie the need for this table entry to the assignment of sidecar IP addresses. Consider the case where an entire subnet of external IPs is routed to the rack, and the IP addresses used between the sidecar and the upstream router are unrelated to that subnet. In the example below, the customer has designated 10.123.0.0/24 as an IP pool. A routing entry 10.123.0.0/24 -> 172.30.0.2 is added to the customer device to get packets destined to 10.123.0.0/24 to the sidecar.
     Customer
      Router                      Sidecar
 +=============+             +=============+
 |             |             |             |
 |   +-----+   |             |   +-----+   |
 |   |     |---+-------------+---|     |   |
 |   +-----+   |             |   +-----+   |
 |             |             |             |
 +=============+             +=============+
    172.30.0.1                  172.30.0.2

route: 10.123.0.0/24 -> 172.30.0.2
It's also becoming increasingly common to route IPv4 addresses over IPv6 link local addresses to avoid burning V4 addresses. RFC 5549.
The point I was trying to make is that the table entry has to land in a specific sidecar's Tofino tables. In a multi-rack environment, presumably only a subset of the racks will have external connectivity. Thus, we have to assume that any given guest's NAT traffic will go through a sidecar on a different rack, and something in omicron needs to track which sidecar that is.
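Purely as an illustration of that bookkeeping (omicron's actual representation may differ), the control plane would effectively need something like:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Hypothetical identifier for "the sidecar on rack X". In a multi-rack
/// deployment only some racks have external connectivity, so this may not
/// be a sidecar on the guest's own rack.
#[derive(Clone, Debug)]
#[allow(dead_code)]
struct SidecarId {
    rack: String,
    switch: String,
}

/// Illustrative bookkeeping: for each external address (or address range),
/// which sidecar's Tofino tables must hold the NAT ingress entry.
#[allow(dead_code)]
type NatLocationMap = HashMap<Ipv4Addr, SidecarId>;
```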
Yeah, totally. In a multirack setup, I think it makes sense for a sidecar to be designated as a "NAT-ing" sidecar via the API - as I'm not thinking of a good bullet-proof way to determine this dynamically based on some other information. We could use the presence of an egress routing configuration (static, BGP, etc...) as an indicator. However, a sidecar could be used purely for ingress without any egress and that strategy would fall apart.
@internet-diglett @FelixMcFelix Has this actually been completed now? I think so, but y'all have done most of the work, so you can answer better!
My understanding is that we're in a good place on the v4 front, but as you've indicated via #5090 we aren't yet there for v6.
Background
All guest instances will have private IP addresses in their VPC Subnet. To communicate with the outside world, they'll also need external addresses. These addresses ultimately come from operators, for example as part of setting up IP Pools, since they are addresses under the customer's control (e.g., public IPv4 addresses they own or addresses within their datacenter). We're currently passing these addresses to OPTE. When the guest makes an outbound network connection, OPTE will rewrite the source address to the guest's external address and encapsulate the packet for delivery to the switch. The P4 program running on the switch decapsulates this, and delivers it to the broader customer network.
On the way back in, the reverse process needs to happen: encapsulating the external packet in a rack-specific IPv6 packet, destined for the right sled. The Dendrite data-plane daemon, dpd, needs to know what the "right" sled is. This issue tracks the initial work communicating the external-IP-to-sled mapping out to dpd.
Initial thoughts
The control plane needs to communicate the mapping from external IP address to the sled "hosting" that address. This needs to happen in a few places: