bnaecker opened 2 years ago
One other thought here. In https://github.com/oxidecomputer/omicron/issues/823, we've been tracking some way of knowing whether any given bootstrap agent is on a Scrimlet. In the product, we'll ultimately know this only when (and whether) a Tofino PCIe device comes up. In the meantime, using the presence of config-rss.toml has been a stand-in. @smklein Any objections to making that more formal as part of this?
One of the details of this is that we'll need to add an is_scrimlet column to the sled table. Nexus will have to find both of the agents on the Scrimlets in the same rack as the instance being launched, since it needs to forward the new external IP address mappings to both Sidecars.
Related to that: the dpd code currently assumes that the sidecar is already up and running when the daemon is launched. If it doesn't find the /dev/tofino device, the daemon shuts down. This works as long as sled-agent doesn't launch the dendrite service on a scrimlet unless/until that device comes up. If sled-agent launches the service immediately, then the daemon will have to block rather than exiting if it doesn't find the device.
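For illustration, here's a minimal sketch of the "block rather than exit" behavior (this is not dpd's actual startup code; the device path is the one mentioned above, and the polling interval is arbitrary):

```rust
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Hypothetical startup helper: rather than exiting when the Tofino device
/// node is missing, poll until it appears. The 5-second interval is an
/// arbitrary choice for illustration.
fn wait_for_tofino(device: &Path) {
    while !device.exists() {
        eprintln!("{} not present; waiting...", device.display());
        thread::sleep(Duration::from_secs(5));
    }
    eprintln!("{} found; continuing startup", device.display());
}

fn main() {
    wait_for_tofino(Path::new("/dev/tofino"));
    // ... remainder of daemon initialization would happen here ...
}
```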
@Nieuwejaar Thanks, that's helpful. We can block starting the dpd zone on the existence of the /dev/tofino device. But what happens if that device goes away at some point? Does dpd abort? Does it continue but fail requests on its HTTP server?
I'd like to have Nexus be fully independent of whatever mechanism sled agent chooses - so if sled agent uses config-rss.toml as the knob, fine, but we should make the mechanism for notifying Nexus "independent" of how we make the "scrimlet/gimlet" call.
For example, when Sled Agent makes the call to Nexus (to upsert a row to the sled table): this could have some auxiliary info identifying is_scrimlet: bool.
Then, if we shift away from config-rss.toml towards a different approach, we can just change the source of how sled agent populates this value.
Yeah, for sure. I was assuming sled-agent would have some is_scrimlet() method it calls, which would currently return true iff there's a config-rss.toml file. But that bool would be forwarded to Nexus in exactly the call you've linked, as a separate variable.
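Roughly, something like this sketch (the names and types here are hypothetical, not the real sled-agent or Nexus definitions):

```rust
use std::net::SocketAddrV6;
use std::path::Path;

/// Hypothetical stand-in for the "are we a Scrimlet?" check. Today it could
/// simply test for the presence of config-rss.toml (the caller supplies the
/// path); later it could be swapped for detection of the Tofino PCIe device
/// without changing anything on the Nexus side.
fn is_scrimlet(config_rss_path: &Path) -> bool {
    config_rss_path.exists()
}

/// Hypothetical shape of the sled upsert request to Nexus; the real types
/// live in omicron. The point is only that the flag travels as its own
/// field, decoupled from how sled-agent derived it.
#[allow(dead_code)]
struct SledAgentStartupInfo {
    sled_address: SocketAddrV6,
    is_scrimlet: bool,
}
```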
If the device goes away, then dpd will (should?) crash. Currently, I believe that SMF will attempt to restart dpd, dpd will then crash immediately, and SMF will leave it in "maintenance" mode. Not ideal.
If the device goes away, then we should prevent dpd from restarting until it comes back. Continuing to respond sanely to API requests would be easy, but would also perpetuate the illusion that something was actually working. We kinda touched on this in the meeting today, when Luqman raised the issue of migration mechanics. The failure/resurrection of a sidecar would be the trigger to kick off the migration events.
It's much more helpful to dependent services to have an API server that is up and returns a 500 when it can't process a request than one that is down, cascading that failure. The service will still be listed in DNS, and trying to design a system to correctly enable/disable it seems like quite a lot of extra work given the SMF maintenance-mode behavior.
When we have other entities in a distributed system whose dependents come and go, they just handle it and communicate upstack what's down, rather than relying on something to always be there. I realize this is a bit different: the daemon initialization and state tracking are very different, and we may have to figure out a way to get dpd into a callback notification when the instance wants to close so we can clean up refs. Still, it seems worth considering. On the other hand, if it's going to be up to sled agent to insert the device into the zone every time, maybe we'll have an easier place to enable and disable the daemon, and that'll be OK. But I think it'll still be weird, since we'll have things that want to create TCP connections and make requests against this, and it seems healthier and easier to diagnose what's going on if we can get a semantic 500 saying there's no Tofino here, versus TCP connection timeouts.
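As a sketch of what that could look like at the handler level (framework details elided; this is not dpd's actual code, and the status code choice is illustrative):

```rust
/// Hypothetical handler-level behavior: if the Tofino device has gone away,
/// answer API requests with an explicit error rather than letting clients
/// hit TCP connection timeouts. Whether that's a 500 or a 503 is a detail;
/// the point is the server stays up and says *why* it can't act.
fn handle_nat_entry_add(device_present: bool) -> Result<(), (u16, String)> {
    if !device_present {
        return Err((503, String::from("no Tofino ASIC present on this switch")));
    }
    // ... program the NAT ingress table entry here ...
    Ok(())
}

fn main() {
    match handle_nat_entry_add(false) {
        Ok(()) => println!("entry added"),
        Err((status, msg)) => println!("request failed: {status} {msg}"),
    }
}
```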
it seems healthier and easier to diagnose what's going on if we can get a semantic 500 saying there's no Tofino here, versus TCP connection timeouts.
I hadn't thought of that, but it seems obvious now that you've said it. I'm sold.
I was just rereading the issue description and noticed "the rack IPv6 network". In a single-rack environment, we need some way to know which of the two sidecars owns the port with the guest's external IP. In a multi-rack environment, we'll need some way to identify the rack on which its external IP exists.
Based on the proposed API in RFD 267, like this example, a Sidecar port (identified by router and device in the API path) is given an address in a subnet, along with the IP of the gateway, when creating routes that are reachable from that port.
Later in that example, an IP pool is created, and an address range within that subnet is provided. It's not really clear to me how to implement this efficiently or store it all in the control plane database, but technically we have all the information we need there to describe the relationship between a guest external IP address, the pool it's derived from, and the Sidecar port(s) through which traffic destined for that guest must arrive.
While multi-rack is something I've not thought much about in this context, the rack ID is also part of that API path, so we can link up the above pieces with a rack as well.
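For illustration only, here is one way those relationships could be written down; these are hypothetical, simplified types, not the actual control plane schema:

```rust
use std::net::Ipv4Addr;

/// Simplified, hypothetical records (not the actual omicron schema) showing
/// the relationship described above: a Sidecar port is addressed out of some
/// subnet, an IP pool carves a range out of that subnet, and each guest
/// external IP is drawn from a pool.
#[allow(dead_code)]
struct SidecarPort {
    rack_id: String,            // "rack" in the RFD 267 API path
    switch: String,             // "router"/"device" in the API path
    subnet: (Ipv4Addr, u8),     // address/prefix configured on the port
    gateway: Ipv4Addr,          // gateway used for routes reachable via this port
}

#[allow(dead_code)]
struct IpPool {
    name: String,
    range: (Ipv4Addr, Ipv4Addr), // range within a port's subnet
}

#[allow(dead_code)]
struct GuestExternalIp {
    ip: Ipv4Addr,
    pool: String,               // pool the address was allocated from
}
```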
The mapping we are talking about here is
(external-ip, l4-port-begin, l4-port-end) -> (sled-underlay-ip, vpc-vni, guest-mac)
This mapping is realized concretely as the NAT ingress table in the P4 code.
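Spelled out as types, the entry looks roughly like this (a sketch only; the actual dpd API and P4 table definitions are authoritative):

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Key for the NAT ingress mapping sketched above: an external IP plus an
/// L4 port range. (A v6 external address would be analogous.)
#[allow(dead_code)]
struct NatIngressKey {
    external_ip: Ipv4Addr,
    l4_port_begin: u16,
    l4_port_end: u16,
}

/// Target of the mapping: enough to re-encapsulate an inbound packet and
/// deliver it to the right guest on the right sled.
#[allow(dead_code)]
struct NatIngressTarget {
    sled_underlay_ip: Ipv6Addr,
    vpc_vni: u32,       // Geneve VNI (24 bits in practice)
    guest_mac: [u8; 6],
}
```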
We cannot tie the need for this table entry to the assignment of sidecar IP addresses. Consider the case where an entire subnet of external IPs is routed to the rack, and the IP addresses used between the sidecar and the upstream router are unrelated to that subnet. In the example below, the customer has designated 10.123.0.0/24 as an IP pool. A routing entry 10.123.0.0/24 -> 172.30.0.2 is added to the customer device to get packets destined to 10.123.0.0/24 to the sidecar.
     Customer
      Router                      Sidecar
 +=============+             +=============+
 |             |             |             |
 |   +-----+   |             |   +-----+   |
 |   |     |---+-------------+---|     |   |
 |   +-----+   |             |   +-----+   |
 |             |             |             |
 +=============+             +=============+
    172.30.0.1                  172.30.0.2

route: 10.123.0.0/24 -> 172.30.0.2
It's also becoming increasingly common to route IPv4 addresses over IPv6 link local addresses to avoid burning V4 addresses. RFC 5549.
The point I was trying to make is that the table entry has to land in a specific sidecar's Tofino tables. In a multi-rack environment, presumably only a subset of the racks will have external connectivity. Thus, we have to assume that any given guest's NAT traffic will go through a sidecar on a different rack, and something in omicron needs to track which sidecar that is.
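Purely as an illustration of that bookkeeping (omicron's actual representation may differ), the control plane would effectively need something like:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Hypothetical identifier for "the sidecar on rack X". In a multi-rack
/// deployment only some racks have external connectivity, so this may not
/// be a sidecar on the guest's own rack.
#[derive(Clone, Debug)]
#[allow(dead_code)]
struct SidecarId {
    rack: String,
    switch: String,
}

/// Illustrative bookkeeping: for each external address (or address range),
/// which sidecar's Tofino tables must hold the NAT ingress entry.
#[allow(dead_code)]
type NatLocationMap = HashMap<Ipv4Addr, SidecarId>;
```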
Yeah, totally. In a multirack setup, I think it makes sense for a sidecar to be designated as a "NAT-ing" sidecar via the API - as I'm not thinking of a good bullet-proof way to determine this dynamically based on some other information. We could use the presence of an egress routing configuration (static, BGP, etc...) as an indicator. However, a sidecar could be used purely for ingress without any egress and that strategy would fall apart.
@internet-diglett @FelixMcFelix Has this actually been completed now? I think so, but y'all have done most of the work, so you can answer better!
My understanding is that we're in a good place on the v4 front, but as you've indicated via #5090 we aren't yet there for v6.
Background
All guest instances will have private IP addresses in their VPC Subnet. To communicate with the outside world, they'll also need external addresses. These addresses ultimately come from operators, for example as part of setting up IP Pools, since they are addresses under the customer's control (e.g., public IPv4 addresses they own or addresses within their datacenter). We're currently passing these addresses to OPTE. When the guest makes an outbound network connection, OPTE will rewrite the source address to the guest's external address and encapsulate the packet for delivery to the switch. The P4 program running on the switch decapsulates this, and delivers it to the broader customer network.
On the way back in, the reverse process needs to happen: encapsulating the external packet in a rack-specific IPv6 packet, destined for the right sled. The Dendrite data-plane daemon, dpd, needs to know what the "right" sled is. This issue tracks the initial work communicating the external-IP-to-sled mapping out to dpd.
Initial thoughts
The control plane needs to communicate the mapping from external IP address to the sled "hosting" that address. This needs to happen in a few places: