Want mechanisms for specifying external addresses that are allowed to talk to rack services

Most of the external rack networking is currently geared towards customers that have private IP ranges. For example, consider Nexus or external DNS. These services listen on an IP external to the rack, as opposed to the underlay, relying on OPTE and the switch to manage tunneling that traffic through the rack. We've sort of been assuming that the external IP customers give us for these services is in some private subnet, and that the customer themselves controls any program that can reach our services from that subnet.

However, we now have customers who have no on-premises infrastructure of their own to speak of, relying entirely on cloud vendors. They have no private IP ranges that they can delegate to us. Instead, they have a completely public IP space on the open Internet, within which we'll need to run Oxide services like Nexus.

To make things slightly more secure, customers will want to limit the degree to which any old program on the public Internet can talk to these rack services. Specifically, we need a way of specifying an allowlist of the source IPs that can make requests to Nexus or external DNS. That way, customers can set up a firewall-like policy for limiting traffic to only sources they control.

This work is pretty big, so this issue will be updated and extended as we get into implementation. But here are some initial pieces we'll need:

At the bottom, we would like to translate an allowlist into the standard OPTE firewall rules. We already have mechanisms for restricting traffic to known source IPs or subnets, which we want to leverage. Initially, we'll limit the filters that can be expressed to just source IPs or subnets, but that can still be translated into a firewall rule.
We need a way of targeting these rules at all the Oxide rack services. One possible way to express this is "the Oxide services project", which we have hard-coded in the rack and in which things like Nexus and external DNS live. That is useful partly because it avoids explicitly listing the services the rules apply to, so the rules automatically cover any new services that we add.
There must be a Nexus API for CRUD on these allowlists. While the exact interface isn't known yet, customers need to be able to specify the IPs, IP ranges, and subnets to add to the allowlist (or to remove from it). There will be some database tables backing this data.
RSS needs to be able to express these rules. We would like to have this protection in place from the first moment Nexus starts serving traffic, so it must be part of RSS.
These rules will need to be propagated as reconfigurator moves services around.
There needs to be a "fail closed" disposition, in that the actual OPTE rules should probably be phrased as a low-priority deny-everything rule, with an overriding rule with higher priority that allows specific IPs.
We should think carefully about how updates to these allowlists work. We want to make it very hard to accidentally cut the rack off from all external connectivity because a customer mistyped a subnet, for example. One idea that came up in a call was failing any request that would disallow the IP from which the request itself originates. Another was a deadman switch that asks users to confirm the new rules within a short time, although that might be too complicated to do at this point.
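
To make the fail-closed and self-lockout ideas above concrete, here is a minimal sketch. Everything in it is hypothetical: `Subnet`, `Rule`, `rules_for_allowlist`, `evaluate`, and `check_update` are illustrative names, not OPTE or Nexus types. It assumes that a numerically lower priority wins, and it handles IPv4 only to stay short.

```rust
use std::net::Ipv4Addr;

/// An IPv4 subnet; a single address is a /32. Assumes `prefix <= 32`
/// has already been validated. (Hypothetical type, IPv6 omitted.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Subnet {
    pub addr: Ipv4Addr,
    pub prefix: u8,
}

impl Subnet {
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        // A /0 mask is special-cased: shifting a u32 by 32 bits overflows.
        let mask = if self.prefix == 0 {
            0
        } else {
            u32::MAX << (32 - u32::from(self.prefix))
        };
        (u32::from(self.addr) & mask) == (u32::from(ip) & mask)
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Action {
    Allow,
    Deny,
}

/// Hypothetical rule shape. Real firewall rules also carry ports,
/// protocols, direction, and targets.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rule {
    /// Assumption: a numerically lower value takes precedence.
    pub priority: u16,
    /// `None` matches any source address.
    pub source: Option<Subnet>,
    pub action: Action,
}

pub const ALLOW_PRIORITY: u16 = 100;
pub const DENY_ALL_PRIORITY: u16 = u16::MAX;

/// One allow rule per entry, plus a lowest-priority deny-everything rule.
pub fn rules_for_allowlist(allowed: &[Subnet]) -> Vec<Rule> {
    let mut rules: Vec<Rule> = allowed
        .iter()
        .map(|s| Rule { priority: ALLOW_PRIORITY, source: Some(*s), action: Action::Allow })
        .collect();
    rules.push(Rule { priority: DENY_ALL_PRIORITY, source: None, action: Action::Deny });
    rules
}

/// The best-priority matching rule decides; no match at all means deny.
pub fn evaluate(rules: &[Rule], src: Ipv4Addr) -> Action {
    rules
        .iter()
        .filter(|r| r.source.map_or(true, |s| s.contains(src)))
        .min_by_key(|r| r.priority)
        .map_or(Action::Deny, |r| r.action)
}

/// Reject an update that would lock out the address making the request.
pub fn check_update(new_allowed: &[Subnet], requester: Ipv4Addr) -> Result<(), String> {
    match evaluate(&rules_for_allowlist(new_allowed), requester) {
        Action::Allow => Ok(()),
        Action::Deny => Err(format!("update would block the requesting address {requester}")),
    }
}
```

With an allowlist of `1.2.3.4/32` and `10.0.0.0/8`, a request from `8.8.8.8` falls through to the deny-everything rule, and `check_update` would refuse an update submitted from that address.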

Update time.

I've got the API mostly in place thanks to @ahl's help with schemars and friends. At the end of the day, we'll have a CLI interface like:

$ oxide system networking allowed-source-ips view 
{ "allow" : "any" }
$ oxide system networking allowed-source-ips update --allow list --ips ["1.2.3.4", "5.6.7.8/9"]
$ oxide system networking allowed-source-ips view 
{ "allow" : "list", "ips" : ["1.2.3.4", "5.6.7.8/9"] }
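
As a rough illustration of the data behind that view, the sketch below parses entries such as "1.2.3.4" and "5.6.7.8/9" using only the standard library. The names `Entry`, `AllowedSourceIps`, and `parse_allowlist` are hypothetical and are not the actual Nexus API types. Rejecting an empty explicit list is an assumption of this sketch, as is accepting a subnet whose host bits are set.

```rust
use std::net::IpAddr;
use std::str::FromStr;

/// One allowlist entry: a single address or an address with a prefix length.
#[derive(Debug, PartialEq)]
pub enum Entry {
    Ip(IpAddr),
    Subnet { addr: IpAddr, prefix: u8 },
}

impl FromStr for Entry {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.split_once('/') {
            // No slash: the whole string must be an IP address.
            None => s
                .parse::<IpAddr>()
                .map(Entry::Ip)
                .map_err(|e| format!("bad address {s:?}: {e}")),
            // "addr/prefix": validate both halves and the prefix range.
            Some((addr_str, prefix_str)) => {
                let addr = addr_str
                    .parse::<IpAddr>()
                    .map_err(|e| format!("bad address {addr_str:?}: {e}"))?;
                let prefix = prefix_str
                    .parse::<u8>()
                    .map_err(|e| format!("bad prefix {prefix_str:?}: {e}"))?;
                let max = if addr.is_ipv4() { 32 } else { 128 };
                if prefix > max {
                    return Err(format!("prefix /{prefix} is too long for {addr}"));
                }
                // Host bits are kept as written; a real implementation
                // might normalize or reject them.
                Ok(Entry::Subnet { addr, prefix })
            }
        }
    }
}

/// Mirrors the two shapes shown by the CLI: allow anything, or an explicit list.
#[derive(Debug, PartialEq)]
pub enum AllowedSourceIps {
    Any,
    List(Vec<Entry>),
}

/// Parse an explicit list. Assumption: an empty list is rejected, since it
/// would block every source.
pub fn parse_allowlist(ips: &[&str]) -> Result<AllowedSourceIps, String> {
    if ips.is_empty() {
        return Err("an explicit allowlist must not be empty".to_string());
    }
    ips.iter()
        .map(|s| s.parse::<Entry>())
        .collect::<Result<Vec<Entry>, String>>()
        .map(AllowedSourceIps::List)
}
```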

I spent some time with @andrewjstone and @jgallagher talking bootstore, and I think we can avoid this ever showing up there at all. They had the great point that Nexus is really two servers: an internal one and an external one. This list only applies to the external server, so we need it in place before that server starts. Conveniently, we have just such a point here:

https://github.com/oxidecomputer/omicron/blob/e810f2e0590d06a4593fca8d191a201442945b9c/nexus/src/lib.rs#L423-L424

At the point we start the first server, we have all the information we need to form the OPTE firewall rules that implement this policy. That will come from RSS -> sled-agent bootstrap server -> internal Nexus server -> CRDB. At that point, before we start the external Dropshot server itself, Nexus can ask the sled-agent to insert the OPTE firewall rules that implement this policy on the relevant ports.

We may actually want to do this here:

https://github.com/oxidecomputer/omicron/blob/e810f2e0590d06a4593fca8d191a201442945b9c/nexus/src/lib.rs#L136-L144

specifically right after the call to await_rack_initialization(). There, we're sure to have all the data in CRDB (well, I can make sure we do...), so Nexus can form the required rules and send them to their "managing" sled-agents for plumbing.
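
The ordering described here can be sketched as follows. This is not the actual Nexus startup code: the `Startup` trait, its methods other than the `await_rack_initialization` name, and the `Recorder` test double are all hypothetical stand-ins. Refusing to start the external server when the rules cannot be installed is an assumption that follows from the fail-closed goal.

```rust
/// Hypothetical hooks standing in for the real Nexus startup steps.
pub trait Startup {
    fn start_internal_server(&mut self);
    fn await_rack_initialization(&mut self);
    /// Read the allowlist that RSS handed off and Nexus stored in CRDB.
    fn load_allowlist(&mut self) -> Vec<String>;
    /// Ask the managing sled-agent to plumb the firewall rules.
    fn install_firewall_rules(&mut self, allowlist: &[String]) -> Result<(), String>;
    fn start_external_server(&mut self);
}

/// The external server starts only after the rules are in place.
pub fn start_nexus<S: Startup>(s: &mut S) -> Result<(), String> {
    s.start_internal_server();
    s.await_rack_initialization();
    let allowlist = s.load_allowlist();
    // Assumption: if the rules cannot be installed, bail out before
    // the external server ever accepts traffic.
    s.install_firewall_rules(&allowlist)?;
    s.start_external_server();
    Ok(())
}

/// Test double that records the order in which the steps run.
#[derive(Default)]
pub struct Recorder {
    pub steps: Vec<&'static str>,
    pub fail_install: bool,
}

impl Startup for Recorder {
    fn start_internal_server(&mut self) {
        self.steps.push("internal");
    }
    fn await_rack_initialization(&mut self) {
        self.steps.push("rack-init");
    }
    fn load_allowlist(&mut self) -> Vec<String> {
        self.steps.push("load");
        vec!["1.2.3.4".to_string()]
    }
    fn install_firewall_rules(&mut self, _allowlist: &[String]) -> Result<(), String> {
        self.steps.push("install");
        if self.fail_install {
            Err("sled-agent unreachable".to_string())
        } else {
            Ok(())
        }
    }
    fn start_external_server(&mut self) {
        self.steps.push("external");
    }
}
```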

Want mechanisms for specifying external addresses that are allowed to talk to rack services #5640