Updating the technician port policy

mkeeter commented 5 months ago

Right now, the technician port (TP) is on a VLAN that grants it access to every SP (see configure_vlan_semistrict).

This is intentional: without this access, the only way to an SP is TP → Monorail → Tofino → Scrimlet → Tofino → Monorail → SP. If we don't have confidence that the Tofino + Scrimlet will come up with 100% reliability (and we don't!), a TP → SP path is necessary for system recovery.

However, it means that many of our security properties don't hold against an attacker with access to the TP:

- The management network can reflash any SP or host phase1 image
- Once on the management network, the console proxy gives access to the root shell of any sled

Of course, anyone with physical access to the rack could – instead of plugging into the TP – just remove a Gimlet and reflash the SP firmware with a programmer. Still, it's not great that TP → SP security is notably weaker than our TP → Scrimlet requirements, which use a special support Yubikey to authenticate the user.

So, what's the right policy here?

- One option is to lock down the TP and try very hard not to break the Monorail ↔ Tofino ↔ Scrimlet path. The SP reset watchdog (https://github.com/oxidecomputer/hubris/pull/1707) makes this slightly safer, but it's still somewhat dangerous. For example, we do our best to reboot the Sidecar's SP "live", without reconfiguring the FPGA; this means that a latent issue may not be seen on the initial reset. It's also possible for sled-agent bugs to make this path inaccessible, which happened literally last week.
- Another option is some kind of timeout, e.g. "if the Sidecar is in A0, the PCIe link to the Scrimlet is up, and we haven't seen any MGS messages in 5 minutes, reconfigure the TPs so that all SPs can be accessed".
- We could connect the TP only to the Sidecar SP. That seems like a weak solution, because the Sidecar SP could be trivially reprogrammed with firmware that opens up the TP on all other ports.
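
To make the timeout option concrete, here is a minimal Rust sketch of the decision it describes. Everything named here (`TpPolicy`, `RackState`, `choose_policy`, `MGS_TIMEOUT`) is hypothetical and not taken from Hubris. The sketch also assumes that any case the proposal does not mention, such as the Sidecar not being in A0, leaves the TP locked; that is a guess, not a settled policy.

```rust
use std::time::Duration;

/// Hypothetical TP policy; the names are illustrative, not Hubris types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TpPolicy {
    /// TP is locked down; SPs are reachable only through the Scrimlet path.
    Locked,
    /// TP is reconfigured so that all SPs can be accessed (recovery).
    AllSps,
}

/// Snapshot of the three conditions named in the timeout proposal.
#[derive(Debug, Clone, Copy)]
pub struct RackState {
    pub sidecar_in_a0: bool,
    pub pcie_link_up: bool,
    /// Time elapsed since the last MGS message was seen.
    pub since_last_mgs: Duration,
}

/// The "5 minutes" from the proposal.
pub const MGS_TIMEOUT: Duration = Duration::from_secs(5 * 60);

/// Open the TP only when the Sidecar is in A0, the PCIe link is up,
/// and MGS has been silent for at least `MGS_TIMEOUT`.
/// Assumption: every other combination stays locked.
pub fn choose_policy(state: &RackState) -> TpPolicy {
    let mgs_silent = state.since_last_mgs >= MGS_TIMEOUT;
    if state.sidecar_in_a0 && state.pcie_link_up && mgs_silent {
        TpPolicy::AllSps
    } else {
        TpPolicy::Locked
    }
}
```

A real implementation would also have to decide when (or whether) to lock the TP again once MGS traffic resumes; the sketch recomputes the policy from the current state and so re-locks immediately, which may or may not be what we want.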

jclulow commented 5 months ago

Is it possible to have the SP require some kind of asymmetric key based authentication for control plane agent messages? Presumably we're keen to do that anyway.

jgallagher commented 5 months ago

> Is it possible to have the SP require some kind of asymmetric key based authentication for control plane agent messages? Presumably we're keen to do that anyway.

That (or something like it) would be https://github.com/oxidecomputer/hubris/issues/723 / https://github.com/oxidecomputer/management-gateway-service/issues/152. Definitely something we want, not something we've designed or are actively working on at the moment.

lzrd commented 5 months ago

Connecting the TP to only the Sidecar sounds good, though it would only be an enhancement to a proper solution, not a solution in itself, since it limits or impedes some access. The rack is not going anywhere without the Sidecar and, though tedious, sleds can be swapped into a Scrimlet position if it comes to that.

jclulow commented 5 months ago

I guess the benefit of only being able to access the sidecar is that, at least in theory, we'd notice if someone then tried to access anything else? (Because they would have had to replace the software in the sidecar to do so)

jgallagher commented 5 months ago

TP connected only to the sidecar would only give us some recovery possibilities. If there's a bug in sled-agent that prevents the switch zone from coming up (as we had a week ago), access to the sidecar can't help us; we need the console proxy to the scrimlet's SP to recover, and swapping scrimlets might or might not help. I guess we could ourselves do the unfortunate thing in this case and replace the sidecar SP image with one that does let us access the scrimlet SP?

lzrd commented 5 months ago

It sounds like we need a "rack provisioning mode" and/or a "rack debug mode". The provisioning mode would be the default delivery mode. Once the rack is sufficiently provisioned in the DC, it would switch to production mode, with the TP highly restricted if not disabled. Transition out of production mode would have to go through the normal UI with proper authorization, and may remove some access to production data. The TP in production mode might still have access to basic monitoring and identification, if that can be secured.

So, some very complex flows are possible, but what is the MVP for having a secure TP when the customer has production workloads running?
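
As a starting point for the MVP question, the modes above can be sketched as a small state machine in Rust. All names (`RackMode`, `Request`, `TpAccess`, `transition`, `tp_access`) are hypothetical, and the `authorized` flag stands in for whatever "proper authorization through the normal UI" ends up meaning. Treating debug mode as reachable only from production, and as granting full TP access, are assumptions of this sketch.

```rust
/// Hypothetical rack modes from the proposal above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RackMode {
    /// Default delivery mode.
    Provisioning,
    /// Customer workloads running; TP highly restricted.
    Production,
    /// Entered from production with authorization (assumed).
    Debug,
}

/// What the TP may reach in a given mode (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TpAccess {
    Full,
    /// Basic monitoring and identification only.
    MonitoringOnly,
}

/// Requests that can change the mode.
#[derive(Debug, Clone, Copy)]
pub enum Request {
    FinishProvisioning,
    /// `authorized` is a placeholder for a real authorization check.
    EnterDebug { authorized: bool },
    LeaveDebug,
}

/// Apply a request, rejecting transitions the proposal does not allow.
pub fn transition(mode: RackMode, req: Request) -> Result<RackMode, &'static str> {
    match (mode, req) {
        (RackMode::Provisioning, Request::FinishProvisioning) => Ok(RackMode::Production),
        (RackMode::Production, Request::EnterDebug { authorized: true }) => Ok(RackMode::Debug),
        (RackMode::Production, Request::EnterDebug { authorized: false }) => {
            Err("leaving production mode requires authorization")
        }
        (RackMode::Debug, Request::LeaveDebug) => Ok(RackMode::Production),
        _ => Err("transition not allowed from this mode"),
    }
}

/// TP access per mode; production keeps only basic monitoring.
pub fn tp_access(mode: RackMode) -> TpAccess {
    match mode {
        RackMode::Provisioning | RackMode::Debug => TpAccess::Full,
        RackMode::Production => TpAccess::MonitoringOnly,
    }
}
```

The sketch leaves out the part about removing access to production data when leaving production mode, which is probably where most of the real complexity lives.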

jclulow commented 5 months ago

> TP connected only to the sidecar would only give us some recovery possibilities. If there's a bug in sled-agent that prevents the switch zone from coming up (as we had a week ago), access to the sidecar can't help us; we need the console proxy to the scrimlet's SP to recover, and swapping scrimlets might or might not help. I guess we could ourselves do the unfortunate thing in this case and replace the sidecar SP image with one that does let us access the scrimlet SP?

Yeah, I guess to be clear, when I was asking about the benefit, I meant: what is the benefit of making any change at all right now, if we're basically leaving the access mechanism totally open to someone who can get and apply new sidecar SP software (which, it must be said, is anyone, as it's open source!). If the benefit is "well, at least someone would have to replace the software" rather than "well, someone could get in without even having to do that", I suppose that's basically fine -- it's at least a potential change that the control plane or an operator could notice is occurring.

mkeeter commented 1 month ago

See RFD 492 for further discussion

mkeeter commented 4 days ago

Done in #1859

oxidecomputer / hubris

Updating the technician port policy #1713