softnpu could be packaged as a zone, like other omicron components

jordanhendricks commented 12 months ago

Deploying omicron on systems without real Oxide switch(es) requires softnpu, which simulates the switch. Today, softnpu is installed separately from other omicron services, by the "virtual hardware" scripts.

As long as engineers want to deploy omicron on systems without a real Oxide switch (which will probably be a very long time), we will need softnpu; as such, it should be treated as a first class component the way other zones are.

smklein commented 12 months ago

This was intentional, for what it's worth. SoftNPU is not meant to represent a control plane zone, it's supposed to represent a piece of hardware (the tofino switch) that should ambiently exist on the system. If we could have it show up as a PCI device and not even appear as a zone visible to the control plane, we would!

To be clear, if the sled agent managed the lifecycle of the zone explicitly, that would be the equivalent of the sled agent "being able to destroy or create the tofino out of thin air", which would be odd (and not realistic for our production environment).

@internet-diglett can provide more context, too -- but this was why, for example, it doesn't follow the "oxz_" naming convention used by other zones.

jordanhendricks commented 12 months ago

@smklein I didn't intend to suggest that sled agent would manage it. But I think there is still a middle ground between what exists today and that.

smklein commented 12 months ago

What did you mean by "it should be treated as a first class component the way other zones are"?

jordanhendricks commented 12 months ago

I'm not exactly sure yet what it would look like; I was mostly filing this as a placeholder from incremental work laid out in RFD 411. At a minimum, I could imagine the softnpu having an SMF manifest to start its services, or even being a "self-assembled zone" as I've heard you refer to it (as opposed to the create-softnpu-zone.sh script doing the setup).

For context, I've spent a lot of time debugging the softnpu zone not coming up properly due to cruft on the system (perhaps due to the virtual hw scripts).
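To make the "cruft on the system" failure mode concrete, here is a minimal sketch of a pre-flight check a self-assembling setup could run before creating the zone. It assumes the colon-delimited output of `zoneadm list -cp` (zoneid:zonename:state:...); the type and function names are hypothetical, not taken from omicron or npuzone, and the zone name is supplied by the caller.

```rust
/// One entry from `zoneadm list -cp`, whose lines are colon-delimited:
/// zoneid:zonename:state:zonepath:uuid:brand:ip-type
/// Only the name and state are kept here (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub struct ZoneEntry<'a> {
    pub name: &'a str,
    pub state: &'a str,
}

/// Parse the parsable zone listing into name/state pairs.
/// Lines with fewer than three fields (including blank lines) are skipped.
pub fn parse_zone_list(output: &str) -> Vec<ZoneEntry<'_>> {
    output
        .lines()
        .filter_map(|line| {
            let mut fields = line.split(':');
            let _id = fields.next()?;
            let name = fields.next()?;
            let state = fields.next()?;
            Some(ZoneEntry { name, state })
        })
        .collect()
}

/// Return the leftover zone, if any, that a fresh setup would collide with.
pub fn find_leftover<'z, 'a>(
    zones: &'z [ZoneEntry<'a>],
    zone_name: &str,
) -> Option<&'z ZoneEntry<'a>> {
    zones.iter().find(|z| z.name == zone_name)
}
```

A setup tool could call `find_leftover` on the listing and then either tear the stale zone down or fail with a clear message, instead of failing partway through creation.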

smklein commented 12 months ago

Gotcha. Yeah, having it self-assembling would be nice, for sure!

internet-diglett commented 11 months ago

> This was intentional, for what it's worth. SoftNPU is not meant to represent a control plane zone, it's supposed to represent a piece of hardware (the tofino switch) that should ambiently exist on the system. If we could have it show up as a PCI device and not even appear as a zone visible to the control plane, we would!

💯 - this is a concise summary of the discussion originally had when we added softnpu so that we could get rid of the OPTE hack.

@jordanhendricks

> I'm not exactly sure yet what it would look like; I was mostly filing this as a placeholder from incremental work laid out in RFD 411. At a minimum, I could imagine the softnpu having an SMF manifest to start its services, or even being a "self-assembled zone" as I've heard you refer to it (as opposed to the create-softnpu-zone.sh script doing the setup).

> For context, I've spent a lot of time debugging the softnpu zone not coming up properly due to cruft on the system (perhaps due to the virtual hw scripts).

Would an SMF manifest alleviate some of the problems you have faced? Do you have any examples of cruft on the system that kept the softnpu zone from deploying correctly? @rcgoodfellow did some transition work a while back to move to npuzone, which seems to have simplified the softnpu zone creation process, but I don't know if it resolves the concerns you currently have.

davepacheco commented 11 months ago

When I think of a "first-class component" here, I could imagine a lot of things:

it's deployed automatically with the rest of the system (if configured to do so, obviously -- which is not in production)
it's configured the way other components in the system are configured (for better and worse this basically comes from RSS today)
you can see what version is running using the same APIs and tools as you'd use for components like Nexus, CockroachDB, etc. (once we build those)
it can be updated using the same APIs and tools as you'd use to update those components, too (once we build those)
it has resource limits applied to it and the resources that it uses (disk, CPU, memory, etc.) get accounted-for like the resources used by these other components (eventually?)
its logs get rotated and archived
you could create a zone bundle to debug it
you can use the same tools to inspect/debug it (e.g., svcs, zoneadm, etc.)
it could export metrics that get saved in Clickhouse and could be viewed in the usual ways
etc.

That's not to say these are all important for this particular zone and it's certainly not all urgent! But the thought here was: we have invested and will continue to invest in infrastructure to do all of these things with (hopefully) minimal per-zone costs. We plan to use that for most components in the system. Can we use it for this one? The point about it not making sense to have sled agent create these out of thin air makes sense, but that feels like a small implementation constraint to incorporate (we already need parameters like this to determine how many of various components should exist in the system), not an architectural problem that prevents us from using the same underlying mechanisms?

Some of the above items do seem related to the more urgent pain points. I think the configuration and the extra deployment (and undeployment) steps have tripped people up a bunch. @rcgoodfellow mentioned at one point having a more unified configuration instead of separate files and steps. This might be an ignorant question, but are there deep reasons that softnpu couldn't be an Omicron package that gets configured in config-rss.toml and deployed by RSS to Sled Agents using the PUT /services endpoint? (This might be complicated enough to merit a call -- or misguided enough to not merit any discussion!)

rcgoodfellow commented 11 months ago

> are there deep reasons that softnpu couldn't be an Omicron package that gets configured in config-rss.toml and deployed by RSS

I think @smklein's comment covers this pretty well. Basically, softnpu is a hardware emulator, so think of it as hardware.

The npuzone tool was created to solve many of the pain points folks were having. It was integrated in #3576. It takes a lot of the complexity of deploying a softnpu device as a zone out of omicron and captures it in the npuzone tool. It also obviated a bunch of the manual configuration steps. I believe what's left to configure today is intrinsic to the external network omicron is running on:

Physical link on the machine to use for external connectivity.
The gateway IP to use (we'll use the default gateway if not specified)
The range of addresses to enable proxy ARP for, which is needed to contact services and instances in the rack if you plan on using IP addresses from a shared subnet (majority of home users).
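As a rough illustration of how small that remaining configuration surface is, the three settings could be modeled as below. This is a sketch under assumptions, not npuzone's actual interface: the struct, field, and method names are hypothetical, and it assumes IPv4 with the proxy ARP range expressed as an inclusive first/last pair.

```rust
use std::net::Ipv4Addr;

/// Hypothetical grouping of the three network-specific settings listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct ExternalNetConfig {
    /// Physical link on the machine used for external connectivity.
    pub physical_link: String,
    /// Gateway IP; `None` means "use the default gateway".
    pub gateway_ip: Option<Ipv4Addr>,
    /// First address of the inclusive range to enable proxy ARP for.
    pub proxy_arp_first: Ipv4Addr,
    /// Last address of the inclusive range to enable proxy ARP for.
    pub proxy_arp_last: Ipv4Addr,
}

impl ExternalNetConfig {
    /// Gateway to use: the configured one if present, else the supplied default.
    pub fn effective_gateway(&self, default_gateway: Ipv4Addr) -> Ipv4Addr {
        self.gateway_ip.unwrap_or(default_gateway)
    }

    /// Whether `addr` falls inside the inclusive proxy ARP range.
    pub fn in_proxy_arp_range(&self, addr: Ipv4Addr) -> bool {
        let a = u32::from(addr);
        u32::from(self.proxy_arp_first) <= a && a <= u32::from(self.proxy_arp_last)
    }
}
```

Keeping the gateway optional mirrors the "default gateway if not specified" behavior described in the list.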

One of the present gotchas that seems to be causing pain is that when you do an omicron-package --uninstall, you also need to run the create/destroy virtual hardware steps, because lingering state is left on the NPU. That's a bug we intend to fix.

On the first class component list:

> it's deployed automatically with the rest of the system (if configured to do so, obviously -- which is not in production)

> it's configured the way other components in the system are configured (for better and worse this basically comes from RSS today)

> it can be updated using the same APIs and tools as you'd use to update those components, too (once we build those)

> it has resource limits applied to it and the resources that it uses (disk, CPU, memory, etc.) get accounted-for like the resources used by these other components (eventually?)

> you could create a zone bundle to debug it

> it could export metrics that get saved in Clickhouse and could be viewed in the usual ways

Assuming that "system" in the first bullet refers to Omicron, I think these just don't make sense for a piece of emulated hardware, for the reasons @smklein mentions above.

> you can see what version is running using the same APIs and tools as you'd use for components like Nexus, CockroachDB, etc. (once we build those)

I think this would be something we'd build into Dendrite as a sort of ASIC version API. This would also cover real hardware. For example, if there were different revs of Tofino 2, we'd expose them here.
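Purely as a sketch of what such an ASIC version API might report (no such type is described in this thread; the enum, its variants, and the string formats are all hypothetical), one value could cover both the emulated and the real switch:

```rust
use std::fmt;

/// Hypothetical description of the ASIC behind the switch API.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AsicVersion {
    /// Emulated switch (SoftNPU), carrying the emulator's version string.
    SoftNpu { version: String },
    /// Real hardware; `revision` distinguishes silicon revisions.
    Tofino2 { revision: String },
}

impl AsicVersion {
    /// True when the ASIC is emulated rather than physical.
    pub fn is_emulated(&self) -> bool {
        matches!(self, AsicVersion::SoftNpu { .. })
    }
}

impl fmt::Display for AsicVersion {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AsicVersion::SoftNpu { version } => write!(f, "softnpu {}", version),
            AsicVersion::Tofino2 { revision } => write!(f, "tofino2 rev {}", revision),
        }
    }
}
```

Modeling the emulator as just another variant keeps callers from needing a separate code path for development systems.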

> its logs get rotated and archived

The logs currently get redirected to a file, but there is no rotation or archiving. I agree this could be nice. That said, SoftNPU does not do a whole lot of logging: since it's processing a stream of packets, observability is mostly on-demand through DTrace.

> you can use the same tools to inspect/debug it (e.g., svcs, zoneadm, etc.)

zoneadm works today. We could run the softnpu binary as a service inside the zone instead of just running it in the background with logs piped to a file. That would likely be a usability win.

@davepacheco @jordanhendricks @smklein @internet-diglett happy to have a quick call on this to provide more color etc.

davepacheco commented 11 months ago

> Basically, softnpu is a hardware emulator, so think of it as hardware. ... I think these just don't make sense for a piece of emulated hardware for the reasons @smklein mentions above.

I didn't mean to ignore this point but I don't follow why the function (emulating a piece of hardware) is relevant here. Is it just that this is not part of the shipping product? These functions (build, package, configuration, deployment, debugging) are all necessary, however they wind up implemented. The thought behind using the same APIs and tooling was that hopefully that'd be easy to do (maybe it's not) and in doing so we could leverage common efforts to provide just one set of tools for people to use. Part of my interest here is that I'd like us to be able to use the same mechanisms for any future dev-only components we have, especially if that allows us to consolidate efforts on the tooling to manage component lifecycles. (There may be other reasons why this is hard for softnpu but it doesn't sound like that's the objection here.)

Also to be clear I didn't know about npuzone or any plans in this area when we first talked about this idea so this isn't a criticism of anything we've done here. It may all be moot now if the pain points are addressed.

internet-diglett commented 11 months ago

I think to refocus the discussion, it's important to note that @jordanhendricks stated:

> I was mostly filing this as a placeholder from incremental work laid out in RFD 411.

I think we may have prematurely tunnel-visioned on a specific interpretation of the issue and accidentally missed the general theme.

Essentially, there's a lot of functionality jammed into bash scripts that are becoming increasingly brittle. I think the npuzone tooling has reduced this somewhat, and it falls in line with the theme of "do it in Rust instead of Bash please". So I'm now interpreting this issue as "could we package and configure the zone using some of our existing Rust-based libraries instead of a bunch of text files?", which I'm not sure npuzone does already.

oxidecomputer / omicron

softnpu could be packaged as a zone, like other omicron components #4131