oxidecomputer / omicron

Omicron: Oxide control plane

consider sled-agent implementation for stlouis#94 #2764

Open wesolows opened 1 year ago

wesolows commented 1 year ago

See discussion in oxidecomputer/stlouis#94, specifically https://github.com/oxidecomputer/stlouis/issues/94#issuecomment-1496594723. We need to collectively choose an architectural path forward here, both short-term and long-term, so one possible outcome is that this ticket will be closed without any work required in sled-agent. There are several options here:

  1. Do nothing in the short term, which will mean that service actions involving insertion of a PCIe device (SSD or SC) will require the Gimlet in question to be rebooted before the newly-attached device will function. This would require documentation in the form of a highly visible product-level erratum. In the long term, implement one of the remaining options.
  2. In the short term, have sled-agent detect the insertion of an SSD via hacks (in a path similar to existing/planned hacks for detection of configured storage devices) and configure the attachment point to bring the device online (see the sketch after this list). In this scenario, Sidecar attachment can be disregarded, as the short-term service action for SC-Scrimlet changes is already complex and expected to be infrequent. In the long term, implement one of the remaining options.
  3. In the short term, have system software forcibly enable all devices on hot-insertion. In the long term, implement one of the remaining options. While this can likely be done fairly quickly, it will contend with other host system software work and will require tradeoffs between the existing schedule and other work. Short-term risks are substantial due to inadequate downstack staffing.
  4. Provide a flexible mechanism to configure system software to automatically enable devices on hot-insertion, with a sysevent and/or FRU monitor based mechanism for detecting auto-enable failures and perhaps diagnose such failures as faults. Configure system software to do this automatically on all oxide arch implementations and/or as part of the Helios build process. Requires generic and Gimlet-specific topo work, including a FRU monitor (see RFD 360). Months of work; requires schedule slip if part of the MVP definition.
  5. Provide FRU monitoring mechanisms in fmd or other system software and provide a sysevent and/or additional interface for upstack software to consume it. As part of this, manage and propagate hot-insertion events into upstack software (sled-agent, on the oxide architecture) that is tasked with implementing policy. This could be done in conjunction with other non-sled-agent implementations serving the same purpose on other architectures. Months of OS work, plus additional sled-agent work; requires schedule slip if part of the MVP definition.
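
To make option 2 concrete, here is a minimal sketch of what the detection hack might look like, assuming sled-agent simply polls cfgadm and configures any attachment point whose occupant is still unconfigured. Scraping cfgadm's CLI output and the hardcoded binary path are illustrative assumptions, not actual omicron code; a real implementation would presumably go through libcfgadm or devinfo snapshots instead.

```rust
use std::process::Command;

/// Find attachment points whose receptacle is connected but whose
/// occupant is still unconfigured (i.e. a hot-inserted device that has
/// not been brought online), and configure each one.
fn configure_inserted_devices() -> std::io::Result<()> {
    // With no arguments, cfgadm lists attachment points in columns:
    // Ap_Id, Type, Receptacle, Occupant, Condition.
    let out = Command::new("/usr/sbin/cfgadm").output()?;
    for line in String::from_utf8_lossy(&out.stdout).lines().skip(1) {
        let cols: Vec<&str> = line.split_whitespace().collect();
        if cols.len() < 5 {
            continue;
        }
        let (ap_id, receptacle, occupant) = (cols[0], cols[2], cols[3]);
        if receptacle == "connected" && occupant == "unconfigured" {
            // Equivalent to running `cfgadm -c configure <ap_id>` by hand.
            Command::new("/usr/sbin/cfgadm")
                .args(["-c", "configure", ap_id])
                .status()?;
        }
    }
    Ok(())
}
```

Under option 2 a loop like this (or a devinfo-based equivalent) would run on a timer or some other trigger; the "hack" is in the detection side, since nothing currently notifies sled-agent that a device has arrived.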

Some additional hybrid and/or interim-path solutions may exist. The above does not specifically consider what happens if sled-agent (and/or other userland functionality, including userland system software if applicable) is unable to run; that needs additional design work. I plan to write a very brief RFD on this, so this isn't the place to choose our path.

Instead, I'm opening this ticket to track the potential for upstack software work in this area and to ensure adequate visibility for MVP definition vs. schedule vs. staffing priority calls, and for possible effects on other sled-agent engineering choices. As there is little documentation covering sled-agent's intended functions or architecture, I do not have good visibility into that aspect of this problem.

Note that this behaviour is not specific to Oxide hardware, so it technically exists on PC-based stand-ins as well; if that's considered an important environment going forward (possible but not recommended), the long-term solution should take that into consideration.

askfongjojo commented 1 year ago

For MVP, I'm more inclined to do nothing (option 1) because we'll likely have to shepherd/manage all HW and SW changes initially with the first few customers. Post-MVP, option 3 seems preferable (option 2 is framed as a hack and also comes with its own schedule tradeoffs), but aside from the schedule tradeoff, it is not clear whether option 3 has any downside from a functional perspective.

wesolows commented 1 year ago

One additional thought here: if we decide not to fix this before RR (or can't for some reason), we do have an alternative to rebooting sleds when a disk is hot-inserted. In particular, we noted in our call yesterday that giving operators a "reboot sled" API primitive isn't really consistent with the level of abstraction we're trying to provide. If we did construct such an API call, it would later have to be deprecated, which would be a breaking change for any operators that may have integrated that endpoint into their workflows. While communicating the uncommitted nature of such an API call is helpful, better still is to avoid the need for such breaking changes.

The alternative for addressing stlouis#94 would be to create a temporary API primitive to configure a disk slot (or all disk slots) using the libcfgadm interface in illumos -- this is the same thing that happens if one runs cfgadm on the command line. The appeal of this approach is that deprecating this API later on can be done without breaking anything: it simply becomes a nop once automatic onlining of hot-inserted disks is working properly.
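
As a rough illustration of what the backend of such a temporary primitive might amount to, here is a sketch that shells out to cfgadm rather than using libcfgadm directly; the slot-to-attachment-point mapping and the function names are hypothetical, not actual omicron code.

```rust
use std::process::Command;

/// Hypothetical mapping from a physical disk slot to its cfgadm
/// attachment point ID; the naming scheme here is an assumption.
fn ap_id_for_slot(slot: u8) -> String {
    format!("Slot{}", slot)
}

/// Backend for the proposed temporary API call: configure one disk slot,
/// exactly as `cfgadm -c configure <ap_id>` would do from the shell.
fn configure_disk_slot(slot: u8) -> std::io::Result<bool> {
    let ap_id = ap_id_for_slot(slot);
    let status = Command::new("/usr/sbin/cfgadm")
        .args(["-c", "configure", ap_id.as_str()])
        .status()?;
    Ok(status.success())
}
```

The appeal is exactly as described above: once automatic onlining of hot-inserted disks works, configuring an already-configured slot has nothing to do, so the endpoint degrades into a nop rather than a breaking change.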

At this point it looks like we'll probably end up with option 3 here at RR, but this entire area still needs more long-term architectural work. Discussion of the options is in RFD 384.

wesolows commented 1 year ago

Option 3 from the original discussion was implemented as oxidecomputer/stlouis#94, and RFD 384 was accordingly marked Committed when that was delivered. At some point we should revisit the ideal long-term solution, whatever it might be, but this should work on all rev D and newer Gimlets. If it's not working, there is either some other sled-agent issue or a remnant OS bug that's not in the database.