wesolows opened 1 year ago
For MVP, I'm more inclined to do nothing (option 1) because we'll likely have to shepherd/manage all HW and SW changes initially with the first few customers. Post-MVP, option 3 seems preferable (since option 2 is framed as a hack and also comes with its own schedule tradeoffs), but it is not clear whether there is any downside from a functional perspective, aside from the schedule tradeoff.
One additional thought here: if we decide not to fix this before RR (or can't for some reason), we do have an alternative to rebooting sleds when a disk is hot-inserted. In particular, we noted in our call yesterday that giving operators a "reboot sled" API primitive isn't really consistent with the level of abstraction we're trying to provide. If we did construct such an API call, it would later have to be deprecated, which would be a breaking change for any operators that may have integrated that endpoint into their workflows. While communicating the uncommitted nature of such an API call is helpful, better still is to avoid the need for such breaking changes.
The alternative for addressing stlouis#94 would be to create a temporary API primitive to configure a disk slot (or all disk slots) using the libcfgadm interface in illumos -- this is the same thing that happens if one runs cfgadm on the command line. The appeal of this approach is that deprecating this API later on can be done without breaking anything: it simply becomes a nop once automatic onlining of hot-inserted disks is working properly.
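As a rough sketch (not actual sled-agent code), the temporary primitive could simply shell out to `cfgadm` the same way an operator would at the command line; the function name and error handling below are hypothetical:

```rust
use std::process::Command;

/// Hypothetical helper behind a temporary sled-agent endpoint: configure a
/// single disk slot by its attachment point, equivalent to running
/// `cfgadm -c configure <ap_id>` by hand.
fn configure_disk_slot(ap_id: &str) -> std::io::Result<()> {
    let status = Command::new("/usr/sbin/cfgadm")
        .args(["-c", "configure", ap_id])
        .status()?;
    if status.success() {
        Ok(())
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("cfgadm exited with {status}"),
        ))
    }
}
```

Once automatic onlining of hot-inserted disks works, the same endpoint can keep accepting requests and simply do nothing, so callers never see a breaking change.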
At this point it looks like we'll probably end up with solution 3 here at RR, but this entire area still needs more long-term architectural work. Discussion of the options is in RFD 384.
Option 3 from the original discussion was implemented as oxidecomputer/stlouis#94, and RFD 384 was accordingly marked Committed when that was delivered. At some point we should revisit the ideal long-term solution, whatever it might be, but this should work on all rev D and newer Gimlets. If it doesn't, there is either some other sled-agent issue or a remnant OS bug that isn't in the database.
See discussion in oxidecomputer/stlouis#94, specifically https://github.com/oxidecomputer/stlouis/issues/94#issuecomment-1496594723. We need to collectively choose an architectural path forward here, both short-term and long-term, so one possible outcome is that this ticket will be closed without any work required in sled-agent. There are several options here:
Some additional hybrid and/or interim-path solutions may exist. The above does not specifically consider what happens if sled-agent (and/or other userland functionality, including userland system software if applicable) is unable to run; that needs additional design work. I plan to write a very brief RFD on this, so this isn't the place to choose our path. Instead, I'm opening this ticket to track the potential for upstack software work in this area and to ensure adequate visibility for MVP definition vs. schedule vs. staffing priority calls, as well as possible effects on other sled-agent engineering choices. As there is little documentation covering sled-agent's intended functions or architecture, I do not have good visibility into that aspect of this problem. Note that this behaviour is not specific to Oxide hardware, so it technically exists on PC-based stand-ins as well; if that's considered an important environment going forward (possible but not recommended), the long-term solution should take it into consideration.