jgallagher opened 1 year ago
Summarizing a discussion between John and myself:
The specific condition that caused this to happen in the dogfood rack was the fact that scrimlet 14 is treated as special and not considered part of the cluster during updates.
The typical flow is:
At step 4, the MGS running on scrimlet 16 starts serving the phase 2 trampoline image as well. However, while scrimlet 14 is being mupdated it is rebooted, which means its MGS can no longer serve phase 2 trampoline images. So at any given time, at most one MGS serves up the artifact.
Here is a scenario where we can hit this issue. "Revision 1" and "revision 2" stand for two versions of the Oxide TUF repo, which may be the same but at least share the same phase 2 trampoline image[^1].
This is quite contrived because in general we'd be mupdating the entire rack rather than a single sled or two. However, this scenario is possible if:
[^1]: A design goal of the phase 2 trampoline image (see RFD 345) is that it is updated more rarely than the main Oxide components. We aren't there today, since we're still picking up OS updates for now, but hope to be there in the future.
[^2]: This is mitigated if we suggest to customers to preferentially use one of the two scrimlets. This is going to depend on how customers set things up in the field: for example, if they use a jump box plugged into one of the technician ports, this is less of an issue than if they use a jump box plugged into both ports.
All in all, this is quite unlikely to happen in the field, but is still possible -- and we should be on the lookout for customer reports.
If this happens, it doesn't actually cause breakage -- the UI just stops updating for a while.
However, the specific step at which the UI stops updating -- "downloading installinator" -- is by far the longest step in the process, taking around 80-85% of the overall time to mupdate.
Currently, wicketd asks whatever MGS is running on `localhost` about the last offset fetched by the SP. (In reality it is the host fetching the data, via the SP.)
Instead, wicketd could ask the SP (via MGS) about the last offset fetched by the host. The SP can track this internally and expose it via an API. This alternative approach would solve the issue.
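A minimal, self-contained sketch of what that SP-side tracking could look like. The names here (`HostPhase2Progress`, `record_host_fetch`, `host_phase2_progress`) are invented for illustration; they are not the real SP or MGS API:

```rust
/// Progress of the host fetching the phase 2 image through the SP.
/// (Hypothetical type for illustration, not the real SP/MGS API.)
#[derive(Debug, Clone, Copy)]
struct HostPhase2Progress {
    /// Total size of the image being fetched, if known.
    image_size: u64,
    /// Highest offset the host has successfully read so far.
    offset: u64,
}

#[derive(Default)]
struct SpState {
    host_phase2: Option<HostPhase2Progress>,
}

impl SpState {
    /// Called each time the SP relays a chunk of the phase 2 image from
    /// whichever MGS it is attached to down to the host.
    fn record_host_fetch(&mut self, offset: u64, image_size: u64) {
        self.host_phase2 = Some(HostPhase2Progress { image_size, offset });
    }

    /// Exposed over the management network, so wicketd (via its local MGS)
    /// could query progress no matter which MGS is serving the data.
    fn host_phase2_progress(&self) -> Option<HostPhase2Progress> {
        self.host_phase2
    }
}

fn main() {
    let mut sp = SpState::default();
    // The host has fetched the first 4 MiB of a ~700 MiB image.
    sp.record_host_fetch(4 * 1024 * 1024, 700 * 1024 * 1024);
    println!("{:?}", sp.host_phase2_progress());
}
```

Because the SP is the component actually relaying the image to the host, it sees every chunk the host fetches regardless of which MGS instance is serving the data, so the wicketd on either scrimlet could report accurate progress.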
Thanks for writing that up!
I tentatively tagged this with MVP+1; I think right now we're unlikely to hit this in most environments, but we could make changes (like shipping multiple updates with the same installinator OS image) that would make it more likely.
Cool! For what it's worth, I think there's also cost to us running into this in various development environments. In dev/test, the update sequences are potentially a lot more varied. When we initially hit this, we all assumed something was wrong, that the process was hung, and we started trying to debug it. (not saying this is super high priority either)
If we follow this sequence:
it's possible that the mupdate in step 3 will show no progress information during the "Downloading installinator, waiting for it to start" step, even though installinator is actually being downloaded successfully.
As of the beginning of the mupdate via scrimlet 16 in step 3, the MGS instances on both scrimlets have the installinator OS image and are able to serve it to any SP that asks for it over the management network. When the SP fetches data from MGS, it sends a multicast packet and will "attach" to whichever MGS instance it finds first, unless that instance stops responding (in which case it will start looking for either MGS instance again).
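To make that attach behavior concrete, here's a minimal, self-contained sketch of the logic as described above. The real logic lives in the SP firmware; the types and method names below are invented for illustration:

```rust
// Minimal sketch (not the real SP code) of the attach behavior: the SP
// broadcasts a discovery packet, attaches to whichever MGS answers first,
// and only re-discovers if that instance stops responding.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Mgs {
    Scrimlet14,
    Scrimlet16,
}

struct Sp {
    attached: Option<Mgs>,
}

impl Sp {
    /// `responders` are the MGS instances that answered the multicast
    /// discovery packet, in the order their replies arrived.
    fn discover(&mut self, responders: &[Mgs]) {
        if self.attached.is_none() {
            self.attached = responders.first().copied();
        }
    }

    /// If the currently attached MGS stops responding, drop it and
    /// discover again on the next fetch attempt.
    fn mgs_timed_out(&mut self) {
        self.attached = None;
    }
}

fn main() {
    let mut sp = Sp { attached: None };

    // Both MGS instances have the image; scrimlet 14 happens to reply first,
    // so the SP fetches from it -- and the wicketd on scrimlet 16 sees no
    // progress, because it only asks its own local MGS.
    sp.discover(&[Mgs::Scrimlet14, Mgs::Scrimlet16]);
    assert_eq!(sp.attached, Some(Mgs::Scrimlet14));

    // Disabling MGS on scrimlet 14 makes it stop responding; the SP
    // re-discovers and flips over to scrimlet 16.
    sp.mgs_timed_out();
    sp.discover(&[Mgs::Scrimlet16]);
    assert_eq!(sp.attached, Some(Mgs::Scrimlet16));
}
```

This is also why the workaround described below works: stopping the MGS on scrimlet 14 forces the SP to re-discover and attach to the MGS on scrimlet 16, which is the one wicketd is polling.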
In step 3, the wicketd running on scrimlet 16 is asking its local MGS (literally via `localhost`) for progress information. But if the SP of the sled being updated attaches to the MGS on scrimlet 14, there will be no progress information available.

Today while in this situation on the dogfood rack, we shut down the MGS service (via `svcadm disable mgs`) on scrimlet 14, and more or less immediately started seeing progress, because the SP flipped over and attached to the MGS on scrimlet 16. Re-enabling the service on scrimlet 14 did not affect the continued progress messages, as the SP stayed attached to the scrimlet 16 MGS.

I'm not sure whether or not there's really a bug here: in normal operation I think we would not expect both MGS instances to be serving identical host images at the same time. If we decide we do want this situation to work and show progress, it may be tricky: wicketd has no way to talk to the MGS on the other scrimlet (MGS only listens on `localhost` and, once the control plane is up, the underlay network, and wicketd only talks on `localhost` and the bootstrap network).