jgallagher opened 1 year ago
Summarizing a discussion between John and myself:
The specific condition that caused this to happen in the dogfood rack was the fact that scrimlet 14 is treated as special and not considered part of the cluster during updates.
The typical flow is:
At step 4, the MGS running on scrimlet 16 starts serving the phase 2 trampoline image as well. However, while scrimlet 14 is being mupdated it is rebooted, which means its MGS can no longer serve phase 2 trampoline images. So at any given time, at most one MGS serves up the artifact.
Here is a scenario where we can hit this issue. "Revision 1" and "revision 2" stand for two versions of the Oxide TUF repo, which may be the same but at least share the same phase 2 trampoline image[^1].
This is quite contrived because in general we'd be mupdating the entire rack rather than a single sled or two. However, this scenario is possible if:
[^1]: A design goal of the phase 2 trampoline image (see RFD 345) is that it is updated more rarely than the main Oxide components. We aren't there today, since we're still picking up OS updates for now, but hope to be there in the future.
[^2]: This is mitigated if we suggest to customers to preferentially use one of the two scrimlets. This is going to depend on how customers set things up in the field: for example, if they use a jump box plugged into one of the technician ports, this is less of an issue than if they use a jump box plugged into both ports.
All in all, this is quite unlikely to happen in the field, but is still possible -- and we should be on the lookout for customer reports.
If this happens, it doesn't actually cause breakage -- the UI just stops updating for a while.
However, the specific step at which the UI stops updating -- "downloading installinator" -- is by far the longest step in the process, taking around 80-85% of the overall time to mupdate.
Currently, wicketd asks whatever MGS is running on `localhost` about the last offset fetched by the SP. (In reality it is the host fetching the data, via the SP.)
Instead, wicketd could ask the SP (via MGS) about the last offset fetched by the host. The SP can track this internally and expose it via an API. This alternative approach would solve the issue.
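A minimal, self-contained sketch of what that SP-side tracking could look like. The names here (`HostPhase2Progress`, `record_host_fetch`, `host_phase2_progress`) are invented for illustration; they are not the real SP or MGS API:

```rust
/// Progress of the host fetching the phase 2 image through the SP.
/// (Hypothetical type for illustration, not the real SP/MGS API.)
#[derive(Debug, Clone, Copy)]
struct HostPhase2Progress {
    /// Total size of the image being fetched, if known.
    image_size: u64,
    /// Highest offset the host has successfully read so far.
    offset: u64,
}

#[derive(Default)]
struct SpState {
    host_phase2: Option<HostPhase2Progress>,
}

impl SpState {
    /// Called each time the SP relays a chunk of the phase 2 image from
    /// whichever MGS it is attached to down to the host.
    fn record_host_fetch(&mut self, offset: u64, image_size: u64) {
        self.host_phase2 = Some(HostPhase2Progress { image_size, offset });
    }

    /// Exposed over the management network, so wicketd (via its local MGS)
    /// could query progress no matter which MGS is serving the data.
    fn host_phase2_progress(&self) -> Option<HostPhase2Progress> {
        self.host_phase2
    }
}

fn main() {
    let mut sp = SpState::default();
    // The host has fetched the first 4 MiB of a ~700 MiB image.
    sp.record_host_fetch(4 * 1024 * 1024, 700 * 1024 * 1024);
    println!("{:?}", sp.host_phase2_progress());
}
```

Because the SP is the component actually relaying the image to the host, it sees every chunk the host fetches regardless of which MGS instance is serving the data, so the wicketd on either scrimlet could report accurate progress.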
Thanks for writing that up!
I tentatively tagged this with MVP+1; I think right now we're unlikely to hit this in most environments, but we could make changes (like shipping multiple updates with the same installinator OS image) that would make it more likely.
Cool! For what it's worth, I think there's also cost to us running into this in various development environments. In dev/test, the update sequences are potentially a lot more varied. When we initially hit this, we all assumed something was wrong, that the process was hung, and we started trying to debug it. (not saying this is super high priority either)
If we follow this sequence:
it's possible that the mupdate in step 3 will show no progress information during the "Downloading installinator, waiting for it to start" step, even though installinator is actually being downloaded successfully.
As of the beginning of the mupdate via scrimlet 16 in step 3, the MGS instances on both scrimlets have the installinator OS image and are able to serve it to any SP that asks for it over the management network. When the SP fetches data from MGS, it sends a multicast packet and will "attach" to whichever MGS instance it finds first, unless that instance stops responding (in which case it will start looking for either MGS instance again).
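To make that attach behavior concrete, here's a minimal, self-contained sketch of the logic as described above. The real logic lives in the SP firmware; the types and method names below are invented for illustration:

```rust
// Minimal sketch (not the real SP code) of the attach behavior: the SP
// broadcasts a discovery packet, attaches to whichever MGS answers first,
// and only re-discovers if that instance stops responding.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Mgs {
    Scrimlet14,
    Scrimlet16,
}

struct Sp {
    attached: Option<Mgs>,
}

impl Sp {
    /// `responders` are the MGS instances that answered the multicast
    /// discovery packet, in the order their replies arrived.
    fn discover(&mut self, responders: &[Mgs]) {
        if self.attached.is_none() {
            self.attached = responders.first().copied();
        }
    }

    /// If the currently attached MGS stops responding, drop it and
    /// discover again on the next fetch attempt.
    fn mgs_timed_out(&mut self) {
        self.attached = None;
    }
}

fn main() {
    let mut sp = Sp { attached: None };

    // Both MGS instances have the image; scrimlet 14 happens to reply first,
    // so the SP fetches from it -- and the wicketd on scrimlet 16 sees no
    // progress, because it only asks its own local MGS.
    sp.discover(&[Mgs::Scrimlet14, Mgs::Scrimlet16]);
    assert_eq!(sp.attached, Some(Mgs::Scrimlet14));

    // Disabling MGS on scrimlet 14 makes it stop responding; the SP
    // re-discovers and flips over to scrimlet 16.
    sp.mgs_timed_out();
    sp.discover(&[Mgs::Scrimlet16]);
    assert_eq!(sp.attached, Some(Mgs::Scrimlet16));
}
```

This is also why the workaround described below works: stopping the MGS on scrimlet 14 forces the SP to re-discover and attach to the MGS on scrimlet 16, which is the one wicketd is polling.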
In step 3, the wicketd running on scrimlet 16 is asking its local MGS (literally via `localhost`) for progress information. But if the SP of the sled being updated attaches to the MGS on scrimlet 14, there will be no progress information available.

Today while in this situation on the dogfood rack, we shut down the MGS service (via `svcadm disable mgs`) on scrimlet 14, and more or less immediately started seeing progress, because the SP flipped over and attached to the MGS on scrimlet 16. Re-enabling the service on scrimlet 14 did not affect the continued progress messages, as the SP stayed attached to the scrimlet 16 MGS.

I'm not sure whether or not there's really a bug here: in normal operation I think we would not expect both MGS instances to be serving identical host images at the same time. If we decide we do want this situation to work and show progress, it may be tricky: wicketd has no way to talk to the MGS on the other scrimlet (MGS only listens on `localhost` and, once the control plane is up, the underlay network, and wicketd only talks on `localhost` and the bootstrap network).