oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Nexus should restart `Failed` instances when `boot_on_fault` says to #6491

Closed hawkw closed 1 month ago

hawkw commented 2 months ago

Depends on #6455 (and probably also #6490).

Per RFD 486:

An instance’s boot_on_fault discipline tells Nexus whether to try to recover after retiring a failed VMM. The options are to do nothing (the default) or to try to restart the instance automatically.

We should implement that.

Potentially, we could attempt to schedule a new start saga for an instance as part of the update saga that transitions it to Failed. Regardless of whether we do that, there should definitely be an RPW that periodically lists instances which are in the Failed state and whose boot_on_fault discipline indicates that they should be restarted, and ensures that a start saga is started for each of them. Update sagas which have transitioned an instance to Failed could then just activate that background task.
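A minimal sketch of one activation of such a background task, to make the proposal concrete. Everything here is hypothetical and not taken from Omicron: the `BootOnFault`, `InstanceState`, `Instance`, and `StartSagas` types and the `activate` function are illustrative names, the instance list stands in for a database query, and the in-memory set stands in for whatever mechanism keeps a second start saga from being launched for the same instance.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the `boot_on_fault` discipline described in RFD 486.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BootOnFault {
    /// Do nothing after a failure (the default).
    Nothing,
    /// Try to restart the instance automatically.
    Restart,
}

/// Hypothetical, reduced set of instance states.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InstanceState {
    Running,
    Failed,
}

/// Hypothetical instance record; a real task would read these from the database.
#[derive(Clone, Debug)]
pub struct Instance {
    pub id: u32,
    pub state: InstanceState,
    pub boot_on_fault: BootOnFault,
}

/// Hypothetical saga launcher that remembers which instances already
/// have a start saga in flight, so repeated activations are idempotent.
#[derive(Default)]
pub struct StartSagas {
    in_flight: HashSet<u32>,
}

impl StartSagas {
    /// Returns `true` if a new start saga was recorded for `id`,
    /// or `false` if one was already in flight.
    pub fn ensure_started(&mut self, id: u32) -> bool {
        self.in_flight.insert(id)
    }
}

/// One activation of the task: find `Failed` instances whose discipline
/// asks for a restart and ensure each has a start saga.
/// Returns the ids for which a saga was newly started.
pub fn activate(instances: &[Instance], sagas: &mut StartSagas) -> Vec<u32> {
    instances
        .iter()
        .filter(|i| {
            i.state == InstanceState::Failed
                && i.boot_on_fault == BootOnFault::Restart
        })
        .filter(|i| sagas.ensure_started(i.id))
        .map(|i| i.id)
        .collect()
}
```

Because `activate` only ensures that a saga exists, it is safe to run both on a timer and whenever an update saga moves an instance to Failed: a second activation over the same instances starts nothing new.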

gjcolombo commented 2 months ago

Related: #4872

askfongjojo commented 1 month ago

@hawkw - Is this considered done? Or are we using this issue to track the future work of making boot_on_fault configurable by users? (There may already be a ticket for that, but I haven't located it yet.)

hawkw commented 1 month ago

This is done. I can't believe I opened this issue and forgot to close it. Whoops!