Closed askfongjojo closed 10 months ago
Agreed that this is an unforeseen consequence of #4682. Previously, Nexus::instance_reboot
could assume that instance_request_state
would call handle_instance_put_result
, which would handle these sorts of errors by transitioning instances to Failed. After #4682, callers of instance_request_state
have to decide how to handle state change failures themselves, and instance_reboot
does that by just using the ?
operator to kick errors up to the caller.
Using a similar pattern to instance_clear_migration_ids
(check to see if the state change request returned an error; if it's a SledAgentInstancePutError
and instance_unhealthy()
returns true
, call mark_instance_failed
before returning the error) should restore the old behavior.
The ability to handle propolis zone gone situation may not be by design (though we rely on it quite a bit now during sled planned or unplanned reboot). The ability to handle propolis zone crashing due to unhandled hypervisor or OS exception is probably something the system should have.
This is tracked by #3206.
The behavior change appears to be related to https://github.com/oxidecomputer/omicron/pull/4682. Prior to that, instances were moved to failed state by sled-agent or when user attempted to reboot the instance.
The ability to handle propolis zone gone situation may not be by design (though we rely on it quite a bit now during sled planned or unplanned reboot). The ability to handle propolis zone crashing due to unhandled hypervisor or OS exception is probably something the system should have.