oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
251 stars 39 forks source link

Instances no longer transition to failed state when propolis zone has crashed or is gone #4709

Closed askfongjojo closed 10 months ago

askfongjojo commented 10 months ago

The behavior change appears to be related to https://github.com/oxidecomputer/omicron/pull/4682. Prior to that, instances were moved to failed state by sled-agent or when user attempted to reboot the instance.

The ability to handle propolis zone gone situation may not be by design (though we rely on it quite a bit now during sled planned or unplanned reboot). The ability to handle propolis zone crashing due to unhandled hypervisor or OS exception is probably something the system should have.

gjcolombo commented 10 months ago

Agreed that this is an unforeseen consequence of #4682. Previously, Nexus::instance_reboot could assume that instance_request_state would call handle_instance_put_result, which would handle these sorts of errors by transitioning instances to Failed. After #4682, callers of instance_request_state have to decide how to handle state change failures themselves, and instance_reboot does that by just using the ? operator to kick errors up to the caller.

Using a similar pattern to instance_clear_migration_ids (check to see if the state change request returned an error; if it's a SledAgentInstancePutError and instance_unhealthy() returns true, call mark_instance_failed before returning the error) should restore the old behavior.

The ability to handle propolis zone gone situation may not be by design (though we rely on it quite a bit now during sled planned or unplanned reboot). The ability to handle propolis zone crashing due to unhandled hypervisor or OS exception is probably something the system should have.

This is tracked by #3206.