Open smklein opened 1 year ago
We've seen at least two incidents on rack2 where a failed instance_create unwind resulted in a panic. One was #3212; the other was caused by a networking issue that prevented the VNIC from being cleaned up:
BRM42220031 # cat /var/svc/log/oxide-sled-agent\:default.log | looker -l error
WARNING: Failed to delete OPTE port overlay VNIC while dropping port. The VNIC will not be cleaned up properly, and the xde device itself will not be deleted. Both the VNIC and the xde device must be deleted out of band, and it will not be possible to recreate the xde device until then. Error: DeleteVnicError { name: "vopte4", err: CommandFailure(CommandFailureInfo { command: "/usr/sbin/dladm delete-vnic vopte4", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "dladm: vnic deletion failed: link busy\n" }) }
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_2df3f0da-9b07-47bd-85de-a89fc640383f", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_2df3f0da-9b07-47bd-85de-a89fc640383f': uninstall operation is invalid for shutting_down zones.")) }', sled-agent/src/instance.rs:476:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ May 26 17:59:33 Stopping because all processes in service exited. ]
[ May 26 17:59:33 Executing stop method (:kill). ]
[ May 26 17:59:33 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
This issue tracks testing that if we fail a particular action in a saga, we can safely unwind (performing undo actions) and leave the system in a "clean" state.
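As a rough illustration of what such a test checks, here is a minimal, self-contained sketch. It does not use the real saga machinery: `World`, `Node`, `run_saga`, `demo_saga`, and the create/delete helpers are all hypothetical names invented for this example, and the "clean" check is simply that the simulated state is empty again after the unwind. The idea is to inject a failure at each node in turn, let the undo actions run in reverse order, and assert that nothing is left behind.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the external state a saga mutates
/// (zones, VNICs, database records, and so on).
#[derive(Default, Debug, PartialEq)]
struct World {
    resources: BTreeSet<&'static str>,
}

/// One saga node: a forward action paired with its undo action.
struct Node {
    name: &'static str,
    action: fn(&mut World) -> Result<(), String>,
    undo: fn(&mut World),
}

/// Runs the nodes in order. If `fail_at` is `Some(i)`, node `i` fails with an
/// injected error instead of running its action. On any failure, the undo
/// actions of the nodes that already completed run in reverse order.
fn run_saga(world: &mut World, nodes: &[Node], fail_at: Option<usize>) -> Result<(), String> {
    let mut completed: Vec<&Node> = Vec::new();
    for (i, node) in nodes.iter().enumerate() {
        let result = if fail_at == Some(i) {
            Err(format!("injected failure at {}", node.name))
        } else {
            (node.action)(world)
        };
        match result {
            Ok(()) => completed.push(node),
            Err(e) => {
                for done in completed.iter().rev() {
                    (done.undo)(world);
                }
                return Err(e);
            }
        }
    }
    Ok(())
}

// Hypothetical actions and undo actions for a two-node saga.
fn create_vnic(w: &mut World) -> Result<(), String> {
    w.resources.insert("vnic");
    Ok(())
}
fn delete_vnic(w: &mut World) {
    w.resources.remove("vnic");
}
fn create_zone(w: &mut World) -> Result<(), String> {
    w.resources.insert("zone");
    Ok(())
}
fn delete_zone(w: &mut World) {
    w.resources.remove("zone");
}

fn demo_saga() -> Vec<Node> {
    vec![
        Node { name: "create_vnic", action: create_vnic, undo: delete_vnic },
        Node { name: "create_zone", action: create_zone, undo: delete_zone },
    ]
}

/// Fails each node in turn and reports whether every unwind left `World` empty.
fn unwind_is_clean_at_every_node(nodes: &[Node]) -> bool {
    (0..nodes.len()).all(|i| {
        let mut world = World::default();
        let failed = run_saga(&mut world, nodes, Some(i)).is_err();
        failed && world == World::default()
    })
}

fn main() {
    let nodes = demo_saga();
    assert!(unwind_is_clean_at_every_node(&nodes));
    println!("unwind left a clean state after a failure at each of {} nodes", nodes.len());
}
```

Note that this sketch assumes undo actions cannot themselves fail. The log above shows exactly that case (the VNIC delete and zone uninstall both returned errors), so a real test would also need to cover undo actions that return errors rather than panicking.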
If other sagas are added without unwind safety tests, they should be added to this list.
Related to https://github.com/oxidecomputer/omicron/issues/1799, https://github.com/oxidecomputer/omicron/issues/2094