oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Tracking issue for saga unwind safety #2052

Open smklein opened 1 year ago

smklein commented 1 year ago

This issue tracks tests verifying that if any particular action in a saga fails, the saga can safely unwind (by running its undo actions) and leave the system in a "clean" state.
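As a rough illustration of what such a test checks, here is a minimal, self-contained sketch. It does not use Omicron's actual saga machinery: `World`, `Node`, `run_saga`, `create_saga`, `leaky_saga`, and `first_failure_leaving_residue` are hypothetical names invented for this example. The idea is to inject a failure at each node in turn, let the completed nodes undo in reverse order, and compare the resulting state against the starting state.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the external state a saga mutates
/// (database records, zones, VNICs, and so on).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct World {
    resources: BTreeSet<&'static str>,
}

/// One saga node: a forward action paired with the undo action that reverses it.
struct Node {
    name: &'static str,
    action: fn(&mut World),
    undo: fn(&mut World),
}

/// Runs the saga, injecting a failure at index `fail_at` (None = no failure).
/// On failure, the nodes that already completed are undone in reverse order.
fn run_saga(world: &mut World, saga: &[Node], fail_at: Option<usize>) -> Result<(), String> {
    for (i, node) in saga.iter().enumerate() {
        if fail_at == Some(i) {
            for done in saga[..i].iter().rev() {
                (done.undo)(world);
            }
            return Err(format!("injected failure at node '{}'", node.name));
        }
        (node.action)(world);
    }
    Ok(())
}

/// Fails each node in turn and returns the name of the first failure point
/// after which the world does not match its initial state, if any.
fn first_failure_leaving_residue(saga: &[Node]) -> Option<&'static str> {
    for (i, node) in saga.iter().enumerate() {
        let before = World::default();
        let mut world = before.clone();
        let result = run_saga(&mut world, saga, Some(i));
        if result.is_ok() || world != before {
            return Some(node.name);
        }
    }
    None
}

/// A saga whose undo actions fully reverse their forward actions.
fn create_saga() -> Vec<Node> {
    vec![
        Node {
            name: "create_record",
            action: |w: &mut World| { w.resources.insert("record"); },
            undo: |w: &mut World| { w.resources.remove("record"); },
        },
        Node {
            name: "create_vnic",
            action: |w: &mut World| { w.resources.insert("vnic"); },
            undo: |w: &mut World| { w.resources.remove("vnic"); },
        },
        Node {
            name: "boot_zone",
            action: |w: &mut World| { w.resources.insert("zone"); },
            undo: |w: &mut World| { w.resources.remove("zone"); },
        },
    ]
}

/// Same saga, except the VNIC undo action is a no-op, so it leaks the VNIC.
fn leaky_saga() -> Vec<Node> {
    let mut saga = create_saga();
    saga[1].undo = |_w: &mut World| {};
    saga
}
```

In a test, `first_failure_leaving_residue(&create_saga())` should be `None`, while `first_failure_leaving_residue(&leaky_saga())` should report `boot_zone`: the leaked VNIC only shows up once a later node fails and the unwind reaches the broken undo action. That is why a test of this kind has to fail every node, not just one.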

If other sagas are added without unwind safety tests, they should be added to this list.

Related to https://github.com/oxidecomputer/omicron/issues/1799, https://github.com/oxidecomputer/omicron/issues/2094

askfongjojo commented 1 year ago

We've seen at least two incidents on rack2 where a failed instance_create unwind resulted in a panic. One was #3212; the other was caused by a networking issue that prevented the VNIC from being cleaned up:

BRM42220031 # cat /var/svc/log/oxide-sled-agent\:default.log | looker -l error
WARNING: Failed to delete OPTE port overlay VNIC while dropping port. The VNIC will not be cleaned up properly, and the xde device itself will not be deleted. Both the VNIC and the xde device must be deleted out of band, and it will not be possible to recreate the xde device until then. Error: DeleteVnicError { name: "vopte4", err: CommandFailure(CommandFailureInfo { command: "/usr/sbin/dladm delete-vnic vopte4", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "dladm: vnic deletion failed: link busy\n" }) }
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_2df3f0da-9b07-47bd-85de-a89fc640383f", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_2df3f0da-9b07-47bd-85de-a89fc640383f': uninstall operation is invalid for shutting_down zones.")) }', sled-agent/src/instance.rs:476:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ May 26 17:59:33 Stopping because all processes in service exited. ]
[ May 26 17:59:33 Executing stop method (:kill). ]
[ May 26 17:59:33 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
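The panic in the log above comes from calling `Result::unwrap()` on an `Err` during teardown. As a hedged sketch only (this is not the sled-agent code, and whether to continue after a failed step is itself an assumption), a cleanup path can collect errors and report them instead of panicking. `Step`, `teardown`, and the three step functions are hypothetical; their error strings only mimic the failures shown in the log.

```rust
/// Hypothetical teardown step: a name plus a fallible cleanup function.
struct Step {
    name: &'static str,
    run: fn() -> Result<(), String>,
}

/// Runs every step even if earlier ones fail, returning the collected
/// errors instead of panicking on the first `Err`.
fn teardown(steps: &[Step]) -> Vec<String> {
    let mut errors = Vec::new();
    for step in steps {
        if let Err(e) = (step.run)() {
            errors.push(format!("{}: {}", step.name, e));
        }
    }
    errors
}

// Hypothetical steps that simulate the failures seen in the log.
fn delete_vnic() -> Result<(), String> {
    Err("link busy".to_string())
}

fn uninstall_zone() -> Result<(), String> {
    Err("uninstall operation is invalid for shutting_down zones".to_string())
}

fn release_address() -> Result<(), String> {
    Ok(())
}

fn example_steps() -> Vec<Step> {
    vec![
        Step { name: "delete_vnic", run: delete_vnic },
        Step { name: "uninstall_zone", run: uninstall_zone },
        Step { name: "release_address", run: release_address },
    ]
}
```

Calling `teardown(&example_steps())` returns two error strings and still runs the third step, so the caller can decide whether to retry, log, or mark the resource as needing out-of-band cleanup without taking the whole process down.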