Closed hawkw closed 1 month ago
When merging this, we should also be sure to merge #6658, since otherwise, `SagaUnwound` instances will appear as `Stopped` in the external API even though we will automatically restart them, which feels weird.
Well that's extremely spooky, it looks like this worked fine on commit 0b7f72efe2496a467b7866673de42dcff8621883 but then somehow broke on commit 8f89106b26f05dad3ecda33a936b542e389660bd: https://buildomat.eng.oxide.computer/wg/0/details/01J8T6F6B4TYVZVGS9NVY6RXJ8/m4ivC9CI7YNrcLE1S1dUTosEDmIl3bfax3fd4qNVIe7XiKua/01J8T6G2PKB9TADGAMV5DAR8R8
(also, it occurred to me that we probably want to make unwinding start sagas check if they should immediately kick the reincarnation task...)
Aaaand it passes on my machine:
Finished `test` profile [unoptimized + debuginfo] target(s) in 3m 35s
------------
Nextest run ID a90e00f1-faac-4db4-82a3-f32ca87dd2bd with nextest profile: default
Starting 3 tests across 162 binaries (1536 tests and 5 binaries skipped, including 5 binaries via profile.default.default-filter)
SETUP [ 1/1] crdb-seed: cargo run -p crdb-seed --profile test
[ 00:00:00] [ ] 0/1539:
Compiling nexus-config v0.1.0 (/home/eliza/Code/oxide/omicron/nexus-config)
Compiling omicron-test-utils v0.1.0 (/home/eliza/Code/oxide/omicron/test-utils)
Compiling crdb-seed v0.1.0 (/home/eliza/Code/oxide/omicron/dev-tools/crdb-seed)
Finished `test` profile [unoptimized + debuginfo] target(s) in 3.76s
Running `target/debug/crdb-seed`
Sep 27 18:54:25.474 INFO Using existing CRDB seed tarball: `/tmp/crdb-base-eliza/7888c2fb782f3500cf5404b9680c8152f4554f551acee683d7a98ec218b57e57.tar`
SETUP PASS [ 1/1] crdb-seed: cargo run -p crdb-seed --profile test
PASS [ 14.789s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_reincarnates_failed_instances
PASS [ 28.533s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_cooldown_on_subsequent_reincarnations
PASS [ 29.511s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_only_reincarnates_eligible_instances
------------
Summary [ 33.341s] 3 tests run: 3 passed, 1536 skipped
I bet this is a race between periodic and explicit activations of the reincarnation task. Cool.
When an `instance-start` saga unwinds, any VMM it created transitions to the `SagaUnwound` state. This causes the instance's effective state to appear as `Failed` in the external API. PR #6503 added functionality to Nexus to automatically restart instances that are in the `Failed` state ("instance reincarnation"). However, the current instance-reincarnation task will not automatically restart instances whose instance-start sagas have unwound, because such instances are not actually in the `Failed` state from Nexus' perspective.

This PR implements reincarnation for instances whose `instance-start` sagas have failed. This is done by changing the `instance_reincarnation` background task to query the database for instances which have `SagaUnwound` active VMMs, and then run `instance-start` sagas for them identically to how it runs start sagas for `Failed` instances.

I decided to perform two separate queries to list `Failed` instances and to list instances with `SagaUnwound` VMMs, because the `SagaUnwound` query requires a join with the `vmm` table, and I thought it was a bit nicer to be able to find `Failed` instances without having to do the join, and only do it when looking for `SagaUnwound` ones. Also, having two queries makes it easier to distinguish between `Failed` and `SagaUnwound` instances in logging and the OMDB status output. This ended up being implemented by adding a parameter to the `DataStore::find_reincarnatable_instances` method that indicates which category of instances to select; I had previously considered making the method on the `InstanceReincarnation` struct that finds instances and reincarnates them take the query as a `Fn` taking the datastore and `DataPageParams` and returning an `impl Future` outputting `Result<Vec<Instance>, ...>`, but figuring out generic lifetimes for the pagination stuff was annoying enough that this felt like the simpler choice.

Fixes #6638
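
For readers who want a picture of the "parameter that selects the category" approach described above, here is a rough, in-memory sketch. It is not the code in this PR: the `ReincarnationReason` enum, the simplified `Instance` and `Vmm` structs, and the free function standing in for `DataStore::find_reincarnatable_instances` are all hypothetical, and the real method is async, paginated, and backed by database queries. The only point is that the `Failed` arm never has to look at VMMs, while the `SagaUnwound` arm does the equivalent of the join with the `vmm` table.

```rust
// Hypothetical, simplified stand-ins for the real Nexus database models.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceState {
    Failed,
    Other,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VmmState {
    Running,
    SagaUnwound,
}

#[derive(Clone, Debug)]
struct Instance {
    id: u32,
    state: InstanceState,
    // ID of the instance's active VMM, if it has one.
    active_vmm: Option<u32>,
}

#[derive(Clone, Debug)]
struct Vmm {
    id: u32,
    state: VmmState,
}

// Hypothetical name for the parameter that picks which category to list.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReincarnationReason {
    Failed,
    SagaUnwound,
}

// In-memory analogue of the lookup: returns the IDs of matching instances.
fn find_reincarnatable_instances(
    instances: &[Instance],
    vmms: &[Vmm],
    reason: ReincarnationReason,
) -> Vec<u32> {
    instances
        .iter()
        .filter(|instance| match reason {
            // No VMM lookup needed: the instance record alone decides.
            ReincarnationReason::Failed => {
                instance.state == InstanceState::Failed
            }
            // Equivalent of the join: inspect the active VMM's state.
            ReincarnationReason::SagaUnwound => {
                instance.active_vmm.is_some_and(|vmm_id| {
                    vmms.iter().any(|vmm| {
                        vmm.id == vmm_id
                            && vmm.state == VmmState::SagaUnwound
                    })
                })
            }
        })
        .map(|instance| instance.id)
        .collect()
}
```

Calling the function once per `ReincarnationReason` keeps the two result sets separate, so each restarted instance can be reported along with the reason it was selected.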