oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

[nexus] Reincarnate instances with `SagaUnwound` VMMs #6669

Closed hawkw closed 1 month ago

hawkw commented 1 month ago

When an instance-start saga unwinds, any VMM it created transitions to the SagaUnwound state. This causes the instance's effective state to appear as Failed in the external API. PR #6503 added functionality to Nexus to automatically restart instances that are in the Failed state ("instance reincarnation"). However, the current instance-reincarnation task will not automatically restart instances whose instance-start sagas have unwound, because such instances are not actually in the Failed state from Nexus' perspective.

This PR implements reincarnation for instances whose instance-start sagas have failed. This is done by changing the instance_reincarnation background task to query the database for instances which have SagaUnwound active VMMs, and then run instance-start sagas for them identically to how it runs start sagas for Failed instances.

I decided to perform two separate queries to list Failed instances and to list instances with SagaUnwound VMMs, because the SagaUnwound query requires a join with the vmm table, and I thought it was a bit nicer to be able to find Failed instances without having to do the join, and only do it when looking for SagaUnwound ones. Also, having two queries makes it easier to distinguish between Failed and SagaUnwound instances in logging and the OMDB status output. This ended up being implemented by adding a parameter to the DataStore::find_reincarnatable_instances method that indicates which category of instances to select; I had previously considered making the method on the InstanceReincarnation struct that finds instances and reincarnates them take the query as a Fn taking the datastore and DataPageParams and returning an impl Future outputting Result<Vec<Instance>, ...>,but figuring out generic lifetimes for the pagination stuff was annoying enough that this felt like the simpler choice.

Fixes #6638

hawkw commented 1 month ago

When merging this, we should also be sure to merge #6658, since otherwise, SagaUnwound instances will appear as Stopped in the external API even though we will automatically restart them, which feels weird

hawkw commented 1 month ago

Well that's extremely spooky, it looks like this worked fine on commit 0b7f72efe2496a467b7866673de42dcff8621883 but then somehow broke on commit 8f89106b26f05dad3ecda33a936b542e389660bd: https://buildomat.eng.oxide.computer/wg/0/details/01J8T6F6B4TYVZVGS9NVY6RXJ8/m4ivC9CI7YNrcLE1S1dUTosEDmIl3bfax3fd4qNVIe7XiKua/01J8T6G2PKB9TADGAMV5DAR8R8

hawkw commented 1 month ago

(also, it occurred to me that we probably want to make unwinding start sagas check if they should immediately kick the reincarnation task...)

hawkw commented 1 month ago

Uhh, what: https://buildomat.eng.oxide.computer/wg/0/artefact/01J8T6F6B4TYVZVGS9NVY6RXJ8/m4ivC9CI7YNrcLE1S1dUTosEDmIl3bfax3fd4qNVIe7XiKua/01J8T6G2PKB9TADGAMV5DAR8R8/01J8TBRXKQRMR054WB2ZGFZXZ9/omicron_nexus-c2ed03e721299be0-test_only_reincarnates_eligible_instances.10742.0.log?format=x-bunyan#L18215

hawkw commented 1 month ago

Aaaand it passes on my machine:

    Finished `test` profile [unoptimized + debuginfo] target(s) in 3m 35s
------------
 Nextest run ID a90e00f1-faac-4db4-82a3-f32ca87dd2bd with nextest profile: default
    Starting 3 tests across 162 binaries (1536 tests and 5 binaries skipped, including 5 binaries via profile.default.default-filter)
       SETUP [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
             [ 00:00:00] [                   ]    0/1539:      
   Compiling nexus-config v0.1.0 (/home/eliza/Code/oxide/omicron/nexus-config)
   Compiling omicron-test-utils v0.1.0 (/home/eliza/Code/oxide/omicron/test-utils)
   Compiling crdb-seed v0.1.0 (/home/eliza/Code/oxide/omicron/dev-tools/crdb-seed)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 3.76s
     Running `target/debug/crdb-seed`
Sep 27 18:54:25.474 INFO Using existing CRDB seed tarball: `/tmp/crdb-base-eliza/7888c2fb782f3500cf5404b9680c8152f4554f551acee683d7a98ec218b57e57.tar`
  SETUP PASS [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
        PASS [  14.789s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_reincarnates_failed_instances
        PASS [  28.533s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_cooldown_on_subsequent_reincarnations
        PASS [  29.511s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_only_reincarnates_eligible_instances
------------
     Summary [  33.341s] 3 tests run: 3 passed, 1536 skipped

I bet this is a race between periodic and explicit activations of the reincarnation task. Cool.