oxidecomputer / propolis

VMM userspace for illumos bhyve

Crucible backend needs a mechanism to be reconfigured before restarting after failed LM #230

Open gjcolombo opened 2 years ago

gjcolombo commented 2 years ago

Hypothesized repro steps:

  1. Launch a VM that connects to a set of Crucible downstairs with Crucible generation 1.
  2. Start a migration target that will connect to the same downstairs with Crucible generation 2.
  3. Inject a failure into migration immediately after the target activates (note that this can happen even after #155 is fixed if there is any way for the target to fail to start after Crucible activates).

Expected: The source will wait to be moved back to the 'Running' state. Before that happens, the control plane will set the source's Crucible generation to 3 and direct it to reactivate.
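The expected flow relies on Crucible's generation-number rule: a downstairs only promotes an activation whose generation is higher than the last one it accepted, so once the target activates at generation 2 the source cannot simply come back at generation 1. A minimal sketch of that rule, using hypothetical types rather than the real Crucible code:

```rust
// Minimal sketch (hypothetical types, not the real Crucible code) of the
// generation rule the repro steps assume: a downstairs only promotes an
// activation whose generation exceeds the last one it accepted.
struct Downstairs {
    last_promoted_gen: u64,
}

impl Downstairs {
    /// Accept an activation only if its generation is strictly greater than
    /// the last generation this downstairs promoted.
    fn try_activate(&mut self, requested_gen: u64) -> Result<(), String> {
        if requested_gen > self.last_promoted_gen {
            self.last_promoted_gen = requested_gen;
            Ok(())
        } else {
            Err(format!(
                "gen {} rejected; gen {} already promoted",
                requested_gen, self.last_promoted_gen
            ))
        }
    }
}

fn main() {
    let mut ds = Downstairs { last_promoted_gen: 0 };
    assert!(ds.try_activate(1).is_ok()); // source VM activates at gen 1
    assert!(ds.try_activate(2).is_ok()); // migration target activates at gen 2
    assert!(ds.try_activate(1).is_err()); // source can't reactivate at gen 1...
    assert!(ds.try_activate(3).is_ok()); // ...but can once bumped to gen 3
}
```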

Current state: There are two problems to solve here:

  1. Nexus has no way to update the Crucible generation number of an ensured instance.
  2. The migration state machine doesn't give the source an opportunity to have its Crucible backends reconfigured with a newer generation before it moves back to 'Running' after a failed migration.

leftwo commented 2 years ago

> Nexus has no way to update the Crucible generation number of an ensured instance.

I'm working on the Omicron side to enable it to update a volume's generation number each time it is pulled out of the database. I think this might solve this issue, but what exactly do you want to happen to the "ensured instance"? Is this "ensured_instance" in Propolis, or Omicron?
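For illustration, a minimal sketch of the bump-on-checkout idea, assuming a simplified in-memory datastore; `VolumeRecord`, `Datastore`, and `checkout_volume` are illustrative names, not Omicron's actual interfaces:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the bump-on-checkout idea: every time a volume is
// pulled out of the datastore, its stored generation is incremented and the
// new value returned, so any subsequent activation always presents a higher
// generation than the one before it.
struct VolumeRecord {
    generation: u64,
}

struct Datastore {
    volumes: HashMap<u64, VolumeRecord>,
}

impl Datastore {
    /// Increment and persist the volume's generation, then hand it back.
    fn checkout_volume(&mut self, volume_id: u64) -> Option<u64> {
        let vol = self.volumes.get_mut(&volume_id)?;
        vol.generation += 1;
        Some(vol.generation)
    }
}

fn main() {
    let mut ds = Datastore { volumes: HashMap::new() };
    ds.volumes.insert(1, VolumeRecord { generation: 1 });
    assert_eq!(ds.checkout_volume(1), Some(2)); // first ensure after creation
    assert_eq!(ds.checkout_volume(1), Some(3)); // e.g. re-ensure after a failed migration
}
```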

gjcolombo commented 2 years ago

It's in Propolis. Specifically, this code needs to change: https://github.com/oxidecomputer/propolis/blob/70e019a4e94e936f6ccd32656220a453d7e8c3b1/bin/propolis-server/src/lib/server.rs#L318-L344
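To make the gap concrete, here is a rough sketch of the kind of entry point that code could grow; every name below is hypothetical and only illustrates the shape of the missing mechanism, not propolis-server's actual types:

```rust
use std::collections::HashMap;

// Rough sketch of the missing mechanism: a way to hand an already-ensured
// instance a newer Crucible generation before its backends are reactivated.
struct CrucibleBackend {
    generation: u64,
}

#[derive(Debug)]
enum ReconfigureError {
    UnknownBackend,
    StaleGeneration { current: u64, requested: u64 },
}

struct EnsuredInstance {
    crucible_backends: HashMap<String, CrucibleBackend>,
}

impl EnsuredInstance {
    /// Raise a backend's generation while the instance is paused (e.g. after
    /// a failed migration out), so its next activation presents a generation
    /// newer than whatever the failed target used.
    fn reconfigure_crucible_gen(
        &mut self,
        backend_id: &str,
        new_gen: u64,
    ) -> Result<(), ReconfigureError> {
        let be = self
            .crucible_backends
            .get_mut(backend_id)
            .ok_or(ReconfigureError::UnknownBackend)?;
        if new_gen <= be.generation {
            return Err(ReconfigureError::StaleGeneration {
                current: be.generation,
                requested: new_gen,
            });
        }
        be.generation = new_gen;
        Ok(())
    }
}

fn main() {
    let mut inst = EnsuredInstance { crucible_backends: HashMap::new() };
    inst.crucible_backends
        .insert("disk0".to_string(), CrucibleBackend { generation: 1 });
    // The control plane bumps the generation to 3 before directing reactivation.
    assert!(inst.reconfigure_crucible_gen("disk0", 3).is_ok());
    assert!(inst.reconfigure_crucible_gen("disk0", 2).is_err());
}
```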

Now that I look at it again, this issue is sort of a duplicate of #205, but that issue doesn't track the migration state machine problem I mentioned, so I'll keep this alive at least for that part of the problem.

gjcolombo commented 1 year ago

For MVP, we're consciously going to allow instances to resume immediately after a failed migration out. This is legal because (if we've implemented everything correctly) nothing in the migration protocol destroys state on the source: the entire protocol has to finish before the target can do anything that would render the source unable to resume. This simplifies things considerably up in the control plane.

I think we could maintain this property even with a more complicated migration protocol (e.g. one where the target resumes while still pulling data in from the source), so long as we had a clear point of no return after which it was definitely not safe for the source to resume on its own.
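One way to preserve that property is to make the point of no return an explicit phase that the source checks before resuming on its own. A sketch with made-up phase names, not Propolis's actual migration protocol states:

```rust
// Sketch only: an explicit "point of no return" phase lets the source decide
// whether it may resume after a failure. Phase names are illustrative.
#[derive(PartialEq, PartialOrd)]
enum MigrationPhase {
    Negotiating,
    TransferringState,
    TargetCommitted, // point of no return: the target now owns the instance
    Finished,
}

struct MigrationSource {
    phase: MigrationPhase,
}

impl MigrationSource {
    /// The source may resume on its own only if the migration failed before
    /// the target committed; past that point it must stay paused and let the
    /// control plane sort things out.
    fn may_resume_after_failure(&self) -> bool {
        self.phase < MigrationPhase::TargetCommitted
    }
}

fn main() {
    let early_failure = MigrationSource { phase: MigrationPhase::TransferringState };
    assert!(early_failure.may_resume_after_failure());

    let late_failure = MigrationSource { phase: MigrationPhase::TargetCommitted };
    assert!(!late_failure.may_resume_after_failure());
}
```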

Anyway, triaging this as Unscheduled since we don't really have any concrete plans that need this work right now.