Open gjcolombo opened 2 years ago
Nexus has no way to update the Crucible generation number of an ensured instance.
I'm working on the Omicron side to enable it to update a volume's generation number each time it is pulled out of the database. I think this might solve this issue, but what exactly do you want to happen to the "ensured instance"? Is this "ensured_instance" in Propolis or in Omicron?
It's in Propolis. Specifically this code needs to change: https://github.com/oxidecomputer/propolis/blob/70e019a4e94e936f6ccd32656220a453d7e8c3b1/bin/propolis-server/src/lib/server.rs#L318-L344
Now that I look at it again, this issue is sort of a duplicate of #205, but that issue doesn't track the migration state machine problem I mentioned, so I'll keep this alive at least for that part of the problem.
For MVP, we're consciously going to allow instances to resume immediately after a failed migration out. This is legal because (if we've implemented everything correctly) nothing in the migration protocol destroys state on the source: the entire protocol has to finish before the target can do anything that would render the source unable to resume. This simplifies things considerably up in the control plane.
I think we could maintain this property even with a more complicated migration protocol (e.g. one where the target resumes while still pulling data in from the source), so long as we had a clear point of no return after which it was definitely not safe for the source to resume on its own.
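To make the "point of no return" idea concrete, here is a minimal sketch of how a source might decide what to do after a failed migration out. The phase names, the `OnFailure` enum, and `on_migration_failure` are all hypothetical and are not taken from Propolis; the only thing carried over from the discussion is the rule that the source may resume on its own until the target has passed the point of no return.

```rust
/// Hypothetical phases of a migration out, as seen from the source.
/// These names are illustrative and do not come from Propolis.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SourcePhase {
    /// The source has paused, but no state has been sent yet.
    Paused,
    /// State is being sent; the target has not yet started running.
    Transferring,
    /// Assumed point of no return: the target may already have acted on
    /// shared state, so the source must not resume by itself.
    TargetCommitted,
}

/// What the source does when a migration fails.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OnFailure {
    /// Safe to resume immediately (the MVP behavior described above).
    ResumeLocally,
    /// Wait for the control plane to decide what happens next.
    AwaitControlPlane,
}

/// Decide how to handle a failure based on how far the migration got.
fn on_migration_failure(phase: SourcePhase) -> OnFailure {
    match phase {
        SourcePhase::Paused | SourcePhase::Transferring => OnFailure::ResumeLocally,
        SourcePhase::TargetCommitted => OnFailure::AwaitControlPlane,
    }
}
```

In the MVP protocol as described, the whole protocol finishes before the target does anything destructive, so every failure would fall into the `ResumeLocally` arm; the third phase only matters for a more complicated protocol.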
Anyway, triaging this as Unscheduled since we don't really have any concrete plans that need this work right now.
Hypothesized repro steps:
Expected: The source will wait to be moved back to the 'Running' state. Before that happens, the control plane will set the source's Crucible generation to 3 and direct it to reactivate.
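The expected behavior could be modeled roughly as follows. This is a sketch under stated assumptions, not the Propolis or Crucible API: `SourceDisk`, `deactivate_for_migration`, `reactivate`, and `ReactivateError` are hypothetical names, the starting generation of 1 in the usage note is assumed, and the rule that a reactivation must carry a strictly newer generation is an assumption about the intended semantics.

```rust
/// Hypothetical error returned when the control plane supplies a generation
/// that is not newer than the one the source last used.
#[derive(Debug, PartialEq, Eq)]
enum ReactivateError {
    StaleGeneration { current: u64, requested: u64 },
}

/// Hypothetical stand-in for the source's Crucible-backed disk state.
#[derive(Debug)]
struct SourceDisk {
    generation: u64,
    active: bool,
}

impl SourceDisk {
    /// The source gives up its activation when the migration out begins.
    fn deactivate_for_migration(&mut self) {
        self.active = false;
    }

    /// Reactivate with a generation chosen by the control plane.
    /// Assumption: the new generation must be strictly greater than the
    /// current one; otherwise the request is rejected and nothing changes.
    fn reactivate(&mut self, new_generation: u64) -> Result<(), ReactivateError> {
        if new_generation <= self.generation {
            return Err(ReactivateError::StaleGeneration {
                current: self.generation,
                requested: new_generation,
            });
        }
        self.generation = new_generation;
        self.active = true;
        Ok(())
    }
}
```

With a source that started at an assumed generation 1, calling `reactivate(1)` after `deactivate_for_migration()` is rejected and leaves the disk inactive, while `reactivate(3)` succeeds and leaves the disk active at generation 3, matching the expected outcome above.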
Current state: There are two problems to solve here: