oxidecomputer / crucible

A storage service.
Mozilla Public License 2.0

crucible upstairs fails to replace a downstairs if the new downstairs is unavailable #1425

Closed (leftwo closed 1 month ago)

leftwo commented 1 month ago

Found this while stress testing downstairs replacement.

When a downstairs is replaced, we transition through: Active -> Replacing -> Replaced.

This causes the client task to restart itself.

When the client task comes back online, it tries to connect to the new downstairs. If it connects, it starts negotiation, and because the current state is Replaced, the negotiation steps allow the new downstairs to proceed to the point where we set it to LiveRepairReady and eventually do the repair.

However, if we are in state Replaced and the restarted client task is not able to connect to the new downstairs, it will eventually time out and mark the downstairs as Failed.

Setting the state to Failed means the upstairs now forgets that this downstairs was Replaced. When the new downstairs does come online and we try to negotiate again, we find a downstairs with a different UUID than we expect and (because we forgot it was replaced) we panic.
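
To make the sequence above easier to follow, here is a minimal toy model of it in Rust. It is a sketch, not Crucible's actual code: `DsState` only lists the states named in this issue, and `Client`, `Negotiation`, `connect_timed_out`, `negotiate`, and `simulate` are hypothetical names invented for illustration. UUIDs are stood in for by plain integers, and the UUID mismatch is returned as a value rather than a panic so the two paths can be compared.

```rust
// Toy model only: these types are hypothetical and are not Crucible's real code.

/// Downstairs states named in this issue (the real state set is larger).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DsState {
    Active,
    Replacing,
    Replaced,
    LiveRepairReady,
    Failed,
}

/// Result of a negotiation attempt. The real upstairs panics on a mismatch;
/// this model returns a value instead.
#[derive(Debug, PartialEq, Eq)]
enum Negotiation {
    Proceed(DsState),
    UuidMismatch,
}

/// One client task's view of its downstairs.
struct Client {
    state: DsState,
    /// Stand-in for the UUID of the downstairs we last knew about.
    expected_uuid: u32,
}

impl Client {
    fn new(expected_uuid: u32) -> Self {
        Client { state: DsState::Active, expected_uuid }
    }

    fn set_state(&mut self, next: DsState) {
        self.state = next;
    }

    /// Active -> Replacing -> Replaced, as described in this issue.
    fn replace(&mut self) {
        assert_eq!(self.state, DsState::Active);
        self.set_state(DsState::Replacing);
        self.set_state(DsState::Replaced);
    }

    /// The restarted client task could not reach the new downstairs in time.
    fn connect_timed_out(&mut self) {
        // This overwrite is the problem: the Replaced marker is lost.
        self.set_state(DsState::Failed);
    }

    /// Negotiate with whichever downstairs answered.
    fn negotiate(&mut self, remote_uuid: u32) -> Negotiation {
        // A different UUID is only acceptable if we remember the replacement.
        if remote_uuid != self.expected_uuid && self.state != DsState::Replaced {
            return Negotiation::UuidMismatch;
        }
        // Simplification: every accepted negotiation goes straight to repair.
        self.expected_uuid = remote_uuid;
        self.set_state(DsState::LiveRepairReady);
        Negotiation::Proceed(self.state)
    }
}

/// Replace a downstairs, optionally time out first, then negotiate.
fn simulate(new_ds_reachable: bool) -> Negotiation {
    let (old_uuid, new_uuid) = (1, 2);
    let mut client = Client::new(old_uuid);
    client.replace();
    if !new_ds_reachable {
        client.connect_timed_out();
    }
    client.negotiate(new_uuid)
}
```

In this model, `simulate(true)` reaches `LiveRepairReady`, while `simulate(false)` ends in `UuidMismatch`, which corresponds to the panic described above. The only difference between the two runs is that the timeout path replaced `Replaced` with `Failed` before negotiation.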

leftwo commented 1 month ago

Fixed in #1426