Found this while stress testing downstairs replacement.
When a downstairs is replaced we transition from:
Active -> Replacing -> Replaced.
This causes the client task to restart itself.
When the client task comes back online, it tries to connect to the
new downstairs. If it connects, it starts negotiation, and because the
current state is Replaced, the negotiation steps allow the new
downstairs to proceed to the point where we set it to LiveRepairReady
and eventually do the repair.
However, if we are in state Replaced and the client task restarts but
cannot connect to the new downstairs, it will eventually time out
and mark the downstairs as Failed.
Moving to Failed means the upstairs forgets that this downstairs
was Replaced. When the new downstairs does come online and we try to
negotiate again, we find a downstairs with a different UUID than we expect
and (because we forgot it was replaced) we panic.
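
The sequence above can be modeled with a small, simplified sketch. This is
not the actual upstairs code: the names DsState, Client, connect_timeout,
negotiate and simulate are hypothetical, the UUID is a plain integer, and
the panic is modeled as an Err so both paths can be exercised.

```rust
// Hypothetical, simplified model of the per-downstairs state described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DsState {
    Active,
    Replacing,
    Replaced,
    LiveRepairReady,
    Failed,
}

struct Client {
    state: DsState,
    expected_uuid: u32, // stand-in for the real UUID type
}

impl Client {
    // Replacement walks Active -> Replacing -> Replaced.
    fn replace(&mut self) {
        self.state = DsState::Replacing;
        self.state = DsState::Replaced;
    }

    // The restarted client task could not reach the new downstairs.
    // Overwriting the state here is what loses the "Replaced" information.
    fn connect_timeout(&mut self) {
        self.state = DsState::Failed;
    }

    // Negotiation accepts a new UUID only if we still know we were Replaced.
    fn negotiate(&mut self, remote_uuid: u32) -> Result<(), String> {
        if self.state == DsState::Replaced {
            self.expected_uuid = remote_uuid;
            self.state = DsState::LiveRepairReady;
            return Ok(());
        }
        if remote_uuid != self.expected_uuid {
            // The real code panics at this point; modeled as an error.
            return Err(format!(
                "unexpected UUID {} in state {:?}",
                remote_uuid, self.state
            ));
        }
        Ok(())
    }
}

// Plays the scenario: replace a downstairs, then either connect right away
// or time out once before the new downstairs comes online.
fn simulate(first_connect_ok: bool) -> Result<DsState, String> {
    let mut client = Client { state: DsState::Active, expected_uuid: 1 };
    client.replace();
    if !first_connect_ok {
        client.connect_timeout();
    }
    client.negotiate(2)?; // the new downstairs has a different UUID
    Ok(client.state)
}
```

In this model, simulate(true) reaches LiveRepairReady, while simulate(false)
fails on the UUID check because the Failed state has replaced Replaced.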