Closed gmbnomis closed 3 weeks ago
thanks for the report, @soloestoy please also take a look.
This is an interesting scenario, and the initial solution seems to be able to maintain the "client pause write" state until FAILOVER_IN_PROGRESS
ends. Additionally, there may also be a similar issue when performing cluster failover
in cluster mode, right?
Additionally, there may also be a similar issue when performing cluster failover in cluster mode, right?
The redirect for CME is based on whether the node owns the slot, not whether the node is a primary. So, there shouldn't be a loop since one of the node always owns the slot.
This is an interesting scenario, and the initial solution seems to be able to maintain the "client pause write" state until FAILOVER_IN_PROGRESS ends
This generally sounds about right. We should be entering a pause situation on the primary instead of redirecting duing the failover.
Just to avoid a misunderstanding (I may not have made this clear enough above): During standalone FAILOVER, writing clients will be paused during the entire failover and paused clients are unpaused when the failover finishes (successfully or not).
However, if a client issues the first write command since the beginning of the failover during the time window described above (or a read command in the case of the stale-replica condition), the order of the conditions in processCommand
(redirect, ..., readonly, ..., stale-replica, ..., block client during pause) actually makes the client pause ineffective in some situations.
So, we might want to check which of those error replies we don't want to send during a failover. Or we could re-order conditions (or do a combination of the two). However, re-ordering looks too scary to me to even give it a try...
(I am not familiar with clustered setups, so I can't say anything about these. I looked into failover handling in more detail because of the sentinel improvement issue https://github.com/redis/redis/issues/13118)
I have checked cluster failover
in the cluster mode. Unlike in standalone mode, in cluster failover
, it is initiated by the replica:
cluster failover
command is executed on the replica;CLUSTERMSG_TYPE_MFSTART
message to the primary and enters manual failover state;CLUSTERMSG_TYPE_MFSTART
message, the primary enters a pause write state and sets a timeout, then responds to the replica with its offset (along with the CLUSTERMSG_FLAG0_PAUSED
flag); It can be seen that under normal circumstances, the primary does not unpause until the slot is taken over by its replica and it becomes the new primary. Therefore, in standalone mode, the unpause operation should be performed after receiving the reply to the PSYNC FAILOVER
command sent to the replica.
if (server.failover_state == FAILOVER_IN_PROGRESS) {
if (psync_result == PSYNC_CONTINUE || psync_result == PSYNC_FULLRESYNC) {
clearFailoverState();
} else {
abortFailover("Failover target rejected psync request");
return;
}
}
We are already doing it this way now, @gmbnomis can you provide more details about your issue?
@soloestoy Yes, the locations of pausing and unpausing clients are correct. However, "pausing clients" means that there is a flag indicating that clients issuing new commands (after the pause started) should be blocked.
Actually blocking clients while paused happens here in processCommand
:
https://github.com/valkey-io/valkey/blob/59aa00823c3691c50da452f2df2c4bc24e46696d/src/server.c#L4061
And that check happens later in processCommand
than e.g. the check for redirecting:
https://github.com/valkey-io/valkey/blob/59aa00823c3691c50da452f2df2c4bc24e46696d/src/server.c#L3903
As you said, clients are paused during the FAILOVER_IN_PROGRESS
state. But this does not mean that they are already in blocked state. So:
processCommand
will reply with a REDIRECT, because the part of processCommand
that is responsible for blocking the client isn't reached.I found a way to force FAILOVER_WAIT_FOR_SYNC
and FAILOVER_IN_PROGRESS
respectively by letting the replica sleep (taken from the failover tests).
https://github.com/gmbnomis/valkey/commit/9a26faeb8b885d77e326ce43a7a63627fdfa19c0 is a test that:
demonstrates the erroneous REDIRECT response (instead of blocking the client) and the MASTERDOWN response if replica-serve-stale-data no
is configured.
It also demonstrates another issue: When entering FAILOVER_IN_PROGRESS
, the call to disconnectAllBlockedClients
already disconnects blocked clients that were executing a blocking command with an UNBLOCKED
response.
I think this has two problems:
Should I open a separate issue for this case?
verifies the correct responses in FAILOVER_WAIT_FOR_SYNC
Hi @gmbnomis , I think I understand your point. The main issue is during the FAILOVER_IN_PROGRESS
period, both nodes may be replicas, causing -REDIRECT
to bounce back and forth between the two replicas.
I have seen your PR, but it seems too complicated and does not completely solve the problem (please correct me if I am wrong). I believe a simple and effective approach would be to, when the primary becomes a replica, i.e., during the FAILOVER_IN_PROGRESS
state, replace the behavior that should return -REDIRECT
with pausing the client instead. Then, once the replica truly becomes the primary, i.e., after receiving the response to PSYNC FAILOVER
, the suspended clients' commands will be resumed.
Hi @soloestoy,
I have seen your PR, but it seems too complicated and does not completely solve the problem (please correct me if I am wrong).
my PR solves a different problem (handling of clients that are already blocked (not postponed) before the failover begins). So yes, it does not solve the problem described here.
Describe the bug
A client using the new
CLIENT CAPA redirect
feature may see bouncingREDIRECT
responses during a failover initiated by theFAILOVER
command. To be more precise, during theFAILOVER_IN_PROGRESS
failover state.While the current primary is in
FAILOVER_IN_PROGRESS
state,server.primary_host
is already set to the designated primary. However, the designated primary still acts as a replica until thePSYNC FAILOVER
command is processed.The primary will already reply with a
REDIRECT
(to the designated primary) in this situation. But the latter is still a replica and willREDIRECT
the client back to the old primary.To reproduce
This is hard to reproduce since, under normal circumstances, the duration of the
FAILOVER_IN_PROGRESS
state is very short: During this phase, a connection from the primary to the designated primary is opened to send, among other commands, thePSYNC FAILOVER
command.Nevertheless, in case of e.g. a network glitch, this may take much longer (actually, there seems to be no timeout for this phase(?))
I produced this behavior by blocking the client when receiving the
PSYNC FAILOVER
insyncCommand
.Update: The behavior can also be reproduced by letting the replica sleep, see https://github.com/gmbnomis/valkey/commit/ca708a424c7df482344401ec67f26263074996c7 for a demonstration.
Expected behavior
During the first phase of a failover (
FAILOVER_WAIT_FOR_SYNC
), a writing client will be paused while reads can still proceed. During the second phase (FAILOVER_IN_PROGRESS
) the same behavior is expected.Only after finishing the failover procedure, i.e. when the clients are unpaused again,
REDIRECT
replies should be sent (since the new primary is ready to accept writes now.)Additional information
Note that it isn't sufficient to just ensure that we aren't in a "failover" when sending redirect at https://github.com/valkey-io/valkey/blob/59aa00823c3691c50da452f2df2c4bc24e46696d/src/server.c#L3903, e.g.:
Instead of
REDIRECT
, the client will seeREADONLY
errors duringFAILOVER_IN_PROGRESS
(caused by https://github.com/valkey-io/valkey/blob/59aa00823c3691c50da452f2df2c4bc24e46696d/src/server.c#L3987), which as explained in #319, prevents a smooth switch.We could add a corresponding
server.failover_state == NO_FAILOVER
clause to the respectiveif
. However, it looks likeMASTERDOWN
could be triggered duringFAILOVER_IN_PROGRESS
as well ifreplica-serve-stale-data
isno
(I did not test this).It may also be worth considering whether we should change the order of the if clauses (i.e. pause clients "earlier").