valkey-io / valkey

A flexible distributed key-value datastore that supports both caching and beyond caching workloads.
https://valkey.io

[BUG] `CLIENT CAPA redirect` may result in bouncing redirects during failover #821

Closed gmbnomis closed 3 weeks ago

gmbnomis commented 1 month ago

Describe the bug

A client using the new CLIENT CAPA redirect feature may see bouncing REDIRECT responses during a failover initiated by the FAILOVER command, more precisely during the FAILOVER_IN_PROGRESS failover state.

While the current primary is in FAILOVER_IN_PROGRESS state, server.primary_host is already set to the designated primary. However, the designated primary still acts as a replica until the PSYNC FAILOVER command is processed.

The primary will already reply with a REDIRECT (to the designated primary) in this situation. But the latter is still a replica and will REDIRECT the client back to the old primary.

To reproduce

This is hard to reproduce since, under normal circumstances, the duration of the FAILOVER_IN_PROGRESS state is very short: during this phase, the primary opens a connection to the designated primary to send, among other commands, the PSYNC FAILOVER command.

Nevertheless, in case of e.g. a network glitch, this phase may take much longer (in fact, there seems to be no timeout for it).

I reproduced this behavior by blocking the client when receiving the PSYNC FAILOVER in syncCommand.

Update: The behavior can also be reproduced by letting the replica sleep, see https://github.com/gmbnomis/valkey/commit/ca708a424c7df482344401ec67f26263074996c7 for a demonstration.

Expected behavior

During the first phase of a failover (FAILOVER_WAIT_FOR_SYNC), a writing client will be paused while reads can still proceed. During the second phase (FAILOVER_IN_PROGRESS) the same behavior is expected.

Only after the failover procedure finishes, i.e. when the clients are unpaused again, should REDIRECT replies be sent (since the new primary is then ready to accept writes).

Additional information

Note that it isn't sufficient to just ensure that we aren't in a "failover" when sending the redirect at https://github.com/valkey-io/valkey/blob/59aa00823c3691c50da452f2df2c4bc24e46696d/src/server.c#L3903, e.g.:

    if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host && server.failover_state == NO_FAILOVER &&
        !obey_client && (is_write_command || (is_read_command && !c->flag.readonly))) {
        addReplyErrorSds(c, sdscatprintf(sdsempty(), "-REDIRECT %s:%d", server.primary_host, server.primary_port));
        return C_OK;
    }

Instead of REDIRECT, the client will see READONLY errors during FAILOVER_IN_PROGRESS (caused by https://github.com/valkey-io/valkey/blob/59aa00823c3691c50da452f2df2c4bc24e46696d/src/server.c#L3987), which, as explained in #319, prevents a smooth switch.

We could add a corresponding server.failover_state == NO_FAILOVER clause to the respective if statement. However, it looks like MASTERDOWN could also be triggered during FAILOVER_IN_PROGRESS if replica-serve-stale-data is set to no (I did not test this).

It may also be worth considering whether we should change the order of the if clauses (i.e. pause clients "earlier").

enjoy-binbin commented 1 month ago

Thanks for the report. @soloestoy, please also take a look.

soloestoy commented 1 month ago

This is an interesting scenario, and the initial solution seems to be able to maintain the "client pause write" state until FAILOVER_IN_PROGRESS ends. Additionally, there may also be a similar issue when performing cluster failover in cluster mode, right?

madolson commented 1 month ago

Additionally, there may also be a similar issue when performing cluster failover in cluster mode, right?

The redirect for CME is based on whether the node owns the slot, not whether the node is a primary. So there shouldn't be a loop, since one of the nodes always owns the slot.

This is an interesting scenario, and the initial solution seems to be able to maintain the "client pause write" state until FAILOVER_IN_PROGRESS ends

This generally sounds about right. We should be entering a pause situation on the primary instead of redirecting during the failover.

gmbnomis commented 1 month ago

Just to avoid a misunderstanding (I may not have made this clear enough above): during a standalone FAILOVER, writing clients are paused for the entire failover, and paused clients are unpaused when the failover finishes (successfully or not).

However, if a client issues its first write command since the beginning of the failover during the time window described above (or a read command in the stale-replica case), the order of the conditions in processCommand (redirect, ..., readonly, ..., stale-replica, ..., block client during pause) makes the client pause ineffective in some situations.

So, we might want to check which of those error replies we don't want to send during a failover. Or we could re-order the conditions (or do a combination of the two). However, re-ordering looks too scary to me to even give it a try...

(I am not familiar with clustered setups, so I can't say anything about these. I looked into failover handling in more detail because of the sentinel improvement issue https://github.com/redis/redis/issues/13118)

soloestoy commented 1 month ago

I have checked cluster failover in cluster mode. Unlike in standalone mode, a cluster failover is initiated by the replica:

  1. The cluster failover command is executed on the replica;
  2. The replica sends a CLUSTERMSG_TYPE_MFSTART message to the primary and enters manual failover state;
  3. Upon receiving the CLUSTERMSG_TYPE_MFSTART message, the primary enters a pause write state and sets a timeout, then responds to the replica with its offset (along with the CLUSTERMSG_FLAG0_PAUSED flag);
  4. The replica waits for its offset to be equal to that of the primary and then initiates a vote;
  5. The replica is elected as the new primary and takes over the slot;
  6. The replica becomes the new primary and broadcasts;
  7. The old primary receives the new routing table, notices the change in slot ownership, automatically becomes a replica of the new primary, and unpauses.

It can be seen that, under normal circumstances, the old primary does not unpause until the slot has been taken over by its replica, which becomes the new primary. Therefore, in standalone mode, the unpause operation should be performed after receiving the reply to the PSYNC FAILOVER command sent to the replica.

soloestoy commented 1 month ago

    if (server.failover_state == FAILOVER_IN_PROGRESS) {
        if (psync_result == PSYNC_CONTINUE || psync_result == PSYNC_FULLRESYNC) {
            clearFailoverState();
        } else {
            abortFailover("Failover target rejected psync request");
            return;
        }
    }

We are already doing it this way now, @gmbnomis can you provide more details about your issue?

gmbnomis commented 1 month ago

@soloestoy Yes, the locations of pausing and unpausing clients are correct. However, "pausing clients" means that there is a flag indicating that clients issuing new commands (after the pause started) should be blocked.

Actually blocking clients while paused happens here in processCommand:

https://github.com/valkey-io/valkey/blob/59aa00823c3691c50da452f2df2c4bc24e46696d/src/server.c#L4061

And that check happens later in processCommand than e.g. the check for redirecting:

https://github.com/valkey-io/valkey/blob/59aa00823c3691c50da452f2df2c4bc24e46696d/src/server.c#L3903

As you said, clients are paused during the FAILOVER_IN_PROGRESS state. But this does not mean that they are already in the blocked state: the redirect check runs before the pause check, so such clients receive a REDIRECT reply instead of being blocked.

gmbnomis commented 1 month ago

I found a way to force FAILOVER_WAIT_FOR_SYNC and FAILOVER_IN_PROGRESS respectively by letting the replica sleep (taken from the failover tests).

https://github.com/gmbnomis/valkey/commit/9a26faeb8b885d77e326ce43a7a63627fdfa19c0 is a test that:

  1. demonstrates the erroneous REDIRECT response (instead of blocking the client) and the MASTERDOWN response if replica-serve-stale-data no is configured.

    It also demonstrates another issue: When entering FAILOVER_IN_PROGRESS, the call to disconnectAllBlockedClients already disconnects blocked clients that were executing a blocking command with an UNBLOCKED response.

    I think this has two problems:

    1. In general, this is too early. There is no primary to connect to yet.
    2. Analogous to the "MOVED" case in cluster mode, I think we should issue a REDIRECT to those clients (when we unblock the clients in general.)

    Should I open a separate issue for this case?

  2. verifies the correct responses in FAILOVER_WAIT_FOR_SYNC

soloestoy commented 1 month ago

Hi @gmbnomis , I think I understand your point. The main issue is during the FAILOVER_IN_PROGRESS period, both nodes may be replicas, causing -REDIRECT to bounce back and forth between the two replicas.

I have seen your PR, but it seems too complicated and does not completely solve the problem (please correct me if I am wrong). I believe a simple and effective approach would be: while the primary is becoming a replica, i.e., during the FAILOVER_IN_PROGRESS state, replace the behavior that would return -REDIRECT with pausing the client instead. Then, once the replica has truly become the primary, i.e., after receiving the response to PSYNC FAILOVER, the paused clients' commands are resumed.

gmbnomis commented 1 month ago

Hi @soloestoy,

I have seen your PR, but it seems too complicated and does not completely solve the problem (please correct me if I am wrong).

My PR solves a different problem (handling of clients that are already blocked, not postponed, before the failover begins). So yes, it does not solve the problem described here.