SUNSUBSCRIBE corner case causes client out-of-sync

zuiderkwast commented 1 month ago

When a client is subscribed to a sharded-pubsub channel in cluster mode and the slot is moved to another shard, or deleted, the client receives a spontaneous sunsubscribe push message.

If a client has just sent a SUNSUBSCRIBE command, the client cannot know if the sunsubscribe message is a response to the command or a sponaneous message.

In the following scenario, client 1 believes that the [sunsubscribe, ch, 0] push message is received as a response to SUNSUBSCRIBE. The CLUSTERDOWN error reply remains to be read and it appears to be out of sync, i.e. to the client it doesn't appear to match a command it has sent. (If the client has sent another command in the pipeline, the CLUSTERDOWN appears to be its reply.)

Client 1                Client 2            Primary
  |                          |                 |
  |                          |  DELSLOT        |
  |                          |---------------->|
  | SUNSUBSCRIBE ch          |                 |
  |------------------------------------------->|
  |                          |      OK         |
  | [sunsubscribe, ch, 0]    |<----------------|
  |<-------------------------------------------|
  |                          |                 |
  | -CLUSTERDOWN             |                 |
  |<-------------------------------------------|
  |                          |                 |

Originally posted by @zuiderkwast in https://github.com/valkey-io/valkey/issues/759#issuecomment-2369722723 (but edited)

Test case

This test case passes, i.e. it illustrates what the clients actually see. ```diff diff --git a/tests/unit/cluster/pubsubshard-slot-migration.tcl b/tests/unit/cluster/pubsubshard-slot-migration.tcl index c5a324f09..26d6afe56 100644 --- a/tests/unit/cluster/pubsubshard-slot-migration.tcl +++ b/tests/unit/cluster/pubsubshard-slot-migration.tcl @@ -187,6 +187,32 @@ test "Delete a slot, verify sunsubscribe message" { $subscribeclient close } +test "Slot deleted and unsubscribed simulaneously" { + set channelname ch5 + set slot [$cluster cluster keyslot $channelname] + + array set primary_client [$cluster masternode_for_slot $slot] + + set subscribeclient [valkey_deferring_client_by_addr $primary_client(host) $primary_client(port)] + $subscribeclient HELLO 3 + $subscribeclient read + $subscribeclient SSUBSCRIBE $channelname + $subscribeclient read + + # Delete a slot. + assert_equal OK [$primary_client(link) CLUSTER DELSLOTS $slot] + + # Send in a pipeline SUNSUBSCRIBE and DBSIZE + $subscribeclient SUNSUBSCRIBE $channelname + $subscribeclient DBSIZE + + # We expect one reply per command, plus an implicit sunsubscribed message. + assert_equal "sunsubscribe $channelname 0" [$subscribeclient read] + catch {$subscribeclient read} e + assert_equal "CLUSTERDOWN Hash slot not served" $e + assert_equal 0 [$subscribeclient read] +} + test "Reset cluster, verify sunsubscribe message" { set channelname ch4 set slot [$cluster cluster keyslot $channelname] ```

ranshid commented 1 month ago

Nice!

I guess this could also be solved when we introduce atomic slot migration right?

zuiderkwast commented 1 month ago

I guess this could also be solved when we introduce atomic slot migration right?

No, how would that solve it? This is about the subscribed client getting a spontaneous message when the slot migration happens.

ikolomi commented 1 month ago

Please note that although cumbersome, the resulting local subscription state could be tracked correctly - receiving SUNSUBSCRIBE notification (regardless if caused by the an active command or due to slot migration) allows client to know that it will no longer receive notifications on the channel in question.

If a client has just sent a SUNSUBSCRIBE command, the client cannot know if the sunsubscribe message is a response to the command or a sponaneous message.

Thus, the above does not really matter, although indeed complicates the unsubscription logic.

zuiderkwast commented 1 month ago

Thus, the above does not really matter, although indeed complicates the unsubscription logic.

@ikolomi What do you mean it does not really matter? The -CLUSTERDOWN (or MOVED) is still in the reply pipeline. If the client sends a GET foo or any other command after the SUNSUBSCRIBE, the CLUSTERDOWN or MOVED will appear to be the reply to that GET command. Can you explain how a client can tell whether the error is for the SUNSUBSCRIBE or the GET?

ranshid commented 1 month ago

I guess this could also be solved when we introduce atomic slot migration right?

No, how would that solve it? This is about the subscribed client getting a spontaneous message when the slot migration happens.

Oh I assumed that during atomic slot ownership the command will be blocked until we identify the ownership was migrated so MOVED response will be sent, but you are right that it can still be a problem if we allow unblocking and running the command so the client would get MOVED error which can still be identified as a response for another command.

ranshid commented 1 month ago

I guess this could also be solved when we introduce atomic slot migration right?

No, how would that solve it? This is about the subscribed client getting a spontaneous message when the slot migration happens.

Oh I assumed that during atomic slot ownership the command will be blocked until we identify the ownership was migrated so MOVED response will be sent, but you are right that it can still be a problem if we allow unblocking and running the command so the client would get MOVED error which can still be identified as a response for another command.

BTW I wonder if preventing MOVED error and just reply [sunsubscribe, ch, 0] again would make things better? I mean there is no real point in redirecting unsubscribe messages to a different node IMO.

zuiderkwast commented 1 month ago

Ran, we already get a MOVED today in this race condition. (My example above uses CLUSTERDOWN instead because I deleted the slot instead of migrated it, but it's the only difference.) The client just doesn't understand that this MOVED is the reply to the SUNSUBSCRIBE command it just sent, because there is a 'sunsubscribe' push message in the reply pipeline which arrives to the client first.

When the client is subscribed and doesn't do anything, when a slot is migrated we send an 'sunsubscribed' push notification to the client.

ranshid commented 1 month ago

Ran, we already get a MOVED today in this race condition. (My example above uses CLUSTERDOWN instead because I deleted the slot instead of migrated it, but it's the only difference.) The client just doesn't understand that this MOVED is the reply to the SUNSUBSCRIBE command it just sent, because there is a 'sunsubscribe' push message in the reply pipeline which arrives to the client first.

When the client is subscribed and doesn't do anything, when a slot is migrated we send an 'sunsubscribed' push notification to the client.

Thank you @zuiderkwast, I understand the point. So the client would probably have a double reroute in the case he was pipelining more requests (like ssubscribe or other commands in case of resp3) right? Probably double moved will also cause the client to perform new topology refresh which in turn can lead to high load on the server if there are many subscribers on the same slot/s. I was only thinking if instead of MOVED we could just return that there are zero subscribers on the channel, but later I also thought it might still break some stateless clients. I currently have no good solution for that.

zuiderkwast commented 1 month ago

Load on the server is not the issue. The client getting out of sync and returning the wrong replies back to the user application is.

The best way to avoid this is probably to just never send UNSUBSCRIBE commands. That's what GLIDE does.

It seems a bit broken though and it'd be nicer if the behaviour were different, like if every command would always get one in-band reply (in RESP3). That'd have to be client opt-in though. If we do this, then we can send +OK as a reply to the subscribe/unsubscribe commands when successful and the push replies can be handled completely out-of-band as they were intended to.

zuiderkwast commented 1 month ago

To clarify client getting out of sync: An async client sends a constant stream of commands, receives the stream of repsonseses and needs to match each response with the command it corresponds to, to be able to invoke the right callback or otherwise match it with the code that sent each command. That's what GLIDE does (iiuc) and that's what hiredis does in the async API.

If a client sends GET foo; UNSUBSCRIBE bar; DBSIZE; PING; SET x y and the server responds with "bar"; ("unsubscribed", "bar", 0); -MOVED 1234 xx:6379; 72329; "PONG"; "OK", the client has no way to see that 72329 is the reply to DBSIZE and "PONG" is the reply to PING. It gets out of sync and returns "MOVED" to DBSIZE, 72329 to PING, "PONG" to SET..

ikolomi commented 1 month ago

@zuiderkwast

Thus, the above does not really matter, although indeed complicates the unsubscription logic.

@ikolomi What do you mean it does not really matter? The -CLUSTERDOWN (or MOVED) is still in the reply pipeline. If the client sends a GET foo or any other command after the SUNSUBSCRIBE, the CLUSTERDOWN or MOVED will appear to be the reply to that GET command. Can you explain how a client can tell whether the error is for the SUNSUBSCRIBE or the GET?

I mean that the current subscription state could still be correctly constructed on the client side. Now, regarding the ordering of the commands - Yes, it seems to me that the protocol is broken because the unsubscribe commands return either error or cause a RESP notification. Since the notifications are server originated and asynchronous, it might be impossible to associate the unsubscribe command with the unsubscribe notifications. One way to deal with it is by the following logic:

drain the reply pipeline
send the unsubscribe command
wait for an error reply, timeout or unsubscribe notification
if timed out assume the command succeeded but no channels unsubscribed -> assume client is not subscribed on the channel
if error reply, associate the error with the unsubscribe command -> be on the safe side, assume client subscribed and tear down the connection to ensure unsubscription.
if unsubscribe notification -> assume client is not subscribed on the channel

The problematic parts here are

How long to wait for the notification before assuming the command succeeded and the next reply will not be for this command (As in your example)
Between the draining and the sending of the unsubscribe command, the server might send an unsubscribe notification, breaking the above logic (as the notification will incorrectly indicate a successful command)

Due to these complications, we dont use unsubscription commands in GLIDE, relying on the transport tear down instead.

zuiderkwast commented 1 month ago

if timed out assume the command succeeded but no channels unsubscribed -> assume client is not subscribed on the channel

If no channels are subscribed, the server still sends one notification on the form (unsubscribe, ch, 0) to an UNSUBSCRIBE command (or, since valkey 8.0.0, a -NOSUB error), so I don't think timeout can happen.

Between the draining and the sending of the unsubscribe command, the server might send an unsubscribe notification, breaking the above logic

Yes, this part can still happen even with draining.

ranshid commented 1 month ago

Load on the server is not the issue. The client getting out of sync and returning the wrong replies back to the user application is.

The best way to avoid this is probably to just never send UNSUBSCRIBE commands. That's what GLIDE does.

It seems a bit broken though and it'd be nicer if the behaviour were different, like if every command would always get one in-band reply (in RESP3). That'd have to be client opt-in though. If we do this, then we can send +OK as a reply to the subscribe/unsubscribe commands when successful and the push replies can be handled completely out-of-band as they were intended to.

@zuiderkwast I Understand the problem you refer to and thank you for the patience putting up with my questions :) The way Glide solves the issue is by totally disallowing users from partial unsubscription on a client so they have to teardown the client (@ikolomi correct me if I am wrong). this is somewhat suboptimal IMO but maybe there aren't many such usecases. One I can think of is when the user places temporal keyspace notifications on specific keys (during some sort of client side caching) and removing them after the notification arrives. I like the suggestion you refer to in resp3 to return the response in-band and the notifications in push. we can even provide an option to disconnect the resp2 subscribers for the channel when a slot is migrated. Do you think it is reasonable to extend the CLIENT CAPA to opt-in to this behavior? (maybe as an addition to #962)?

zuiderkwast commented 1 month ago

(maybe as an addition to #962)?

Instead of #962 IMO. CLIENT CAPA pubsub-ok. :grinning:

valkey-io / valkey

SUNSUBSCRIBE corner case causes client out-of-sync #1066