valkey-io / valkey

A flexible distributed key-value datastore that is optimized for caching and other realtime workloads.
https://valkey.io

[BUG] slot stuck in importing state on replica after scale-up and rebalance #998

Open · towest-nr opened 2 months ago

towest-nr commented 2 months ago

Describe the bug

While running a 2-shard cluster with 1 replica per shard, I'm executing a scale-up by bringing a new shard online, adding it to the cluster, and then running valkey-cli --cluster rebalance ... --cluster-use-empty-masters to re-assign slots evenly. Sometimes this process works without any issue, but on multiple occasions the cluster has been left with one or more slots stuck in either an importing or migrating state.
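
For reference, the scale-up roughly amounts to the two commands below (a sketch with placeholder addresses rather than the exact invocations from this setup):

# Add the new, empty primary to the cluster.
valkey-cli --cluster add-node <new-primary-ip>:6379 <existing-node-ip>:6379
# Spread slots evenly, including onto the new empty primary.
valkey-cli --cluster rebalance <existing-node-ip>:6379 --cluster-use-empty-masters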

Example --cluster check output after the rebalance command completed successfully:

127.0.0.1:6379 (1f47aff9...) -> 634702 keys | 5462 slots | 1 replicas.
100.64.172.55:6379 (cfe5e107...) -> 633472 keys | 5461 slots | 1 replicas.
100.64.92.212:6379 (f90d51f1...) -> 634590 keys | 5461 slots | 1 replicas.
[OK] 1902764 keys in 3 primaries.
116.14 keys per slot on average.
>>> Performing Cluster Check (using node 127.0.0.1:6379)
M: 1f47aff94857bb03d7932b01a53fddbfcb44da35 127.0.0.1:6379
   slots:[0-5461] (5462 slots) master
   1 additional replica(s)
S: 86c71d57486a1b3d7b26b2c71c4538c304e67432 100.64.3.150:6379
   slots: (0 slots) slave
   replicates 1f47aff94857bb03d7932b01a53fddbfcb44da35
M: cfe5e107fce3bca937f4c26e96fd8576b2db5c11 100.64.172.55:6379
   slots:[5462-8191],[10923-13653] (5461 slots) master
   1 additional replica(s)
S: 537c00be765db05a7c00b03fdbc149b2ffd728b0 100.64.64.34:6379
   slots: (0 slots) slave
   replicates f90d51f138c86ee0bada66afc89e2916da8ab849
M: f90d51f138c86ee0bada66afc89e2916da8ab849 100.64.92.212:6379
   slots:[8192-10922],[13654-16383] (5461 slots) master
   1 additional replica(s)
S: ba740172376b6be343d28079741a183d83d2fc5b 100.64.145.169:6379
   slots: (0 slots) slave
   replicates cfe5e107fce3bca937f4c26e96fd8576b2db5c11
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
[WARNING] Node 100.64.3.150:6379 has slots in importing state 52.
[WARNING] The following slots are open: 52.
>>> Check slots coverage...
[OK] All 16384 slots covered.

In this case, I ran cluster setslot 52 stable against the primary of replica 86c71d57486a1b3d7b26b2c71c4538c304e67432, and the open slot warning disappeared.
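
For anyone hitting the same symptom, the manual cleanup looks roughly like the following (the slot number comes from the check output above; the address is a placeholder):

# Confirm which node still reports the open slot.
valkey-cli --cluster check 127.0.0.1 6379
# Clear the stale importing/migrating flag; here it was run against the
# primary of the shard whose replica reported the stale state.
valkey-cli -h <primary-ip> -p 6379 cluster setslot 52 stable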

To reproduce

This does not reproduce every single time, and I have only been able to reproduce it while the cluster is under a moderate amount of load. Testing in a local, empty cluster does not appear to yield the same failure scenario. The rough steps are:

  1. Get a cluster set up with 2 shards, 1 replica per shard. No persistence.
  2. Apply write load to the leaders and read load to the replicas
  3. Bring a new shard online and ensure it's added to the cluster (no slots assigned initially)
  4. Run a valkey-cli --cluster rebalance ... --cluster-use-empty-masters command to kick off the rebalance
  5. After the rebalance is complete, run a valkey-cli --cluster check 127.0.0.1 6379 command to look for open slots (see the sketch after this list)
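
A minimal way to script the check in step 5, assuming the cluster is reachable at 127.0.0.1:6379:

# Any output here means at least one slot was left in an importing or
# migrating state after the rebalance.
valkey-cli --cluster check 127.0.0.1 6379 | grep -E 'importing state|migrating state'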

Expected behavior

The rebalance should complete successfully without leaving any slots open.

Additional information

While digging through the logs of the new primary that was brought online and used during the rebalance I found the following, which might be a contributing factor:

1:M 06 Sep 2024 18:55:32.847 * Importing slot 52 from node f90d51f138c86ee0bada66afc89e2916da8ab849 ()
1:M 06 Sep 2024 18:55:32.869 * Starting BGSAVE for SYNC with target: replicas sockets using: normal sync
1:M 06 Sep 2024 18:55:32.870 * Background RDB transfer started by pid 47 to pipe through parent process
1:M 06 Sep 2024 18:55:32.877 * Assigning slot 52 to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef
1:M 06 Sep 2024 18:55:32.885 * Importing slot 53 from node f90d51f138c86ee0bada66afc89e2916da8ab849 ()

The logs above are from the primary node that got brought online. The replica for this primary had the following logs at the same time:

1:S 06 Sep 2024 18:55:32.837 * Slot 51 is migrated from node f90d51f138c86ee0bada66afc89e2916da8ab849 () in shard 1d2a9e8aa5b4937c47c48a96aa294da2f77a68bd to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef.
1:S 06 Sep 2024 18:55:32.869 * Full resync from primary: fb06637f3494b2fe80b9395374bf28e165c8f470:10756256
1:S 06 Sep 2024 18:55:32.878 * Slot 52 is migrated from node f90d51f138c86ee0bada66afc89e2916da8ab849 () in shard 1d2a9e8aa5b4937c47c48a96aa294da2f77a68bd to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef.
1:S 06 Sep 2024 18:55:32.912 * Slot 53 is migrated from node f90d51f138c86ee0bada66afc89e2916da8ab849 () in shard 1d2a9e8aa5b4937c47c48a96aa294da2f77a68bd to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef.
1:S 06 Sep 2024 18:55:32.955 * Slot 54 is migrated from node f90d51f138c86ee0bada66afc89e2916da8ab849 () in shard 1d2a9e8aa5b4937c47c48a96aa294da2f77a68bd to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef.
1:S 06 Sep 2024 18:55:32.961 * PRIMARY <-> REPLICA sync: Flushing old data
1:S 06 Sep 2024 18:55:32.961 * PRIMARY <-> REPLICA sync: Loading DB in memory
1:S 06 Sep 2024 18:55:32.967 * Loading RDB produced by Valkey version 7.9.240
1:S 06 Sep 2024 18:55:32.967 * RDB age 0 seconds
1:S 06 Sep 2024 18:55:32.967 * RDB memory usage when created 29.26 Mb
1:S 06 Sep 2024 18:55:33.006 * Done loading RDB, keys loaded: 6041, keys expired: 0.
1:S 06 Sep 2024 18:55:33.006 * PRIMARY <-> REPLICA sync: Finished with success
1:S 06 Sep 2024 18:55:33.011 * Assigning slot 55 to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef

Perhaps I just need to wait until the replica is fully in sync before running a rebalance command? Or is this something that Valkey should be able to handle automatically?

zuiderkwast commented 2 months ago

Thanks for the report! Which version of valkey are you using?

towest-nr commented 2 months ago

This was version 8.0.0-rc2

PingXie commented 2 months ago

Perhaps I just need to wait until the replica is fully in sync before running a rebalance command?

I think this is the issue. We only replicate slot migration states to healthy replicas (introduced in #879). This is why the replica missed the final CLUSTER SETSLOT <slot> NODE <node> request: it was the only replica and was not online at the time, so we skipped replicating the state. I think we should consider a solution in the CLI tool so that users don't need to explicitly watch the replication status when resharding the cluster.
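
For context, valkey-cli drives each slot move through roughly the documented resharding sequence below; the final NODE step is the one the not-yet-online replica never received (addresses and node IDs are placeholders):

# Target primary: mark the slot as being imported from the source.
valkey-cli -h <target-ip> -p 6379 cluster setslot 52 importing <source-node-id>
# Source primary: mark the slot as migrating to the target.
valkey-cli -h <source-ip> -p 6379 cluster setslot 52 migrating <target-node-id>
# Keys are moved with CLUSTER GETKEYSINSLOT + MIGRATE, then ownership is
# finalized on both sides; this last state change is what a healthy replica
# would also receive via replication.
valkey-cli -h <source-ip> -p 6379 cluster setslot 52 node <target-node-id>
valkey-cli -h <target-ip> -p 6379 cluster setslot 52 node <target-node-id>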

@enjoy-binbin FYI

zuiderkwast commented 2 months ago

@PingXie It seems risky that a replica can be left with stale importing/migrating states indefinitely. Ideally, such a situation should eventually resolve itself automatically.

When a slot migration is complete, the cluster bus propagates the new ownership of the slot to all nodes in the cluster. When this information reaches a replica with stale importing/migrating states, should this stale state be cleared in the replica?

towest-nr commented 2 months ago

I updated my operator to wait for the replica to reach a healthy state before kicking off the rebalance operation and that does appear to have prevented the open slot issue.

However, I agree that it would be ideal if this type of situation could be resolved automatically, without having to build custom logic into an operator to handle it or, even worse, requiring a person to detect and resolve it manually.
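
For reference, the kind of gating described above can be as simple as polling the new replica's replication status before rebalancing; a rough sketch, with placeholder addresses and assuming the Redis-compatible INFO field names:

# Wait until the new replica reports an established link to its primary,
# then start the rebalance.
until valkey-cli -h <new-replica-ip> -p 6379 info replication | grep -q 'master_link_status:up'; do
  sleep 1
done
valkey-cli --cluster rebalance <any-node-ip>:6379 --cluster-use-empty-masters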

PingXie commented 2 months ago

It seems risky that a replica can be left with stale importing/migrating states indefinitely. Ideally, such a situation should eventually resolve itself automatically.

Agreed. We decided to trade consistency for availability with #879 though, which I still think is the right call.

When a slot migration is complete, the cluster bus propagates the new ownership of the slot to all nodes in the cluster. When this information reaches a replica with stale importing/migrating states, should this stale state be cleared in the replica?

That is a good point. Let me double check on that next.

enjoy-binbin commented 2 months ago

When a slot migration is complete, the cluster bus propagates the new ownership of the slot to all nodes in the cluster. When this information reaches a replica with stale importing/migrating states, should this stale state be cleared in the replica?

I thought we were already doing this, so I made that change in #879. I can also take a look today.

PingXie commented 2 months ago

When a slot migration is complete, the cluster bus propagates the new ownership of the slot to all nodes in the cluster. When this information reaches a replica with stale importing/migrating states, should this stale state be cleared in the replica?

Yes I can confirm that the server handles this case correctly.

https://github.com/valkey-io/valkey/blob/58fe9c0138af8a45dfcb906a3d5c631a6d6e9a30/src/cluster_legacy.c#L2504

Even better, we also handle the case where the primary (of slot 52 in this case) fails immediately and its replica is elected to be the new primary with slot 52 stuck in importing.

https://github.com/valkey-io/valkey/blob/58fe9c0138af8a45dfcb906a3d5c631a6d6e9a30/src/cluster_legacy.c#L2560

I believe that if @towest-nr waited a bit longer (cluster_node_timeout), the cluster would go back to a consistent and correct state. @towest-nr, do you mind giving that a try?
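
If that holds, the stale flag should clear itself shortly after the slot moves finish; a quick way to verify (sketch, waiting slightly longer than the configured timeout):

# cluster-node-timeout is in milliseconds; wait a bit longer than that,
# then re-run the check to see whether the open-slot warning is gone.
timeout_ms=$(valkey-cli -h 127.0.0.1 -p 6379 config get cluster-node-timeout | tail -n 1)
sleep $(( timeout_ms / 1000 + 5 ))
valkey-cli --cluster check 127.0.0.1 6379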

PingXie commented 2 months ago

Also, I think the full sync is a key component in reproducing this issue. I’m curious to know if there are any additional insights we can gain from this. @towest-nr, would you mind sharing the complete logs from both the primary and replica leading up to the full sync? If you could share replication-related configs, that will be even more helpful.

towest-nr commented 2 months ago

I believe that if @towest-nr waited a bit longer (cluster_node_timeout), the cluster would go back to a consistent and correct state. @towest-nr, do you mind giving that a try?

@PingXie Just to make sure I understand, is the request here that I wait more than cluster-node-timeout (15 seconds in our case) before doing the rebalance? If so, that's essentially what I've updated my operator to do now and it seems to be working. But it still requires that we not run a rebalance too close to the scale-up action.

@towest-nr, would you mind sharing the complete logs from both the primary and replica leading up to the full sync? If you could share replication-related configs, that will be even more helpful.

Sure. I've attached the logs for both the primary and replica leading up to the rebalance and then a bit after as well. There's nothing else interesting in the logs except just the printout of every assigning/importing statement.

replica-2.log

primary-2.log

As for configs, we're using the defaults for pretty much everything aside from:

client-output-buffer-limit 'replica 2147483648 2147483648 60'
replica-lazy-flush no
appendonly no
io-threads 3
io-threads-do-reads yes
cluster-allow-replica-migration no
save ''

PingXie commented 2 months ago

Thanks @towest-nr! This is just to confirm my understanding and I think what you experienced checks out.

If so, that's essentially what I've updated my operator to do now and it seems to be working.

We can probably consider adding some retry logic to the valkey-cli tool so it handles this situation more gracefully.

Will look into the logs next.

PingXie commented 2 months ago

@towest-nr, would you mind sharing the complete logs from both the primary and replica leading up to the full sync? If you could share replication-related configs, that will be even more helpful.

Sure. I've attached the logs for both the primary and replica leading up to the rebalance and then a bit after as well. There's nothing else interesting in the logs except just the printout of every assigning/importing statement.

Thanks for the logs. There was no re-sync involved in this case; the replica was going through its initial full sync after CLUSTER REPLICATE. It is clear that this full-sync replication was racing with the slot migration. What I don't understand is how the importing state of slot 52 got replicated to the replica before it had even finished the first full sync. I don't see how a replica can become online before it finishes the full sync. Will need to dig some more.

1:M 06 Sep 2024 18:55:32.847 * Importing slot 52 from node f90d51f138c86ee0bada66afc89e2916da8ab849 () // replica is ONLINE but why?
1:M 06 Sep 2024 18:55:32.869 * Starting BGSAVE for SYNC with target: replicas sockets using: normal sync
1:M 06 Sep 2024 18:55:32.870 * Background RDB transfer started by pid 47 to pipe through parent process
1:M 06 Sep 2024 18:55:32.877 * Assigning slot 52 to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef // replica not ONLINE and this is expected because the full sync is not complete yet

zuiderkwast commented 2 months ago

@PingXie The MIGRATING/IMPORTING states are stored in the RDB, right? Could the replica have gotten the stale state from there as part of the full sync?

PingXie commented 2 months ago

@PingXie The MIGRATING/IMPORTING states are stored in the RDB, right? Could the replica have gotten the stale state from there as part of the full sync?

Ah that is it! Thanks, @zuiderkwast! I think we have a complete understanding of this issue.
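
For completeness, the stale state that reaches the replica this way surfaces in its own CLUSTER NODES line, which is what --cluster check flags; a quick way to inspect it directly (address taken from the example above):

# A trailing [52-<-<node-id>] marker on the "myself" line means slot 52 is
# still flagged as importing on this node.
valkey-cli -h 100.64.3.150 -p 6379 cluster nodes | grep myself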