towest-nr opened 2 months ago
Thanks for the report! Which version of valkey are you using?
This was version 8.0.0-rc2
Perhaps I just need to wait until the replica is fully in sync before running a rebalance command?
I think this is the issue. We only replicate slot migration states to healthy replicas (introduced in #879). This is why the replica missed the final CLUSTER SETSLOT <slot> NODE <node> request: it was the only replica and was not online at the time, so we skipped replicating the state. I think we should consider a solution in the cli tool so that cli users don't need to explicitly watch the replication status when resharding the cluster.
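For context, a single-slot reshard driven by valkey-cli roughly follows the sequence sketched below (slot 52 reused from this issue; node IDs and addresses are placeholders). The step the offline replica never received is the final SETSLOT ... NODE.

```
# On the destination primary: mark slot 52 as importing from the source
CLUSTER SETSLOT 52 IMPORTING <source-node-id>
# On the source primary: mark slot 52 as migrating to the destination
CLUSTER SETSLOT 52 MIGRATING <destination-node-id>
# Move the keys, batch by batch, until the slot is empty
CLUSTER GETKEYSINSLOT 52 100
MIGRATE <destination-host> <destination-port> "" 0 5000 KEYS key1 key2
# Finalize: assign slot 52 to the destination; this is the state change the
# replica missed because it was not online
CLUSTER SETSLOT 52 NODE <destination-node-id>
```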
@enjoy-binbin FYI
@PingXie It seems risky that a replica can be left with stale importing/migrating states indefinitely. Ideally, such a situation should eventually resolve itself automatically.
When a slot migration is complete, the cluster bus propagates the new ownership of the slot to all nodes in the cluster. When this information reaches a replica with stale importing/migrating states, should it clear this stale state in the replica?
I updated my operator to wait for the replica to reach a healthy state before kicking off the rebalance operation, and that does appear to have prevented the open slot issue.
However, I agree that it would be ideal if this type of situation could be resolved automatically, without having to build custom logic into an operator to handle it or, even worse, requiring a person to detect and resolve it manually.
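For reference, the wait in the operator boils down to something like this sketch (hostnames and port are placeholders; it assumes the replica exposes the standard INFO replication fields):

```
# Poll the replica until it reports a healthy link to its primary ...
until valkey-cli -h replica-host -p 6379 INFO replication | grep -q 'master_link_status:up'; do
  sleep 1
done
# ... then kick off the rebalance against any reachable cluster node
valkey-cli --cluster rebalance primary-host:6379 --cluster-use-empty-masters
```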
It seems risky that a replica can be left with stale importing/migrating states indefinitely. Ideally, such a situation should eventually resolve itself automatically.
Agreed. We decided to trade consistency for availability with #879 though, which I still think is the right call.
When a slot migration is complete, the cluster bus propagates the new ownership of the slot to all nodes in the cluster. When this information reaches a replica with stale importing/migrating states, should it clear this stale state in the replica?
That is a good point. Let me double check on that next.
When a slot migration is complete, the cluster bus propagates the new ownership of the slot to all nodes in the cluster. When this information reaches a replica with stale importing/migrating states, should it clear this stale state in the replica?
I thought we were already doing this, so I made that change in #879. I can also take a look today.
When a slot migration is complete, the cluster bus propagates the new ownership of the slot to all nodes in the cluster. When this information reaches a replica with stale importing/migrating states, should it clear this stale state in the replica?
Yes, I can confirm that the server handles this case correctly.
Even better, we also handle the case where the primary (of slot 52 in this case) fails immediately and its replica is elected to be the new primary with slot 52 stuck in importing.
I believe if @towest-nr waited a bit longer (cluster_node_timeout), the cluster would go back to a consistent and correct state. @towest-nr, do you mind giving it a try?
Also, I think the full sync is a key component in reproducing this issue. I’m curious to know if there are any additional insights we can gain from this. @towest-nr, would you mind sharing the complete logs from both the primary and replica leading up to the full sync? If you could share replication-related configs, that will be even more helpful.
I believe if @towest-nr waited a bit longer (cluster_node_timeout), the cluster would go back to a consistent and correct state. @towest-nr, do you mind giving it a try?
@PingXie Just to make sure I understand, is the request here that I wait more than cluster-node-timeout (15 seconds in our case) before doing the rebalance? If so, that's essentially what I've updated my operator to do now and it seems to be working. But it still requires that we not run a rebalance too close to the scale-up action.
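If hard-coding the 15 seconds is undesirable, the configured value can be read at runtime; a minimal sketch (hypothetical host/port):

```
# CONFIG GET prints the parameter name followed by its value in milliseconds
timeout_ms=$(valkey-cli -h primary-host -p 6379 CONFIG GET cluster-node-timeout | tail -n 1)
# Sleep a bit longer than cluster-node-timeout before rebalancing
sleep $(( timeout_ms / 1000 + 1 ))
```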
@towest-nr, would you mind sharing the complete logs from both the primary and replica leading up to the full sync? If you could share replication-related configs, that will be even more helpful.
Sure. I've attached the logs for both the primary and replica leading up to the rebalance and then a bit after as well. There's nothing else interesting in the logs except just the printout of every assigning/importing statement.
As for configs, we're using the defaults for pretty much everything aside from:
client-output-buffer-limit 'replica 2147483648 2147483648 60'
replica-lazy-flush no
appendonly no
io-threads 3
io-threads-do-reads yes
cluster-allow-replica-migration no
save ''
Thanks @towest-nr! This is just to confirm my understanding and I think what you experienced checks out.
If so, that's essentially what I've updated my operator to do now and it seems to be working.
We can probably consider adding some retry in the valkey-cli tool to handle this situation more gracefully.
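Until something like that lands in valkey-cli, a similar effect can be approximated externally; a sketch (address reused from the report) that re-checks after the rebalance and asks the cli to repair any open slot left behind:

```
# If the post-rebalance check reports open slots, run the built-in repair
if valkey-cli --cluster check 127.0.0.1 6379 | grep -q '\[WARNING\]'; then
  valkey-cli --cluster fix 127.0.0.1 6379
fi
```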
Will look into the logs next.
@towest-nr, would you mind sharing the complete logs from both the primary and replica leading up to the full sync? If you could share replication-related configs, that will be even more helpful.
Sure. I've attached the logs for both the primary and replica leading up to the rebalance and then a bit after as well. There's nothing else interesting in the logs except just the printout of every assigning/importing statement.
Thanks for the logs. There was no re-sync involved in this case. The replica was going through the initial full sync after CLUSTER REPLICATE. It was clear that this full-sync replication was racing with the slot migration. What I don't understand is how the importing state of slot 52 got replicated to the replica before it even finished the first full sync. I don't see how a replica can become online before it finishes the full sync. Will need to dig some more.
1:M 06 Sep 2024 18:55:32.847 * Importing slot 52 from node f90d51f138c86ee0bada66afc89e2916da8ab849 () // replica is ONLINE but why?
1:M 06 Sep 2024 18:55:32.869 * Starting BGSAVE for SYNC with target: replicas sockets using: normal sync
1:M 06 Sep 2024 18:55:32.870 * Background RDB transfer started by pid 47 to pipe through parent process
1:M 06 Sep 2024 18:55:32.877 * Assigning slot 52 to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef // replica not ONLINE and this is expected because the full sync is not complete yet
@PingXie The MIGRATING/IMPORTING states are stored in the RDB, right? Can the replica have gotten it there as part of the full sync?
@PingXie The MIGRATING/IMPORTING states are stored in the RDB, right? Can the replica have gotten it there as part of the full sync?
Ah that is it! Thanks, @zuiderkwast! I think we have a complete understanding of this issue.
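As a side note for anyone debugging a similar situation: a stale open slot shows up as an importing/migrating marker on the affected node's own ("myself") line in CLUSTER NODES. A quick inspection sketch (hypothetical address):

```
# The myself line carries the open-slot markers, e.g. [52-<-<node-id>] for
# importing and [52->-<node-id>] for migrating
valkey-cli -h replica-host -p 6379 CLUSTER NODES | grep myself
```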
Describe the bug
While running a 2-shard cluster with 1 replica per shard, I'm executing a scale-up action by bringing a new shard online, adding it to the cluster, and then running valkey-cli --cluster rebalance ... --cluster-use-empty-masters to evenly re-assign slots. Sometimes this process works without any issue, but on multiple occasions the cluster has been left with one or more slots stuck in either an importing or migrating state.
Example
--cluster check output after the rebalance command completed successfully:
In this case, I ran a cluster setslot 52 stable against the primary for replica 86c71d57486a1b3d7b26b2c71c4538c304e67432 and the open slot warning disappeared.
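For reference, a generic form of that manual fix (slot 52 is the one from this report; the address is a placeholder, and the command must be sent to the node that still reports the open slot):

```
# Clear a stale importing/migrating flag for slot 52 on the affected node
valkey-cli -h affected-node -p 6379 CLUSTER SETSLOT 52 STABLE
```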
To reproduce
This does not reproduce for me every single time, but I have only been able to reproduce it while running under an average amount of load. Testing in a local, empty cluster does not appear to yield the same failure scenario. The rough steps are (see the command sketch after this list):
1. Bring a new shard online and add it to the cluster.
2. Run the valkey-cli --cluster rebalance ... --cluster-use-empty-masters command to kick off the rebalance.
3. Run the valkey-cli --cluster check 127.0.0.1 6379 command to look for open slots.
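Roughly, the commands behind those steps look like the following sketch (host names and the existing-node address are placeholders):

```
# Add the new (empty) primary to the cluster
valkey-cli --cluster add-node new-shard-host:6379 127.0.0.1:6379
# Rebalance so the new empty primary receives its share of slots
valkey-cli --cluster rebalance 127.0.0.1:6379 --cluster-use-empty-masters
# Look for slots left stuck in importing/migrating state
valkey-cli --cluster check 127.0.0.1 6379
```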
Expected behavior
The rebalance should complete successfully without leaving any slots open.
Additional information
While digging through the logs of the new primary that was brought online and used during the rebalance I found the following, which might be a contributing factor:
The logs above are from the primary node that got brought online. The replica for this primary had the following logs at the same time:
Perhaps I just need to wait until the replica is fully in sync before running a rebalance command? Or is this something that Valkey should be able to handle automatically?