bentotten opened this issue 3 months ago · Status: Open
For backwards compatibility reasons, this is delayed until it is established that the node can properly receive these extensions
I think this issue already existed before the backcompat fix, but I can see that the situation is exacerbated by the further delay introduced by that fix.
I think #573 is safe to backport to 7.2. The replicaof cycle shouldn't occur on 7.2 IMO. That said, I am curious to know if this is something that you have encountered on 7.2 already or if this is more of a preventive measure that you are thinking of?
I think we found a way to avoid most of the issue without needing to reject the shard IDs from replicas: if we have the receiving node save the extensions-supported flag for the sender on MEET, I believe it will send its extensions with the response PONG and remove a lot of this delay.
Thinking about it more, the proposed approach might not fix the delay for nodes met through gossip, so maybe the replica shard-id rejection is still needed.
Update - apparently gossip will be ok
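To illustrate the idea, here is a minimal standalone sketch (hypothetical names such as `node_t`, `NODE_EXT_SUPPORTED`, and the reply helpers; this is not the actual `cluster_legacy.c` code): a MEET that already carries extensions proves the sender understands them, so the receiver can set the flag immediately and attach its own extensions, shard ID included, to the reply PONG instead of waiting for a later PING/PONG round.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NODE_EXT_SUPPORTED (1u << 0) /* hypothetical flag bit */

typedef struct node {
    uint32_t flags;
} node_t;

/* Stand-ins for the real reply paths. */
static void send_pong_with_extensions(node_t *to) { (void)to; printf("PONG + extensions (shard id attached)\n"); }
static void send_pong_plain(node_t *to) { (void)to; printf("PONG without extensions\n"); }

static void handle_meet(node_t *sender, bool meet_has_extensions) {
    /* A MEET that carries extensions proves the sender supports them,
     * so remember that right away... */
    if (meet_has_extensions) sender->flags |= NODE_EXT_SUPPORTED;

    /* ...and the reply PONG can then carry our shard id immediately,
     * removing one full PING/PONG round from shard-id propagation. */
    if (sender->flags & NODE_EXT_SUPPORTED)
        send_pong_with_extensions(sender);
    else
        send_pong_plain(sender);
}

int main(void) {
    node_t peer = {0};
    handle_meet(&peer, true); /* takes the "PONG + extensions" path */
    return 0;
}
```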
That said, I am curious to know if this is something that you have encountered on 7.2 already or if this is more of a preventive measure that you are thinking of?
We are seeing it in test scenarios, yes
Alternatively, we could send two MEETs - one with extensions attached and one without, as any unsupported packets will be thrown out by the receiver
We see this issue on upgrade from Redis 7.0 to Valkey 7.2 (or 8.0.0-rc1). The scenario is:
The cluster nodes file is corrupt, as each node is just assigned a random shard-id if none exists in the file, and the whole cluster hasn't come up on 7.2 yet for them to stabilize.
Right, #573 is not merged yet. We still have some issues with replicaOf cycles that seem to be a lot more prevalent with this change. We will need to root-cause that first before we can consider the backport.
I backported https://github.com/valkey-io/valkey/pull/573 on top of 8.0.0-rc1 and I still see cases during upgrade, before all the nodes have been upgraded, where the 8.0.0 nodes do not have consistent shard-ids for all replicas in their nodes.conf file.
Should I raise a new issue for the upgrade scenario?
@jdork0 do you still see this issue after pulling this? https://github.com/valkey-io/valkey/pull/778
I will try next week and get back to you.
@bentotten I do still see issues when pulling #778.
If you have a cluster running Redis 7.0 where the cluster nodes file has no shard-id, then shut down all 7.0 nodes and start just a single 8.0 node, it will start with a corrupted cluster file. Try to restart that node again and it fails.
I think the problem is the way clusterLoadConfig generates different random shard-ids for each node, even those covering the same shards, if the file doesn't contain shard-ids. I don't think this should generate a corrupt file and then rely on cluster communication to fix it.
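To make the suggestion concrete, here is a rough load-time sketch with placeholder types and helpers (it is not the actual `clusterLoadConfig` code): assign a fresh random shard-id only to primaries that are missing one, then let replicas inherit their primary's id, so an old nodes.conf without shard-ids loads into a self-consistent state instead of relying on cluster communication to repair it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SHARD_ID_LEN 40

typedef struct cfg_node {
    char shard_id[SHARD_ID_LEN + 1];
    struct cfg_node *primary; /* NULL for primaries */
} cfg_node_t;

/* Hypothetical random-hex helper. */
static void random_hex(char *dst, size_t len) {
    static const char hex[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) dst[i] = hex[rand() % 16];
    dst[len] = '\0';
}

/* Run after the whole file has been parsed, so replica->primary links exist. */
static void assign_missing_shard_ids(cfg_node_t **nodes, size_t n) {
    /* Pass 1: primaries missing a shard id each get one fresh random id. */
    for (size_t i = 0; i < n; i++)
        if (nodes[i]->primary == NULL && nodes[i]->shard_id[0] == '\0')
            random_hex(nodes[i]->shard_id, SHARD_ID_LEN);

    /* Pass 2: replicas copy their primary's id, so the loaded config is
     * already consistent within each shard. */
    for (size_t i = 0; i < n; i++)
        if (nodes[i]->primary != NULL)
            memcpy(nodes[i]->shard_id, nodes[i]->primary->shard_id, SHARD_ID_LEN + 1);
}

int main(void) {
    cfg_node_t primary = {"", NULL}, replica = {"", &primary};
    cfg_node_t *nodes[] = {&primary, &replica};
    assign_missing_shard_ids(nodes, 2);
    printf("primary %s\nreplica %s\n", primary.shard_id, replica.shard_id); /* identical ids */
    return 0;
}
```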
@madolson, should I open an issue for the upgrade from 7.0 corruption I described above?
We are seeing these crashes live; re-opening the issue.
I published a fix for this to Redis. See our local repo commit for one way to address this: https://github.com/nokia/redis-redis/commit/cd879cc152e48620b390f1f57f11566b9fbd135f
The issue seems to come from here: https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L686-L691
As @stevelipinski mentioned, we can trust the primary's shard_id value in this case instead of simply exiting the process.
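For reference, a hedged sketch of that behavior with placeholder types (the linked commit above is the authoritative version): when a replica's stored shard id disagrees with its primary's at load time, adopt the primary's value and let the next config save rewrite nodes.conf, rather than treating the file as corrupt and exiting.

```c
#include <stdio.h>
#include <string.h>

#define SHARD_ID_LEN 40

typedef struct node {
    char shard_id[SHARD_ID_LEN + 1];
    struct node *primary; /* NULL for primaries */
} node_t;

/* Returns 1 if the replica's shard id had to be repaired, 0 otherwise. */
static int reconcile_replica_shard_id(node_t *replica) {
    if (replica->primary == NULL) return 0;
    if (memcmp(replica->shard_id, replica->primary->shard_id, SHARD_ID_LEN) == 0) return 0;

    /* Instead of treating the mismatch as a corrupt nodes.conf and exiting,
     * take the primary's shard id as authoritative and let the next config
     * save persist the corrected value. */
    memcpy(replica->shard_id, replica->primary->shard_id, SHARD_ID_LEN + 1);
    return 1;
}

int main(void) {
    node_t primary, replica;
    memset(&primary, 0, sizeof(primary));
    memset(&replica, 0, sizeof(replica));
    memset(primary.shard_id, 'a', SHARD_ID_LEN);
    memset(replica.shard_id, 'b', SHARD_ID_LEN); /* simulated mismatch */
    replica.primary = &primary;
    printf("repaired=%d shard_id=%s\n", reconcile_replica_shard_id(&replica), replica.shard_id);
    return 0;
}
```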
@stevelipinski, would you be able to open a PR with your fix?
Describe the bug
During cluster setup, the shard ID is established through cluster message extension data. For backwards compatibility reasons, this is delayed until it is established that the node can properly receive these extensions, leading to a propagation delay for the shard ID. When an engine crashes or restarts before the shard ID has stabilized, the config file can become corrupted, leading to a failure to restart the engine.
To reproduce
Set up a cluster and then immediately restart a node. It will (flakily) fail to restart due to a corrupted nodes.conf file, either because the replicas do not agree on the shard ID or because there is a shard ID mismatch.
Expected behavior
Engine restarts successfully.
Additional information
Related to this PR - https://github.com/valkey-io/valkey/pull/573