valkey-io / valkey

A flexible distributed key-value datastore that is optimized for caching and other realtime workloads.
https://valkey.io

[BUG] nodes.conf can be corrupted when node is restarted before cluster shard ID stabilizes (for 7.2) #774

Open bentotten opened 3 months ago

bentotten commented 3 months ago

Describe the bug

During cluster setup, the shard ID is established through cluster message extension data. For backwards compatibility reasons, sending these extensions is delayed until it is established that the peer node can properly receive them, which introduces a propagation delay for the shard ID. If an engine crashes or restarts before the shard ID has stabilized, the config file can become corrupted, leading to a failure to restart the engine.
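As a rough illustration of that window (the names below are illustrative, not the actual cluster_legacy.c symbols): the shard-id extension is only attached once the peer is known to support extensions, so a restart before that handshake completes persists whatever placeholder IDs the node generated locally.

```c
/* Illustrative model only; these are not the real cluster_legacy.c symbols. */
#include <stdbool.h>

#define SHARD_ID_LEN 40

typedef struct {
    char shard_id[SHARD_ID_LEN + 1]; /* locally generated placeholder until learned */
    bool supports_extensions;        /* only known after a successful message exchange */
} peer_t;

/* Backwards-compatibility gate: the shard-id extension is attached only once
 * the peer has shown it can parse extensions, which delays shard ID
 * propagation by at least one message round trip. If the process restarts
 * while this still returns false for its peers, nodes.conf is written with
 * the placeholder IDs, and replicas of the same primary can be persisted in
 * different shards. */
static bool should_attach_shard_id_ext(const peer_t *peer) {
    return peer->supports_extensions;
}
```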

To reproduce

Set up a cluster and then immediately restart a node. It will (flakily) fail to restart due to a corrupted nodes.conf file, either because the replicas do not agree on the shard ID or because there is a shard ID mismatch.

Expected behavior

Engine restarts successfully.

Additional information

Related to this PR - https://github.com/valkey-io/valkey/pull/573

PingXie commented 3 months ago

For backwards compatibility reasons, this is delayed until it is established that the node can properly receive these extensions

I think this issue already existed before the backcompat fix, but I can see how the situation is exacerbated by the further delay introduced by the backcompat fix.

I think #573 is safe to backport to 7.2. The replicaof cycle shouldn't occur on 7.2 IMO. That said, I am curious to know if this is something that you have encountered on 7.2 already or if this is more of a preventive measure that you are thinking of?

bentotten commented 3 months ago

I think we found a way to avoid most of the issue without needing to reject the shard IDs from replicas: if we have the receiving node save the extensions-supported flag for the sender on MEET, I believe it will send its extensions with the response PONG and remove a lot of this delay.
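A rough sketch of that idea (hypothetical names, not the actual MEET/PONG handlers): record the sender's extension support while processing the MEET, so the reply can already carry the shard-id extension.

```c
/* Sketch only; hypothetical names, not the actual MEET/PONG handling code. */
#include <stdbool.h>

typedef struct {
    bool supports_extensions;
} peer_t;

/* On receiving a MEET that itself carries extensions, remember that the
 * sender supports them... */
static void on_meet_received(peer_t *sender, bool meet_had_extensions) {
    if (meet_had_extensions) sender->supports_extensions = true;
}

/* ...so the immediate PONG reply can include this node's shard-id extension
 * right away instead of waiting for a later PING round trip. */
static bool pong_should_attach_shard_id(const peer_t *sender) {
    return sender->supports_extensions;
}
```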

bentotten commented 3 months ago

Thinking more, the proposed approach might not fix the delay for nodes met through gossip, so maybe the replica shard id rejection is still needed

Update - apparently gossip will be ok

bentotten commented 3 months ago

That said, I am curious to know if this is something that you have encountered on 7.2 already or if this is more of a preventive measure that you are thinking of?

We are seeing it in test scenarios, yes

bentotten commented 3 months ago

Alternatively, we could send two MEETs - one with extensions attached and one without, as any unsupported packets will be thrown out by the receiver
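As a sketch (send_meet() below is a hypothetical stand-in, not the real send path): the sender emits both variants and relies on the receiver dropping whichever packet it cannot parse.

```c
/* Sketch only; send_meet() is a hypothetical stand-in for the real send path. */
#include <stdbool.h>

typedef struct peer peer_t;
void send_meet(peer_t *target, bool with_extensions);

/* Emit both variants: an old receiver drops the extension-bearing packet it
 * cannot parse but still handles the plain MEET, while a new receiver gets
 * the shard-id extension immediately. */
static void send_meet_both(peer_t *target) {
    send_meet(target, true);
    send_meet(target, false);
}
```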

jdork0 commented 3 months ago

We see this issue on upgrade from redis 7.0 to valkey 7.2 (or 8.0.0-rc1). The scenario is:

The cluster nodes file is corrupt because each node is just assigned a random shard-id if none exists in the file, and the whole cluster hasn't come up at 7.2 yet for the shard-ids to stabilize.

PingXie commented 2 months ago

Right, #573 is not merged yet. We still have some issues with replicaOf cycles that seem to be a lot more prevalent with this change. We will need to root cause that first before we can consider the backport.

jdork0 commented 2 months ago

I backported https://github.com/valkey-io/valkey/pull/573 on top of 8.0.0-rc1, and I still see cases during upgrade, before all the nodes have been upgraded, where the 8.0.0 nodes do not have consistent shard-ids for all replicas in their nodes.conf file.

Should I raise a new issue for the upgrade scenario?

bentotten commented 2 months ago

@jdork0 do you still see this issue after pulling this? https://github.com/valkey-io/valkey/pull/778

jdork0 commented 2 months ago

I will try next week and get back to you.

jdork0 commented 2 months ago

@bentotten I do still see issues when pulling #778.

If you have a cluster running redis 7.0, where the cluster nodes file has no shard-ids, then shut down all 7.0 nodes and start just a single 8.0 node: it will start with a corrupted cluster file. Try to restart that node again and it fails.

I think the problem is the way clusterLoadConfig generates a different random shard-id for each node, even for nodes covering the same shard, if the file doesn't contain shard-ids. I don't think this should generate a corrupt file and then rely on cluster communication to fix it.
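A simplified model of the load-time behaviour described here (illustrative names, not the actual clusterLoadConfig code): each node whose line lacks a shard-id gets an independent random one, without consulting its primary.

```c
/* Illustrative model only; not the actual clusterLoadConfig logic. */
#include <stddef.h>

#define SHARD_ID_LEN 40

typedef struct node {
    char shard_id[SHARD_ID_LEN + 1];
    struct node *primary; /* NULL for primaries */
} node_t;

void gen_random_hex(char *dst, size_t len); /* stand-in for the real generator */

/* Behaviour described above: every node whose nodes.conf line lacks a
 * shard-id field gets an independent random ID at load time; the primary's
 * existing ID is not consulted, so a replica and its primary (or two
 * replicas of the same primary) can be persisted with different shard IDs
 * before gossip has a chance to reconcile them. */
static void assign_missing_shard_id(node_t *n) {
    (void)n->primary; /* not consulted */
    gen_random_hex(n->shard_id, SHARD_ID_LEN);
}
```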

jdork0 commented 1 month ago

@madolson, should I open an issue for the upgrade from 7.0 corruption I described above?

bentotten commented 1 week ago

We are seeing these crashes live, re-opening this issue.

stevelipinski commented 1 week ago

I published a fix for this to Redis. See our local repo commit for one way to address this: https://github.com/nokia/redis-redis/commit/cd879cc152e48620b390f1f57f11566b9fbd135f

pieturin commented 6 days ago

The issue seems to come from here: https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L686-L691. As @stevelipinski mentioned, we can trust the primary's shard ID value in this case instead of simply exiting the process.
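A hedged sketch of that suggestion (illustrative names, not the code at the linked lines): on a replica/primary shard-id mismatch in nodes.conf, adopt the primary's value and continue loading instead of aborting.

```c
/* Sketch only; illustrative names, not the code at the linked lines. */
#include <string.h>

#define SHARD_ID_LEN 40

typedef struct {
    char shard_id[SHARD_ID_LEN + 1];
} node_t;

/* Instead of treating the mismatch as fatal (log + exit), trust the
 * primary's shard ID and keep loading the config. */
static void reconcile_replica_shard_id(node_t *replica, const node_t *primary) {
    if (memcmp(replica->shard_id, primary->shard_id, SHARD_ID_LEN) != 0) {
        memcpy(replica->shard_id, primary->shard_id, SHARD_ID_LEN + 1);
    }
}
```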

@stevelipinski, would you be able to open a PR with your fix?