redis / redis

Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, Streams, HyperLogLogs, Bitmaps.
http://redis.io

Retain a down primary's slot info after automatic fail over #10155

Open PingXie opened 2 years ago

PingXie commented 2 years ago

The problem/use-case that the feature addresses

Today, when running an HA Redis cluster, if the machine on which a primary was running goes down permanently, the Redis cluster automatically fails over the primary role to a replica. This is all good. One way to bring the redundancy level back is to start a Redis server on a new, healthy machine and add it as a replica to the impacted shard. The problem is that there is no easy way to quickly determine which primary the new Redis server instance should replicate from. "CLUSTER NODES" still shows the old/down primary, but there are no slot ranges associated with it. We could manually count the number of live replicas in each shard and infer from there which shard is short of replicas, but this gets tedious and error-prone when there are many shards in the cluster.
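For illustration (the addresses, epochs, and the second node ID below are made up), the "CLUSTER NODES" output after such a failover looks roughly like this today: the down primary keeps its master,fail flags but carries no slot ranges, while the promoted replica now lists them, so nothing ties the dead node back to its old shard:

72f66f4ecbcfc3055faae01a68bbeadc5079bb92 127.0.0.1:30001@40001 master,fail - 1642667704665 1642667703661 1 disconnected
3f0cde6a1bd2e4c89f7a05b6c1d2e3f4a5b6c7d8 127.0.0.1:30004@40004 master - 0 1642667705000 7 connected 0-5460

Once the right primary has been identified, the new instance can be attached to it with "CLUSTER REPLICATE <node-id>".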

Description of the feature

The ask here is to retain the down primary's slot info in a "back pocket" in the clusterNode structure so that it is not lost after an automatic failover. This information can then be used to identify the new owner of those slots, i.e., the primary that the new replica should follow. Here is an example:

72f66f4ecbcfc3055faae01a68bbeadc5079bb92 127.0.0.1:30001@40001 master,fail - 1642667704665 1642667703661 1 disconnected 0-5460
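As a rough sketch of the mechanism (not the actual cluster.c code: the struct, field, and function names below are invented for illustration, and the real clusterNode layout differs), the idea is to copy the failed primary's slot bitmap into a secondary bitmap before the live one is cleared, and to render that copy back into ranges such as 0-5460 when serving CLUSTER NODES:

/* Minimal, self-contained sketch of the "back pocket" idea. */
#include <stdio.h>
#include <string.h>

#define CLUSTER_SLOTS 16384

typedef struct {
    unsigned char slots[CLUSTER_SLOTS/8];          /* live ownership bitmap */
    unsigned char retained_slots[CLUSTER_SLOTS/8]; /* hypothetical "back pocket" copy */
} nodeSlots;

static int slotIsSet(const unsigned char *bitmap, int slot) {
    return (bitmap[slot/8] >> (slot & 7)) & 1;
}

/* Hypothetical hook run when the node is marked FAIL: remember the current
 * ownership, then clear the live bitmap as the failover path already does. */
static void retainSlotsOnFail(nodeSlots *n) {
    memcpy(n->retained_slots, n->slots, sizeof(n->slots));
    memset(n->slots, 0, sizeof(n->slots));
}

/* Render the retained ownership as slot ranges, e.g. "0-5460". */
static void printRetainedRanges(const nodeSlots *n) {
    int start = -1;
    for (int s = 0; s <= CLUSTER_SLOTS; s++) {
        int set = (s < CLUSTER_SLOTS) && slotIsSet(n->retained_slots, s);
        if (set && start == -1) start = s;
        if (!set && start != -1) {
            if (start == s-1) printf("%d ", start);
            else printf("%d-%d ", start, s-1);
            start = -1;
        }
    }
    printf("\n");
}

int main(void) {
    nodeSlots n;
    memset(&n, 0, sizeof(n));
    /* Pretend this node owned slots 0-5460 before it went down. */
    for (int s = 0; s <= 5460; s++) n.slots[s/8] |= 1 << (s & 7);
    retainSlotsOnFail(&n);
    printRetainedRanges(&n); /* prints: 0-5460 */
    return 0;
}

Details such as when to drop the retained info (for example after CLUSTER FORGET, or once the node is replaced) are left out of the sketch.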

Alternatives you've considered

Manually count the number of replicas in each shard and figure out their zonal location.

Additional information

Note that in the non-HA case, this slot info is retained, since there is no automatic failover. The proposed change helps close this inconsistency between the HA and non-HA cases.

I have a private change ready to be submitted.

PingXie commented 2 years ago

@madolson I am planning to repurpose this issue and add the failed-node tracking logic to the new CLUSTER SHARDS command (#10293). I would like to clarify the high-level requirements with you before resurrecting the PR.
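For concreteness, a failed primary tracked this way might show up in a CLUSTER SHARDS reply roughly as follows (illustrative only: the exact field set and formatting of the new command may differ, and the node ID is reused from the example above):

127.0.0.1:30002> CLUSTER SHARDS
1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5460
   3) "nodes"
   4) 1)  1) "id"
          2) "72f66f4ecbcfc3055faae01a68bbeadc5079bb92"
          3) "port"
          4) (integer) 30001
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "master"
         11) "replication-offset"
         12) (integer) 0
         13) "health"
         14) "failed"
      2) ... (the promoted replica's entry, with health "online", would follow here)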

  1. When a node A is declared FAIL by the cluster, existing nodes in the cluster should remember its shard membership and reflect that in future CLUSTER SHARDS responses;
  2. A node B that joins the cluster after A's departure, however, will not automatically have this knowledge, and hence will not include node A in its CLUSTER SHARDS response. Is this acceptable?
  3. A node C, who had exchanged gossip messages with node A in the past, gets restarted. Should node C remember A's shard membership after the restart? If the answer is yes, it would require a change to the nodes.conf format, i.e., adding a failed flag to the CLUSTER NODES output. This was considered an app-compat risk by @zuiderkwast and it is the reason why I abandoned the original fix.