redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License

PartitionSelectorException in Lettuce Client Triggered by Addition of New Slave Nodes in Redis Cluster #2769

Open wmxl opened 8 months ago

wmxl commented 8 months ago

Bug Report

Current Behavior

A PartitionSelectorException occurs under specific conditions while a Redis cluster is adding a new slave node. If the Lettuce client's periodic topology refresh coincides with the addition of the new slave node, there is a small chance of encountering io.lettuce.core.cluster.PartitionSelectorException: Cannot determine a partition for slot ... (full stack trace below).

Stack trace:

```java
io.lettuce.core.cluster.PartitionSelectorException: Cannot determine a partition for slot 4419.
    at io.lettuce.core.cluster.PooledClusterConnectionProvider.getWriteConnection(PooledClusterConnectionProvider.java:164) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.cluster.PooledClusterConnectionProvider.getConnectionAsync(PooledClusterConnectionProvider.java:149) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.cluster.ClusterDistributionChannelWriter.doWrite(ClusterDistributionChannelWriter.java:171) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:105) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.RedisChannelHandler.dispatch(RedisChannelHandler.java:215) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:216) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:676) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.AbstractRedisAsyncCommands.del(AbstractRedisAsyncCommands.java:629) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.cluster.RedisAdvancedClusterAsyncCommandsImpl.del(RedisAdvancedClusterAsyncCommandsImpl.java:175) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.cluster.RedisAdvancedClusterAsyncCommandsImpl.del(RedisAdvancedClusterAsyncCommandsImpl.java:166) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at jdk.internal.reflect.GeneratedMethodAccessor591.invoke(Unknown Source) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:122) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80) ~[lettuce-core-6.2.3.RELEASE.jar!/:6.2.3.RELEASE]
    at com.sun.proxy.$Proxy141.del(Unknown Source) ~[?:?]
    at cn.techwolf.cache.service.impl.LettuceClusterCacheServiceImpl.del(LettuceClusterCacheServiceImpl.java:573) ~[techwolf-cache-client-1.0.129-SNAPSHOT.jar!/:?]
    at com.zhipin.service.common.test.RedisClusterLoadTestController.delete(RedisClusterLoadTestController.java:264) ~[zhipin-common-api-impl-1.0.615-SNAPSHOT.jar!/:?]
    at com.zhipin.service.common.test.RedisClusterLoadTestController.lambda$mix$2(RedisClusterLoadTestController.java:104) ~[zhipin-common-api-impl-1.0.615-SNAPSHOT.jar!/:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
```

Input Code

The code responsible for the bug (PartitionsConsensusImpl.KnownMajority in Lettuce):

```java
static final class KnownMajority extends PartitionsConsensus {

    @Override
    Partitions getPartitions(Partitions current, Map<RedisURI, Partitions> topologyViews) {

        if (topologyViews.isEmpty()) {
            return current;
        }

        List<VotedPartitions> votedList = new ArrayList<>();

        for (Partitions partitions : topologyViews.values()) {

            int knownNodes = 0;
            for (RedisClusterNode knownNode : current) {

                if (partitions.getPartitionByNodeId(knownNode.getNodeId()) != null) {
                    knownNodes++;
                }
            }

            votedList.add(new VotedPartitions(knownNodes, partitions));
        }

        Collections.shuffle(votedList);
        Collections.sort(votedList, (o1, o2) -> Integer.compare(o2.votes, o1.votes));

        return votedList.get(0).partitions;
    }
}
```

Expected behavior/code

The expected behavior is for the Lettuce client to handle the topology refresh seamlessly without throwing a PartitionSelectorException, even when a new slave node is being added to the Redis cluster.

Environment

Possible Solution

The proposed solution involves modifying the KnownMajority class within PartitionsConsensus. The adjustment aims to account for master nodes with 0 assigned slots, which occur transiently when a new node is added to the cluster. This change ensures that these nodes are not incorrectly voted for during topology consensus, preventing the PartitionSelectorException.

    // In PartitionsConsensusImpl.KnownMajority
    // Additional logic to handle master nodes with 0 assigned slots during the addition of a new node
    for (Partitions partitions : topologyViews.values()) {

        int knownNodes = 0;
        for (RedisClusterNode knownNode : current) {

            if (partitions.getPartitionByNodeId(knownNode.getNodeId()) != null) {
                knownNodes++;
                RedisClusterNode partitionByNodeId = partitions.getPartitionByNodeId(knownNode.getNodeId());

                // A master node should not have slots size = 0. This can happen while a new node is being added
                // to the cluster: a CLUSTER NODES command sent to the newly added node may return an incorrect
                // result such as
                //   nodeId ip@port master 0 1707379882000 0 disconnected            (getSlots().size() == 0)
                // whereas a normal result looks like
                //   nodeId ip@port master 0 1707379884310 54 connected 5461-10922   (getSlots().size() > 0)
                // Therefore we should not vote for topology views that include a master node with 0 assigned slots.
                if (partitionByNodeId.is(RedisClusterNode.NodeFlag.UPSTREAM) && partitionByNodeId.getSlots().size() == 0) {
                    knownNodes--;
                }
            }
        }

        votedList.add(new VotedPartitions(knownNodes, partitions));
    }

Additional context

The issue arises due to the two-step process of adding a slave node to a Redis cluster, where the node is initially added as a master before its role changes to a slave. If a topology refresh occurs during this period and connects to the newly added node, the returned cluster nodes information may be incorrect. This incorrect information leads to a situation where a master node might be reported with 0 assigned slots, triggering the PartitionSelectorException. The proposed solution in the KnownMajority class's logic aims to prevent this by adjusting the voting mechanism to disregard these transient states.

[Screenshot: CLUSTER NODES output with the problematic entry highlighted in a red box]

The result in the red box causes the slot size of 192.168.21.32:6479 (master) in the topology obtained by Lettuce to be empty (slots size = 0).

How to reproduce the error

  1. Set refreshPeriod to 1 second in the Lettuce client configuration to increase the frequency of topology refreshes (a configuration sketch follows the scripts below).
  2. Use two shell scripts to repeatedly simulate adding a new slave node to the cluster: the a.sh script removes a node from the cluster, stops the Redis server, cleans up its data, and then re-adds the node as a slave of a specific master; the cron.sh script executes a.sh repeatedly to simulate the node-addition process many times, introducing potential timing conflicts with the topology refresh.
  3. Write a program that continuously executes simple Redis commands through the Lettuce client. If the program runs long enough, it will eventually encounter the PartitionSelectorException. My two shell scripts: a.sh
    
    #!/usr/bin/env bash

    source /etc/profile
    myid=$(redis-cli -h 192.168.26.159 -p 6479 -a g4BaiN7u9dMq cluster myid | tail -1)
    echo $myid
    redis-cli --cluster del-node 192.168.26.159:6479 $myid -a g4BaiN7u9dMq
    service redis-cluster stop 6479
    rm -f /data/redis-6.2.10-Linux-x86_64/dbdata/6479/dump.rdb /data/redis-6.2.10-Linux-x86_64/conf/nodes.6479
    echo 'Start Redis server after 10 seconds'
    sleep 10
    echo 'Start Redis Server'
    service redis-cluster start 6479
    echo 'Meet Node after 10 seconds......'
    sleep 10
    echo 'Start Meet.......'
    redis-cli -a g4BaiN7u9dMq --cluster add-node 192.168.26.159:6479 192.168.26.59:6479 --cluster-slave --cluster-master-id 2ffd17608f6c56fd1ab8cb07679d6a32af8f54b4

cron.sh

    #!/usr/bin/env bash

    source /etc/profile

    for ((i=1; i<=100000; i++)); do
        echo $i
        ./a.sh
        echo "sleep 10 seconds"
        sleep 10
    done
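For reference, a minimal sketch of the client setup from step 1, assuming Lettuce 6.x; the class name, seed URI, and the absence of authentication are illustrative placeholders rather than values from this report:

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class RefreshConfigSketch {

    public static void main(String[] args) {
        // Aggressive 1-second periodic refresh makes the race with node addition easier to hit.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(1))
                .build();

        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("192.168.26.59", 6479));
        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

        // ... issue simple commands (e.g. DEL) in a loop against the cluster here ...

        client.shutdown();
    }
}
```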

mp911de commented 8 months ago

PartitionsConsensus isn't designed to sort out replica reconfiguration issues; it is intended to prevent partition splits (split-brain) so that the topology refresh stays with the group of nodes that shall remain in a cluster setup.

For your case, I suggest subclassing RedisClusterClient and overriding determinePartitions with a bit of code that suits your use-case.
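A minimal sketch of that suggestion, assuming Lettuce 6.x where RedisClusterClient exposes a protected constructor and the protected determinePartitions hook; the subclass name is made up for illustration:

```java
import java.util.Map;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.models.partitions.Partitions;
import io.lettuce.core.resource.ClientResources;

// Hypothetical subclass; determinePartitions(...) is the extension point mentioned above.
class CustomRedisClusterClient extends RedisClusterClient {

    protected CustomRedisClusterClient(ClientResources clientResources, Iterable<RedisURI> redisURIs) {
        super(clientResources, redisURIs);
    }

    @Override
    protected Partitions determinePartitions(Partitions current, Map<RedisURI, Partitions> topologyViews) {
        // Inspect or filter topologyViews to suit the use case (for example, drop views
        // that report a master without slots), then delegate to the default behavior.
        return super.determinePartitions(current, topologyViews);
    }
}
```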

wmxl commented 8 months ago

PartitionsConsensus isn't designed to sort out replica reconfiguration issues; it is intended to prevent partition splits (split-brain) so that the topology refresh stays with the group of nodes that shall remain in a cluster setup.

For your case, I suggest subclassing RedisClusterClient and overriding determinePartitions with a bit of code that suits your use-case.

Thank you for your response. I understand that PartitionsConsensus is primarily aimed at avoiding split-brain scenarios. However, adding slave nodes is a quite common operation for Redis clusters, and encountering PartitionSelectorException during this process is possible, even if the chance is small. (This has happened twice in my company while expanding our Redis cluster from a 1 master & 1 slave setup to a 1 master & 2 slaves configuration; one of our application servers threw PartitionSelectorException for 15 seconds, which corresponds to our refresh period.)

Given this, I believe enhancing Lettuce to better handle the addition of nodes during topology refreshes could benefit many users.

Since PartitionsConsensus is aimed at preventing partition splits, perhaps we could add some filter logic in determinePartitions, something like this:

    protected Partitions determinePartitions(Partitions current, Map<RedisURI, Partitions> topologyViews) {

        // Filter out invalid topology views where master nodes have 0 slots.
        Map<RedisURI, Partitions> filteredTopologyViews = new HashMap<>();
        for (Map.Entry<RedisURI, Partitions> entry : topologyViews.entrySet()) {
            Partitions partitions = entry.getValue();
            boolean isValid = true;

            for (RedisClusterNode node : partitions) {
                if (node.is(RedisClusterNode.NodeFlag.UPSTREAM) && node.getSlots().isEmpty()) {
                    isValid = false;
                    break;
                }
            }

            if (isValid) {
                filteredTopologyViews.put(entry.getKey(), partitions);
            }
        }

        // PartitionsConsensus code, applied to filteredTopologyViews...
    }

In my testing, after implementing such modifications, the PartitionSelectorException never occurred again, even in the midst of refreshing the topology upon adding nodes.

I hope this suggestion might be helpful, and I look forward to your thoughts.

mp911de commented 8 months ago

A node can be valid if it doesn't hold any slots, e.g. for Pub/Sub usage. I wondered whether it would make sense to make PartitionsConsensus a public API so you could configure something in ClusterClientOptions.

Alternatively, you can enable adaptive topology refresh. Before the exception is thrown, the adaptive trigger UNCOVERED_SLOT is emitted and you can reduce the refresh period without touching periodic refresh.
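As a rough illustration of that alternative (not code from this issue), enabling the UNCOVERED_SLOT trigger alongside an existing 15-second periodic refresh might look like the sketch below; the 5-second adaptive timeout is a placeholder value:

```java
import java.time.Duration;

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions.RefreshTrigger;

class AdaptiveRefreshSketch {

    static ClusterClientOptions adaptiveOptions() {
        // Keep the periodic refresh as configured and additionally react to the
        // UNCOVERED_SLOT adaptive trigger, which fires before the exception is thrown.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(15))
                .enableAdaptiveRefreshTrigger(RefreshTrigger.UNCOVERED_SLOT)
                .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(5))
                .build();

        return ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build();
    }
}
```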

wmxl commented 8 months ago

A node can be valid if it doesn't hold any slots, e.g. for Pub/Sub usage.

I see.

I wondered whether it would make sense to make PartitionConsensus a public API so you could configure something in ClusterClientOptions.

It would be better to prevent the client from picking up incorrect cluster nodes information from a newly added node in the first place; moreover, such a view could even end up as the final determined partitions.

Alternatively, you can enable adaptive topology refresh. Before the exception is thrown, the adaptive trigger UNCOVERED_SLOT is emitted and you can reduce the refresh period without touching periodic refresh.

We have enabled adaptive topology refresh by calling enableAllAdaptiveRefreshTriggers(), but it did not seem to help when the PartitionSelectorException was triggered by the periodic refresh. I'm not sure whether this is because they share the same timeout setting (we set both to 15s).
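For comparison, a sketch of how the periodic refresh period and the adaptive-trigger timeout can be tuned independently in ClusterTopologyRefreshOptions; the 2-second adaptive timeout is only an illustrative value, not a maintainer recommendation:

```java
import java.time.Duration;

import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

class SeparateTimeoutsSketch {

    static ClusterTopologyRefreshOptions separateTimeouts() {
        return ClusterTopologyRefreshOptions.builder()
                // Periodic refresh keeps running every 15 seconds.
                .enablePeriodicRefresh(Duration.ofSeconds(15))
                // All adaptive triggers (including UNCOVERED_SLOT) stay enabled ...
                .enableAllAdaptiveRefreshTriggers()
                // ... but adaptive refreshes are rate-limited by their own, shorter timeout.
                .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(2))
                .build();
    }
}
```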