redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License
5.38k stars 968 forks source link

SentinelConnector set KnownNodes empty during redis failover, which causes app could not access redis any more #3017

Open jiajunsu opened 5 days ago

jiajunsu commented 5 days ago

Bug Report

Current Behavior

In redis sentinel mode, lettuce may refresh topology nodes to empty, if the connection to redis sentinel closed just after lettuce received TopologyRefreshMessage. And that will cause lettuce cannot recover anymore until received next TopologyRefreshMessage.

We have two redis nodes, and redis is running in sentinel mode. Assume the redis nodes' names are redis1 and redis2, and redis1 is master at the beginning, we inject errors as below:

  1. ifconfig down network at redis1, and wait the redis failover to redis2 finished
  2. ifconfig up network at redis1, and wait the redis failover to redis1 again
  3. ifconfig down network at redis2

At step3, redis sentinel send +sdown sentinel ... to lettuce client, and trigger lettuce executing method SentinelConnector::getTopologyRefreshRunnable. While at the same time, the connection between lettuce client and redis sentinel is closed, and then error occured.

lettuce logs below

[WARN] io.lettuce.core.RedisClient:605 Cannot connect Redis Sentinel at redis://*:*: java.util.concurrent.CompletionException: io.netty.channel.StacklessClosedChannelException
[WARN] io.lettuce.core.masterreplica.SentinelConnector:117 Topology refresh returned no nodes from redis-sentinel://**@*:*,*:*?sentinelMasterId=mymaster

And since that, the app could not handle any read and write operation because the knownNodes is empty.

org.springframework.data.redis.redisSystemException: Redis exception; nested exception is io.lettuce.core.RedisException: Cannot determine a node to read (Known nodes: []) while setting io.lettuce.core.ReadFromImpl$ReadFromUpstreamPreferred

org.springframework.data.redis.redisSystemException: Redis exception; nested exception is io.lettuce.core.RedisException: Master is currently unknown: []

By reviewing the git log, we've found this problem existed since commit Do not fail Master/Slave topology refresh if a node is not available. And it still exists at branch master;

When Requests#getRequest return null or future.isDone() returns false, it'll clear MasterReplicaConnectionProvider#knownNodes and the lettuce client could not get recovered untill receiving next TopologyRefreshMessage, or the lettuce client process was restarted.

src/main/java/io/lettuce/core/masterreplica/Requests.java ```java protected void onEmit(Emission> emission) { List result = new ArrayList<>(); Map latencies = new HashMap<>(); for (RedisNodeDescription node : nodes) { TimedAsyncCommand future = getRequest(node.getUri()); if (future == null || !future.isDone()) { // if this expression is true for all nodes, it'll clear MasterReplicaConnectionProvider#knownNodes continue; } RedisNodeDescription redisNodeDescription = findNodeByUri(nodes, node.getUri()); latencies.put(redisNodeDescription, future.duration()); result.add(redisNodeDescription); } SortAction sortAction = SortAction.getSortAction(); sortAction.sort(result, new LatencyComparator(latencies)); emission.success(result); } ```

Environment

Possible Solution

We've checked the lettuce code, the code in SentinelConnector::getTopologyRefreshRunnable may be improved, to avoid setting empty list to knownNodes.

src/main/java/io/lettuce/core/masterreplica/SentinelConnector.java ```java private Runnable getTopologyRefreshRunnable(MasterReplicaTopologyRefresh refresh, MasterReplicaConnectionProvider connectionProvider) { return () -> { try { LOG.debug("Refreshing topology"); refresh.getNodes(redisURI).subscribe(nodes -> { EventRecorder.getInstance().record(new MasterReplicaTopologyChangedEvent(redisURI, nodes)); if (nodes.isEmpty()) { LOG.warn("Topology refresh returned no nodes from {}", redisURI); return; // Add return here, to avoid setKnownNodes with empty nodes. } LOG.debug("New topology: {}", nodes); connectionProvider.setKnownNodes(nodes); }, t -> LOG.error("Error during background refresh", t)); } catch (Exception e) { LOG.error("Error during background refresh", e); } }; } ```
jiajunsu commented 3 days ago

Here is the modification to reproduce this issue.

https://github.com/redis/lettuce/commit/01f96a677b9036bfd04afb8c392c7f32cc1f49c4

Stack trace ```bash [INFO] Running io.lettuce.core.masterreplica.MyTest 2024-10-18 17:11:17 [ERROR] [main] TEST: getNodes called count 1 (MasterReplicaTopologyRefresh:60) 2024-10-18 17:11:18 [ERROR] [lettuce-epollEventLoop-4-1] TEST: refresh getNodes initialNodes [RedisMasterReplicaNode [redisURI=redis://localhost:6482?timeout=10s, role=UPSTREAM], RedisMasterReplicaNode [redisURI=redis://localhost:6483?timeout=10s, role=REPLICA]] (MasterReplicaTopologyRefresh:65) 2024-10-18 17:11:19 [ERROR] [lettuce-epollEventLoop-4-2] TEST: nodeList after ping [RedisMasterReplicaNode [redisURI=redis://localhost:6483?timeout=10s, role=REPLICA], RedisMasterReplicaNode [redisURI=redis://localhost:6482?timeout=10s, role=UPSTREAM]] (MasterReplicaTopologyRefresh:85) 2024-10-18 17:11:19 [ERROR] [main] TEST: +sdown 1 (MyTest:47) 2024-10-18 17:11:24 [ERROR] [lettuce-eventExecutorLoop-1-4] TEST: getNodes called count 2 (MasterReplicaTopologyRefresh:60) 2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-6] TEST: refresh getNodes initialNodes [RedisMasterReplicaNode [redisURI=redis://localhost:6482?timeout=10s, role=UPSTREAM], RedisMasterReplicaNode [redisURI=redis://localhost:6483?timeout=10s, role=REPLICA]] (MasterReplicaTopologyRefresh:65) 2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-8] TEST: inject requestPing failing (MasterReplicaTopologyRefresh:75) 2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-8] TEST: Requests node redis://localhost:6482?timeout=10s set future to null (Requests:63) 2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-8] TEST: Requests node redis://localhost:6483?timeout=10s set future to null (Requests:63) 2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-8] TEST: nodeList after ping [] (MasterReplicaTopologyRefresh:85) 2024-10-18 17:11:24 [WARN ] [lettuce-epollEventLoop-4-8] Topology refresh returned no nodes from redis-sentinel://127.0.0.1,127.0.0.1:26380?sentinelMasterId=mymaster&timeout=10s (SentinelConnector:126) 2024-10-18 17:11:29 [ERROR] [main] TEST: +sdown 2 (MyTest:50) 2024-10-18 17:11:30 [ERROR] [main] TEST: +sdown 3 (MyTest:53) 2024-10-18 17:11:34 [ERROR] [lettuce-eventExecutorLoop-1-1] TEST: getNodes called count 3 (MasterReplicaTopologyRefresh:60) 2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-9] TEST: refresh getNodes initialNodes [RedisMasterReplicaNode [redisURI=redis://localhost:6482?timeout=10s, role=UPSTREAM], RedisMasterReplicaNode [redisURI=redis://localhost:6483?timeout=10s, role=REPLICA]] (MasterReplicaTopologyRefresh:65) 2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-10] TEST: inject requestPing failing (MasterReplicaTopologyRefresh:75) 2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-10] TEST: Requests node redis://localhost:6482?timeout=10s set future to null (Requests:63) 2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-10] TEST: Requests node redis://localhost:6483?timeout=10s set future to null (Requests:63) 2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-10] TEST: nodeList after ping [] (MasterReplicaTopologyRefresh:85) 2024-10-18 17:11:34 [WARN ] [lettuce-epollEventLoop-4-10] Topology refresh returned no nodes from redis-sentinel://127.0.0.1,127.0.0.1:26380?sentinelMasterId=mymaster&timeout=10s (SentinelConnector:126) 2024-10-18 17:11:40 [ERROR] [main] TEST: known nodes [] (MyTest:58) [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 26.33 s <<< FAILURE! -- in io.lettuce.core.masterreplica.MyTest [ERROR] io.lettuce.core.masterreplica.MyTest.refreshEmptyNodes -- Time elapsed: 24.42 s <<< ERROR! io.lettuce.core.RedisException: Master is currently unknown: [] at io.lettuce.core.masterreplica.MasterReplicaConnectionProvider.getMaster(MasterReplicaConnectionProvider.java:305) at io.lettuce.core.masterreplica.MyTest.refreshEmptyNodes(MyTest.java:60) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.util.ArrayList.forEach(ArrayList.java:1541) at java.base/java.util.ArrayList.forEach(ArrayList.java:1541) ```
tishun commented 2 days ago

Hey @jiajunsu ,

Thanks for the verbose analysis. Before we jump into the way the Lettuce driver works let's first analyse the scenario and environment you are working with and why sentinel returns an empty list of nodes in the first place.

How many sentinel processes are you using to test this scenario? Can you describe your complete environment?

jiajunsu commented 9 hours ago

@tishun

  1. With 2 sentinel, 1 upstream and 1 replica nodes running.
  2. The empty list of nodes is not replied by sentinel, I've tested that possibility, and found that empty-list could not be responsed from sentinel(even if sentinel response empty list, lettuce code could not reach the WARN log Topology refresh returned no nodes).
    code in src/main/java/io/lettuce/core/masterreplica/MasterReplicaTopologyRefresh.java
    public Mono<List<RedisNodeDescription>> getNodes(RedisURI seed) {
        CompletableFuture<List<RedisNodeDescription>> future = topologyProvider.getNodesAsync();

        Mono<List<RedisNodeDescription>> initialNodes = Mono.fromFuture(future).doOnNext(nodes -> {
            applyAuthenticationCredentials(nodes, seed);
        });
        return initialNodes.map(this::getConnections)  // if initialNodes is empty, map `getConnections` will be skipped
                .flatMap(asyncConnections -> asyncConnections.asMono(seed.getTimeout(), eventExecutors))
                ...
    }

And the empty list was made from connections.requestPing(). The detail is:

  1. sentinel send +sdown to lettuce client.
  2. lettuce try to get nodes list from sentinel and success.
  3. lettuce try to PING the nodes, and at the same time, the redis nodes are unreachable from network.
  4. at last we get an empty list in SentinelConnector#getTopologyRefreshRunnable

This could be reproduced by running MyTest in my forked repo and commit. Just make the test environment by make prepare and make start, and the testcase use the sentinel in lettuce unittest environment running with port 26379 and 26380, while the upstream-replica ports are 6482 and 6483.