Open jiajunsu opened 5 days ago
Here is the modification to reproduce this issue.
https://github.com/redis/lettuce/commit/01f96a677b9036bfd04afb8c392c7f32cc1f49c4
Hey @jiajunsu ,
Thanks for the verbose analysis. Before we jump into the way the Lettuce driver works let's first analyse the scenario and environment you are working with and why sentinel returns an empty list of nodes in the first place.
How many sentinel processes are you using to test this scenario? Can you describe your complete environment?
@tishun
Topology refresh returned no nodes
).
```java
public Mono<List<RedisNodeDescription>> getNodes(RedisURI seed) {
    CompletableFuture<List<RedisNodeDescription>> future = topologyProvider.getNodesAsync();
    Mono<List<RedisNodeDescription>> initialNodes = Mono.fromFuture(future).doOnNext(nodes -> {
        applyAuthenticationCredentials(nodes, seed);
    });
    return initialNodes.map(this::getConnections) // if initialNodes is empty, map `getConnections` will be skipped
            .flatMap(asyncConnections -> asyncConnections.asMono(seed.getTimeout(), eventExecutors))
    ...
}
```
And the empty list was produced by `connections.requestPing()`. In detail:
- `+sdown` is sent to the lettuce client.
- Lettuce `PING`s the nodes, and at the same time the redis nodes are unreachable from the network.
- `SentinelConnector#getTopologyRefreshRunnable`
This can be reproduced by running `MyTest` in my forked repo and commit. Set up the test environment with `make prepare` and `make start`. The test case uses the sentinels from the lettuce unit-test environment running on ports `26379` and `26380`, while the upstream-replica ports are `6482` and `6483`.
Bug Report
Current Behavior
In redis sentinel mode, lettuce may refresh its topology nodes to an empty list if the connection to redis sentinel is closed just after lettuce receives a TopologyRefreshMessage. Lettuce then cannot recover until it receives the next TopologyRefreshMessage.
We have two redis nodes, and redis is running in sentinel mode. Assume the redis nodes are named redis1 and redis2, and redis1 is the master at the beginning. We inject errors as follows:
At step 3, redis sentinel sends `+sdown sentinel ...` to the lettuce client, which triggers lettuce to execute the method `SentinelConnector::getTopologyRefreshRunnable`. At the same time, the connection between the lettuce client and redis sentinel is closed, and an error occurs. The lettuce logs are below:
From then on, the app cannot handle any read or write operation because `knownNodes` is empty.
By reviewing the git log, we found that this problem has existed since the commit "Do not fail Master/Slave topology refresh if a node is not available", and it still exists on the master branch.
When `Requests#getRequest` returns `null` or `future.isDone()` returns false for every node, it clears `MasterReplicaConnectionProvider#knownNodes`, and the lettuce client cannot recover until it receives the next TopologyRefreshMessage or the lettuce client process is restarted.

src/main/java/io/lettuce/core/masterreplica/Requests.java
```java
protected void onEmit(Emission<List<RedisNodeDescription>> emission) {
    List<RedisNodeDescription> result = new ArrayList<>();
    Map<RedisNodeDescription, Long> latencies = new HashMap<>();
    for (RedisNodeDescription node : nodes) {
        TimedAsyncCommand<String, String, String> future = getRequest(node.getUri());
        if (future == null || !future.isDone()) {
            // if this expression is true for all nodes, it'll clear MasterReplicaConnectionProvider#knownNodes
            continue;
        }
        RedisNodeDescription redisNodeDescription = findNodeByUri(nodes, node.getUri());
        latencies.put(redisNodeDescription, future.duration());
        result.add(redisNodeDescription);
    }
    SortAction sortAction = SortAction.getSortAction();
    sortAction.sort(result, new LatencyComparator(latencies));
    emission.success(result);
}
```
```
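To make the failure mode easier to see outside of lettuce, here is a minimal standalone sketch of the same skip condition. It is not lettuce code: the class `EmptyTopologySketch`, the method `completedNodes`, and the use of plain `String` node names with `CompletableFuture` are assumptions made purely for illustration. It only shows that when every node's request is either missing or not yet done, the resulting list is empty.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration only; names and types are not part of lettuce.
class EmptyTopologySketch {

    // Mirrors the loop in onEmit: a node is kept only if its PING request
    // exists and has already completed.
    static List<String> completedNodes(List<String> nodes, Map<String, CompletableFuture<String>> pings) {
        List<String> result = new ArrayList<>();
        for (String node : nodes) {
            CompletableFuture<String> future = pings.get(node);
            if (future == null || !future.isDone()) {
                continue; // skipped: no request registered, or no reply yet
            }
            result.add(node);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("redis1", "redis2");
        Map<String, CompletableFuture<String>> pings = new HashMap<>();

        // redis1: request sent but never answered; redis2: no request at all.
        pings.put("redis1", new CompletableFuture<>());
        System.out.println("all unreachable: " + completedNodes(nodes, pings)); // prints "all unreachable: []"

        // Once one node answers, it is the only one that survives the filter.
        pings.put("redis2", CompletableFuture.completedFuture("PONG"));
        System.out.println("one reachable: " + completedNodes(nodes, pings)); // prints "one reachable: [redis2]"
    }
}
```

In the first case the caller receives an empty list, which corresponds to the situation described above where `knownNodes` ends up empty.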
Environment
Possible Solution
We've checked the lettuce code. The code in `SentinelConnector::getTopologyRefreshRunnable` could be improved to avoid setting an empty list to `knownNodes`.

src/main/java/io/lettuce/core/masterreplica/SentinelConnector.java
```java private Runnable getTopologyRefreshRunnable(MasterReplicaTopologyRefresh refresh, MasterReplicaConnectionProvider