redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License
5.35k stars 960 forks source link

Refresh Sentinel topology after full connectivity outage #2007

Open cprayer opened 2 years ago

cprayer commented 2 years ago

Bug Report

Current Behavior

Lettuce 6.1.6 RELEASE implementation of SentinelConnector updates sentinel topology when it is first connect or failover event is occurred.

Let's suppose failover event is occurred when all of sentinel psubscribe connection has been disconnected(because of network issue like network partition between lettuce client and sentinel, etc..)

There is no available sentinel psubscribe connection. So it has no idea to receive failover event.

After sentinel psubscribe connections are reconnected by ConnectionWatchDog, ConnectionProvider still has stale node information until new failover event is occurred. Because ConnectionWatchDog just reconnects sentinel connections and executes psubscribe * command.

Input Code

https://github.com/cprayer/lettuce-sentinel-topology-issue.git

Current configuration of StatefulRedisMasterReplicaConnection is below.

    @Bean
    public StatefulRedisMasterReplicaConnection<byte[], byte[]> masterReplicaConnection() {
        RedisClient redisClient = RedisClient.create();
        redisClient.setOptions(ClientOptions.builder()
                .socketOptions(SocketOptions.builder()
                        .keepAlive(SocketOptions.KeepAliveOptions.builder()
                                .idle(Duration.ofSeconds(3))
                                .interval(Duration.ofSeconds(3))
                                .count(3)
                                .build())
                        .build())
                .build());

        StatefulRedisMasterReplicaConnection<byte[], byte[]> connection = MasterReplica.connect(redisClient, ByteArrayCodec.INSTANCE, RedisURI.builder()
                        .withSentinel("127.0.0.1", 26379)
                        .withSentinelMasterId("mymaster")
                .build());

        connection.setReadFrom(ReadFrom.MASTER);
        return connection;
    }

It uses one sentinel connection for reproducing issue.

  1. Run jibDockerBuild gradle task
  2. Run docker run -p 8080:8080 -d demo:0.0.1-SNAPSHOT (if you are using m1, run docker run --platform linux/amd64 -p 8080:8080 -d demo:0.0.1-SNAPSHOT)
  3. Run curl 127.0.0.1:8080/stop (It stops redis sentinel that has 26379 port)
  4. If you found Connection refused: /127.0.0.1:26379 message in application log(docker logs ${CONTAINER_ID}), Run curl 127.0.0.1:8080/failover (It sends manual failover event to redis sentinel that has 26380 port, sentinel topology is not refreshed because there is no available sentinel psubscribe connection)
  5. Run curl 127.0.0.1:8080/start (It starts redis sentinel that has 26379 port, sentinel psubscribe connection is reconnected by ConnectionWatchDog)
  6. Run curl 127.0.0.1:8080 (It sends SET HELLO WORLD command to redis master, but it failed with READONLY You can't write against a read only replica. Because ConnectionProvider has stale node information)

Expected behavior/code

I think sentinel topology also should be refreshed when sentinel psubscribe connection is reconnected by connectionWatchDog.

Environment

mp911de commented 2 years ago

Topology discovery is based on Pub/Sub after the initial connect. The case above is pretty specific to an entire outage and we haven't considered that case. Instead, the underlying assumption is that we have connectivity to at least one Sentinel node.

Supposedly the most difficult bit could be identifying a full outage and then triggering a topology refresh upon connection activation.

rodrigo-tsuru commented 2 years ago

@cprayer , you can disable auto reconnection client.setOptions(ClientOptions.builder() .autoReconnect(false) .build()); and run validateConnection() as self-healing: lettuceConnectionFactory.validateConnection() to validate/recreate the connection.

Be aware of the side effects: https://github.com/spring-projects/spring-data-redis/issues/2179

tishun commented 1 month ago

Moving this to the Icebox as it is very specific. If you are hitting this please let us know so we can consider reprioritizing.