spring-projects / spring-data-redis

Provides support to increase developer productivity in Java when using Redis, a key-value store. Uses familiar Spring concepts such as template classes for core API usage and lightweight repository-style data access.
https://spring.io/projects/spring-data-redis/
Apache License 2.0

Redis Sentinel Failover bug #1952

Closed AlexeiZenin closed 3 years ago

AlexeiZenin commented 3 years ago

Overview: LettuceConnectionFactory as configured by Spring Boot does not switch over to the new master, leading to downtime unless the application is rebooted.

Observed in Spring Boot 2.2.6.RELEASE + Redis 6.0.

Reproduction Conditions

Summary: Causing the Redis master to hang by executing redis-cli -p <master_port> DEBUG sleep 9999999 to simulate a failure, as described in the official Redis docs (https://redis.io/topics/sentinel#testing-the-failover), breaks Spring Data Redis: it cannot recover to the newly elected master.

When simply killing the Redis master process (SIGKILL), the failover is successful and Spring Data Redis is able to switch to the new master almost immediately.

Scenario 1 - Spring fails to recover to new master even when failover was completed by Sentinel

  1. Clone this repro repo: https://github.com/AlexeiZenin/spring-data-redis-example
  2. Run mvn clean package
  3. Run docker-compose up --build (Runs Bitnami Redis 6.0)
  4. A Redis Sentinel setup with a master and replica should be running, confirm by playing with the sample UI available at http://localhost:8092/
  5. Run redis-cli -p 8000 DEBUG sleep 99999999 in one terminal
  6. Now try to use the application or view the health endpoint (http://localhost:8092/actuator/health). Both should fail (one-minute timeout errors occur indefinitely). The only fix is to restart the Spring Boot app, leading to downtime (when running in the cloud this happens automatically, as the failing health endpoint causes your provider to kill all your app instances).

In the application logs you should start seeing these messages when using the UI/actuator endpoint:

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 minute(s)

One can confirm that the failover happened by viewing the logs of the sentinel:

...
1:X 26 Jan 2021 20:54:14.425 # +sdown master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.425 # +odown master mymaster 192.168.0.2 6379 #quorum 1/1
1:X 26 Jan 2021 20:54:14.425 # +new-epoch 1
1:X 26 Jan 2021 20:54:14.425 # +try-failover master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.433 # +vote-for-leader a107db8f0a1b25ab6791aef82c5b98d9c029b390 1
1:X 26 Jan 2021 20:54:14.433 # +elected-leader master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.433 # +failover-state-select-slave master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.518 # +selected-slave slave 192.168.0.3:6379 192.168.0.3 6379 @ mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.519 * +failover-state-send-slaveof-noone slave 192.168.0.3:6379 192.168.0.3 6379 @ mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.572 * +failover-state-wait-promotion slave 192.168.0.3:6379 192.168.0.3 6379 @ mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:15.473 # +promoted-slave slave 192.168.0.3:6379 192.168.0.3 6379 @ mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:15.473 # +failover-state-reconf-slaves master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:15.530 # +failover-end master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:15.530 # +switch-master mymaster 192.168.0.2 6379 192.168.0.3 6379
1:X 26 Jan 2021 20:54:15.532 * +slave slave 192.168.0.2:6379 192.168.0.2 6379 @ mymaster 192.168.0.3 6379
1:X 26 Jan 2021 20:54:45.569 # +sdown slave 192.168.0.2:6379 192.168.0.2 6379 @ mymaster 192.168.0.3 6379

And confirming the state of the replica which is now the master:

> redis-cli -p 8001
127.0.0.1:8001> role
1) "master"
2) (integer) 208364
3) (empty array)
127.0.0.1:8001> 

And looking at the sentinel:

> redis-cli -p 26379
127.0.0.1:26379> role
1) "sentinel"
2) 1) "mymaster"
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "192.168.0.3"
    5) "port"
    6) "6379"
    7) "runid"
    8) "54a76846f0d893918cbce1b6c544a396bc98e298"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "904"
   19) "last-ping-reply"
   20) "904"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "53"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "150785"
   29) "config-epoch"
   30) "1"
   31) "num-slaves"
   32) "1"
   33) "num-other-sentinels"
   34) "0"
   35) "quorum"
   36) "1"
   37) "failover-timeout"
   38) "180000"
   39) "parallel-syncs"
   40) "1"

Scenario 2 - Use docker kill to simulate failure, recovers successfully

Follow the same steps as above, but for step 5 do the following instead:

docker kill <container_id_of_master>

where <container_id_of_master> can be found via docker ps, in the row that has redis in the Names column.
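For example, assuming the master container's name contains "redis" (the exact name pattern depends on your docker-compose project, so this filter is illustrative):

```shell
# List matching containers; the first column is the ID to pass to `docker kill`.
docker ps --filter "name=redis" --format "{{.ID}}  {{.Names}}"
```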

Once the master is killed, the application recovers, receives the Sentinel update about the new master, and continues to function, unlike in Scenario 1.

Summary + Extra Context

mp911de commented 3 years ago

Spring Data Redis keeps long-lived connections when using Lettuce. As a consequence, Sentinel is used to look up the master node, and Lettuce connects to that node until the connection gets disconnected. Failovers do not have any effect unless the master crashes or the Redis server terminates the connection so that Lettuce attempts to reconnect. Upon reconnect, Lettuce uses Sentinel again to discover the master node.

If you want to follow changes in topology, you need to configure ReadFrom. When using Spring Boot, you can achieve that by implementing LettuceClientConfigurationBuilderCustomizer.
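A minimal sketch of such a customizer, assuming Spring Boot's auto-configured LettuceConnectionFactory; the bean name and the particular ReadFrom setting chosen here are illustrative:

```java
import io.lettuce.core.ReadFrom;
import org.springframework.boot.autoconfigure.data.redis.LettuceClientConfigurationBuilderCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RedisReadFromConfig {

    // Setting ReadFrom switches Lettuce into its master/replica mode, in which
    // it tracks topology changes via Sentinel instead of pinning a single
    // connection to whichever node was master at startup.
    @Bean
    public LettuceClientConfigurationBuilderCustomizer readFromCustomizer() {
        return builder -> builder.readFrom(ReadFrom.MASTER_PREFERRED);
    }
}
```

MASTER_PREFERRED is one possible choice; REPLICA_PREFERRED and others trade read consistency for load distribution differently.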

Let us know whether you require further assistance or whether we can close this ticket.

AlexeiZenin commented 3 years ago

> Spring Data Redis keeps long-lived connections when using Lettuce. As a consequence, Sentinel is used to look up the master node, and Lettuce connects to that node until the connection gets disconnected. Failovers do not have any effect unless the master crashes or the Redis server terminates the connection so that Lettuce attempts to reconnect. Upon reconnect, Lettuce uses Sentinel again to discover the master node.
>
> If you want to follow changes in topology, you need to configure ReadFrom. When using Spring Boot, you can achieve that by implementing LettuceClientConfigurationBuilderCustomizer.
>
> Let us know whether you require further assistance or whether we can close this ticket.

Thank you for the quick reply and the potential fix for this issue. Is there an example that I could follow? I did not see this documented anywhere in the official documentation for Spring Data Redis (https://docs.spring.io/spring-data/data-redis/docs/current/reference/html/#redis:sentinel). Are there any performance implications in doing this?

In terms of this issue, I believe that this should automatically be handled by Spring Data Redis/Spring Boot (either through auto-configuration or a better failover detection algorithm). When one reads the official documentation, which states: "For dealing with high-availability Redis, Spring Data Redis has support for Redis Sentinel, using RedisSentinelConfiguration", one does not expect the implementation to work only 66% of the time (2 of 3 failure scenarios covered) out of the box.

Possible solutions I see:

mp911de commented 3 years ago

I think the general misconception is what HA means. Here, highly available describes the ability to reach the server to issue Redis commands. Once the server gets disconnected and some sort of failover happens (the server comes back up, the topology changes), the application will resume operations.

Any manual interference isn't subject to these guarantees. You could argue in the same way that you change the master name and expect the infrastructure components to accommodate such a change. The implementation needs to make certain assumptions, and in a single-connection arrangement that assumption is to obtain a connection once and use it until it gets disconnected. Similarly, a long-lived Jedis connection won't react to a Sentinel failover unless a new connection is obtained from the connection pool.

You can strip down Spring Data Redis to use neither pooling nor long-lived connections, ensuring you always query Sentinel for the active master node. In consequence, your topology view is always fairly recent, at the price of creating short-lived connections.
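One way to sketch that stripped-down setup (the master name and Sentinel address are illustrative; setShareNativeConnection(false) makes LettuceConnectionFactory open a fresh native connection per operation instead of reusing one shared connection):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisSentinelConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Configuration
public class ShortLivedRedisConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration()
                .master("mymaster")
                .sentinel("127.0.0.1", 26379); // illustrative Sentinel address

        LettuceConnectionFactory factory = new LettuceConnectionFactory(sentinelConfig);
        // Disable the shared native connection so each operation obtains its own
        // connection and therefore re-queries Sentinel for the current master.
        factory.setShareNativeConnection(false);
        return factory;
    }
}
```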

AlexeiZenin commented 3 years ago

> I think the general misconception is what HA means. Here, highly available describes the ability to reach the server to issue Redis commands. Once the server gets disconnected and some sort of failover happens (the server comes back up, the topology changes), the application will resume operations.
>
> Any manual interference isn't subject to these guarantees. You could argue in the same way that you change the master name and expect the infrastructure components to accommodate such a change. The implementation needs to make certain assumptions, and in a single-connection arrangement that assumption is to obtain a connection once and use it until it gets disconnected. Similarly, a long-lived Jedis connection won't react to a Sentinel failover unless a new connection is obtained from the connection pool.
>
> You can strip down Spring Data Redis to use neither pooling nor long-lived connections, ensuring you always query Sentinel for the active master node. In consequence, your topology view is always fairly recent, at the price of creating short-lived connections.

I completely agree with your definition of HA. Being able to use the correct server at any time, so that users see no downtime, is critical.

There is no manual interference going on in what I witnessed in an AWS production environment. The scenario I outlined above merely replicates what we saw happen in production, which caused several minutes of downtime. The server process hung, and the HA guarantee of Spring Data Redis was violated per the definition above (except that Redis Sentinel actually detected the failure and recovered, while Spring Data Redis did not).

Given this, it seems there is a design oversight with the long-lived connection in that it cannot receive Sentinel updates after the connection is established. It seems to me and others I have spoken to that it is technically feasible to implement a "smarter" connection to be able to react to topology updates during the long-lived connection and reconnect to the new master if need be.

I see 2 approaches for a "smarter" long-lived connection:

mp911de commented 3 years ago

Taking a step back, Lettuce has a mode where Sentinel is used to look up the entire topology (master/replica) and commands are routed to the currently active master node by listening for topology changes. This mode is enabled by configuring ReadFrom. Since it comes with additional resource usage, it is not enabled by default.
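At the plain Lettuce level (outside Spring), that mode looks roughly like the following; the Sentinel address is illustrative, and exact class names vary by Lettuce version (in 5.x the API was called MasterSlave rather than MasterReplica):

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterreplica.MasterReplica;
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection;

public class SentinelTopologyExample {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create();
        RedisURI sentinelUri = RedisURI.Builder
                .sentinel("127.0.0.1", 26379, "mymaster") // illustrative address
                .build();

        // MasterReplica.connect discovers the full topology via Sentinel and keeps
        // following +switch-master events for the lifetime of the connection.
        StatefulRedisMasterReplicaConnection<String, String> connection =
                MasterReplica.connect(client, StringCodec.UTF8, sentinelUri);
        connection.setReadFrom(ReadFrom.MASTER_PREFERRED);

        connection.sync().set("key", "value");
        connection.close();
        client.shutdown();
    }
}
```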

mp911de commented 3 years ago

Since this ticket is not actionable and belongs more in the category of documentation, I'm closing it.

If you would like us to look at this issue, please provide the additional information and we will re-open the issue.