spring-projects / spring-data-redis

Provides support to increase developer productivity in Java when using Redis, a key-value store. Uses familiar Spring concepts such as template classes for core API usage and lightweight repository-style data access.
https://spring.io/projects/spring-data-redis/
Apache License 2.0

Redis Sentinel Failover bug #1952

Closed AlexeiZenin closed 3 years ago

AlexeiZenin commented 3 years ago

Overview: LettuceConnectionFactory as configured by Spring Boot does not switch over to the new master, leading to downtime unless the application is rebooted.

Observed in Spring Boot 2.2.6.RELEASE + Redis 6.0.

Reproduction Conditions

Summary: Causing the Redis master to hang by executing redis-cli -p <master_port> DEBUG sleep 9999999 to simulate a failure, as described in the official Redis docs (https://redis.io/topics/sentinel#testing-the-failover), breaks Spring Data Redis: it cannot recover to the newly elected master.

When simply killing the Redis master process (SIGKILL), the failover is successful and Spring Data Redis is able to switch to the new master almost immediately.

Scenario 1 - Spring fails to recover to new master even when failover was completed by Sentinel

  1. Clone this repro repo: https://github.com/AlexeiZenin/spring-data-redis-example
  2. Run mvn clean package
  3. Run docker-compose up --build (Runs Bitnami Redis 6.0)
  4. A Redis Sentinel setup with a master and replica should be running, confirm by playing with the sample UI available at http://localhost:8092/
  5. Run redis-cli -p 8000 DEBUG sleep 99999999 in one terminal
  6. Now try to use the application or view the health endpoint (http://localhost:8092/actuator/health). Both should fail (one-minute timeout errors occur indefinitely). The only fix is to restart the Spring Boot app, leading to downtime (when running in the cloud this happens automatically, as the failing health endpoint causes your provider to kill all your app instances).

In the application logs you should start seeing these messages when using the UI/actuator endpoint:

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 minute(s)

One can confirm that the failover happened by viewing the logs of the sentinel:

...
1:X 26 Jan 2021 20:54:14.425 # +sdown master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.425 # +odown master mymaster 192.168.0.2 6379 #quorum 1/1
1:X 26 Jan 2021 20:54:14.425 # +new-epoch 1
1:X 26 Jan 2021 20:54:14.425 # +try-failover master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.433 # +vote-for-leader a107db8f0a1b25ab6791aef82c5b98d9c029b390 1
1:X 26 Jan 2021 20:54:14.433 # +elected-leader master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.433 # +failover-state-select-slave master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.518 # +selected-slave slave 192.168.0.3:6379 192.168.0.3 6379 @ mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.519 * +failover-state-send-slaveof-noone slave 192.168.0.3:6379 192.168.0.3 6379 @ mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:14.572 * +failover-state-wait-promotion slave 192.168.0.3:6379 192.168.0.3 6379 @ mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:15.473 # +promoted-slave slave 192.168.0.3:6379 192.168.0.3 6379 @ mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:15.473 # +failover-state-reconf-slaves master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:15.530 # +failover-end master mymaster 192.168.0.2 6379
1:X 26 Jan 2021 20:54:15.530 # +switch-master mymaster 192.168.0.2 6379 192.168.0.3 6379
1:X 26 Jan 2021 20:54:15.532 * +slave slave 192.168.0.2:6379 192.168.0.2 6379 @ mymaster 192.168.0.3 6379
1:X 26 Jan 2021 20:54:45.569 # +sdown slave 192.168.0.2:6379 192.168.0.2 6379 @ mymaster 192.168.0.3 6379

And confirming the state of the replica which is now the master:

> redis-cli -p 8001
127.0.0.1:8001> role
1) "master"
2) (integer) 208364
3) (empty array)
127.0.0.1:8001> 

And looking at the sentinel:

> redis-cli -p 26379
127.0.0.1:26379> role
1) "sentinel"
2) 1) "mymaster"
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "192.168.0.3"
    5) "port"
    6) "6379"
    7) "runid"
    8) "54a76846f0d893918cbce1b6c544a396bc98e298"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "904"
   19) "last-ping-reply"
   20) "904"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "53"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "150785"
   29) "config-epoch"
   30) "1"
   31) "num-slaves"
   32) "1"
   33) "num-other-sentinels"
   34) "0"
   35) "quorum"
   36) "1"
   37) "failover-timeout"
   38) "180000"
   39) "parallel-syncs"
   40) "1"

Scenario 2 - Use docker kill to simulate failure, recovers successfully

Follow the same steps as above, but for step 5 do the following instead:

docker kill <container_id_of_master>

where <container_id_of_master> can be found via docker ps, in the row that has redis in the Names column.
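For example, assuming the master container's name contains "redis" (the exact name pattern depends on your docker-compose project, so this filter is illustrative):

```shell
# List matching containers; the first column is the ID to pass to `docker kill`.
docker ps --filter "name=redis" --format "{{.ID}}  {{.Names}}"
```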

Once the master is killed, the application recovers, receives the Sentinel update about the new master, and continues to function, unlike in Scenario 1.

Summary + Extra Context

mp911de commented 3 years ago

Spring Data Redis keeps long-lived connections when using Lettuce. As a consequence, Sentinel is used to look up the master node, and Lettuce connects to that node until the connection gets disconnected. Failovers do not have any effect unless the master crashes or the Redis server terminates the connection so that Lettuce attempts to reconnect. Upon reconnect, Lettuce uses Sentinel again to discover the master node.

If you want to follow changes in topology, you need to configure ReadFrom. When using Spring Boot, you can achieve that by implementing LettuceClientConfigurationBuilderCustomizer.
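A minimal sketch of such a customizer, assuming Spring Boot's auto-configured LettuceConnectionFactory; the bean name and the particular ReadFrom setting chosen here are illustrative:

```java
import io.lettuce.core.ReadFrom;
import org.springframework.boot.autoconfigure.data.redis.LettuceClientConfigurationBuilderCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RedisReadFromConfig {

    // Setting ReadFrom switches Lettuce into its master/replica mode, in which
    // it tracks topology changes via Sentinel instead of pinning a single
    // connection to whichever node was master at startup.
    @Bean
    public LettuceClientConfigurationBuilderCustomizer readFromCustomizer() {
        return builder -> builder.readFrom(ReadFrom.MASTER_PREFERRED);
    }
}
```

MASTER_PREFERRED is one possible choice; REPLICA_PREFERRED and others trade read consistency for load distribution differently.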

Let us know whether you require further assistance or whether we can close this ticket.

AlexeiZenin commented 3 years ago

> Spring Data Redis keeps long-lived connections when using Lettuce. As a consequence, Sentinel is used to look up the master node, and Lettuce connects to that node until the connection gets disconnected. Failovers do not have any effect unless the master crashes or the Redis server terminates the connection so that Lettuce attempts to reconnect. Upon reconnect, Lettuce uses Sentinel again to discover the master node.
>
> If you want to follow changes in topology, you need to configure ReadFrom. When using Spring Boot, you can achieve that by implementing LettuceClientConfigurationBuilderCustomizer.
>
> Let us know whether you require further assistance or whether we can close this ticket.

Thank you for the quick reply and the potential fix for this issue. Is there an example that I could follow? I did not see this documented anywhere in the official documentation for Spring Data Redis (https://docs.spring.io/spring-data/data-redis/docs/current/reference/html/#redis:sentinel). Are there any performance implications in doing this?

In terms of this issue, I believe that this should automatically be handled by Spring Data Redis/Spring Boot (either through auto-configuration or a better failover detection algorithm). When one reads the official documentation, which states: "For dealing with high-availability Redis, Spring Data Redis has support for Redis Sentinel, using RedisSentinelConfiguration", one does not expect the implementation to work only 66% of the time (2 of 3 failure scenarios covered) out of the box.

Possible solutions I see:

mp911de commented 3 years ago

I think the general misconception is what HA means. Here, highly available describes the ability to reach the server to issue Redis commands. Once the server gets disconnected and some sort of failover happens (the server comes back up, the topology changes), the application will resume operations.

Any manual interference isn't subject to these guarantees. You could argue in the same way that you change the master name and expect the infrastructure components to accommodate such a change. The implementation needs to make certain assumptions, and in a single-connection arrangement that assumption is to obtain a connection once and use it until it gets disconnected. Similarly, a long-lived Jedis connection won't react to a Sentinel failover unless a new connection is obtained from the connection pool.

You can strip down Spring Data Redis to use neither pooling nor long-lived connections, ensuring you always query Sentinel for the active master node. In consequence, your topology view is always fairly recent, at the price of creating short-lived connections.
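One way to sketch that stripped-down setup (the master name and Sentinel address are illustrative; setShareNativeConnection(false) makes LettuceConnectionFactory open a fresh native connection per operation instead of reusing one shared connection):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisSentinelConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Configuration
public class ShortLivedRedisConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration()
                .master("mymaster")
                .sentinel("127.0.0.1", 26379); // illustrative Sentinel address

        LettuceConnectionFactory factory = new LettuceConnectionFactory(sentinelConfig);
        // Disable the shared native connection so each operation obtains its own
        // connection and therefore re-queries Sentinel for the current master.
        factory.setShareNativeConnection(false);
        return factory;
    }
}
```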

AlexeiZenin commented 3 years ago

> I think the general misconception is what HA means. Here, highly available describes the ability to reach the server to issue Redis commands. Once the server gets disconnected and some sort of failover happens (the server comes back up, the topology changes), the application will resume operations.
>
> Any manual interference isn't subject to these guarantees. You could argue in the same way that you change the master name and expect the infrastructure components to accommodate such a change. The implementation needs to make certain assumptions, and in a single-connection arrangement that assumption is to obtain a connection once and use it until it gets disconnected. Similarly, a long-lived Jedis connection won't react to a Sentinel failover unless a new connection is obtained from the connection pool.
>
> You can strip down Spring Data Redis to use neither pooling nor long-lived connections, ensuring you always query Sentinel for the active master node. In consequence, your topology view is always fairly recent, at the price of creating short-lived connections.

I completely agree with your definition of HA. Being able to use the correct server at any time, so that users see no downtime, is critical.

There is no manual interference going on in what I witnessed in an AWS production environment. The scenario I outlined above merely replicates what we saw happen in production, which caused several minutes of downtime. The server process hung, and the HA guarantee of Spring Data Redis was violated per the definition above (except that Redis Sentinel actually detected the failure and recovered, while Spring Data Redis did not).

Given this, it seems there is a design oversight with the long-lived connection in that it cannot receive Sentinel updates after the connection is established. It seems to me and others I have spoken to that it is technically feasible to implement a "smarter" connection to be able to react to topology updates during the long-lived connection and reconnect to the new master if need be.

I see 2 approaches for a "smarter" long-lived connection:

mp911de commented 3 years ago

Taking a step back, Lettuce has a mode where Sentinel is used to look up the entire topology (master/replica) and commands are routed to the currently active master node by listening for topology changes. This mode is enabled by configuring ReadFrom. Since it comes with additional resource usage, it is not enabled by default.
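At the plain Lettuce level (outside Spring), that mode looks roughly like the following; the Sentinel address is illustrative, and exact class names vary by Lettuce version (in 5.x the API was called MasterSlave rather than MasterReplica):

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterreplica.MasterReplica;
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection;

public class SentinelTopologyExample {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create();
        RedisURI sentinelUri = RedisURI.Builder
                .sentinel("127.0.0.1", 26379, "mymaster") // illustrative address
                .build();

        // MasterReplica.connect discovers the full topology via Sentinel and keeps
        // following +switch-master events for the lifetime of the connection.
        StatefulRedisMasterReplicaConnection<String, String> connection =
                MasterReplica.connect(client, StringCodec.UTF8, sentinelUri);
        connection.setReadFrom(ReadFrom.MASTER_PREFERRED);

        connection.sync().set("key", "value");
        connection.close();
        client.shutdown();
    }
}
```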

mp911de commented 3 years ago

Since this ticket is not actionable and belongs more in the category of documentation, I'm closing it.

If you would like us to look at this issue, please provide the additional information and we will re-open the issue.