be-hase opened this issue 4 years ago
Thanks for the report. We need to fix the timeout issue when a Sentinel connection fails.
After looking into this issue, the problem arises from the fact that the connect was successful but Sentinel failed to respond within the timeout. The client code assumes that once a connection is established, Sentinel is functional. By the time we query Sentinel, we no longer have access to the connection progress (i.e., which hosts were tried, which failed, and so on), as we operate on an existing connection.
The entire mechanism is asynchronous, which makes it complex to fix the issue properly. For now, please enable PING on connect via ClientOptions (`ClientOptions.builder().pingBeforeActivateConnection(true).build()`). This issues a PING command during the connect phase to ensure that Redis responds properly, so we get the guarantee that at least at the time the connection is created, the Sentinel is alive. Unhealthy/unresponsive nodes are skipped, which increases the chance of hitting a Sentinel node that is able to properly reply with the master address.
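As a concrete sketch of that workaround (hostnames, port, and the master name "mymaster" are placeholders; API shape as in Lettuce 5.x, where `pingBeforeActivateConnection` is off by default):

```java
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;

public class SentinelPingExample {
    public static void main(String[] args) {
        // Placeholder Sentinel address and monitored master name
        RedisURI uri = RedisURI.Builder
                .sentinel("sentinel-host", 26379, "mymaster")
                .build();

        RedisClient client = RedisClient.create(uri);
        // PING each connection during the connect phase, so a Sentinel
        // that accepts TCP connections but never replies is detected
        // and skipped instead of being treated as healthy
        client.setOptions(ClientOptions.builder()
                .pingBeforeActivateConnection(true)
                .build());

        try (StatefulRedisConnection<String, String> connection = client.connect()) {
            System.out.println(connection.sync().ping());
        }
        client.shutdown();
    }
}
```

Note that the PING only validates liveness at connect time; it does not protect against a Sentinel that stops responding later.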
Thank you for the detailed investigation. Keep pingBeforeActivateConnection enabled until this is fixed. (It seems difficult to fix because the mechanism is asynchronous...)
Bug Report
Current Behavior & Input Code
My product uses Sentinel's master node discovery.
https://github.com/lettuce-io/lettuce-core/wiki/Redis-Sentinel#sentinel.redis-discovery-using-redis-sentinel
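For context, my setup is roughly the following sketch (hostnames and the master name "mymaster" are placeholders): the client is given a list of Sentinels and asks them for the current master address.

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;

public class SentinelDiscoveryExample {
    public static void main(String[] args) {
        // Placeholder hosts; lettuce queries the listed Sentinels
        // for the address of the master named "mymaster"
        RedisURI uri = RedisURI.Builder
                .sentinel("sentinel1", 26379, "mymaster")
                .withSentinel("sentinel2", 26379)
                .withSentinel("sentinel3", 26379)
                .build();

        RedisClient client = RedisClient.create(uri);
        try (StatefulRedisConnection<String, String> connection = client.connect()) {
            System.out.println(connection.sync().ping());
        }
        client.shutdown();
    }
}
```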
The other day, the Redis Sentinel node (a VM) became unable to return responses due to a hypervisor failure, and timeout errors began to occur. I expected lettuce to perform the failover; however, failover failed. Only when the Sentinel node's VM shut down completely did failover succeed.
(According to the Redis and Redis Sentinel logs, the failover itself completed immediately.)
I ran various tests. When the Sentinel node is completely down and the connection is refused, failover succeeds. However, if timeout errors occur without the Sentinel node going down completely, failover does not seem to happen.
Take application startup as an example. For a Spring Boot app like this, lettuce skips the first unreachable Sentinel node and starts the application properly.
However, when a timeout error occurs instead, startup fails with an error. Why not move on to the next Sentinel node?
(The timeout error is generated using toxiproxy.)
I also tested after the application had started. Using toxiproxy, I made the Sentinel node that lettuce is connected to produce timeout errors, and then brought down the master node. Failover fails, and the application keeps trying to connect to the downed master node forever.
On the other hand, when the Sentinel node goes down and a connect error occurs, the master node is discovered via the next Sentinel node and the failover succeeds.
Expected behavior/code
If a timeout error occurs for a Sentinel node, fall back to the next Sentinel node, just as already happens for a connect error.
Setting pingBeforeActivateConnection to true does seem to work around the issue, and that is the workaround I applied.