Open SeungWook-Han opened 3 years ago
Thanks for reaching out. I don't think there's a reason that Lettuce should implement a circuit breaker. Instead, circuit breakers should be part of the application or a composition library such as Project Reactor.
Totally agree with @mp911de point. I personally prefer to use resilience4j for that purposes in my projects on top of reactive or kotlin coroutines APIs. It also has ratelimiting, retrying and other useful features.
Thanks for the insights. Closing as we don't plan to add such a feature.
When only one of hundreds or thousands of Redis hosts is dead for tens of seconds during its failover, or when there are hotspots in requests, "a circuit breaker per (Redis) node" is quite useful: instead of waiting out multi-second timeouts, end users keep seeing a constant few milliseconds of latency. In that situation, applying a circuit breaker to the whole Redis client could cause many more failures than the per-node approach when 99.9% of the Redis nodes are healthy.
We (LINE) heavily use "a circuit breaker per Redis shard" with Lettuce and Jedis for in-house Redis clusters whose cluster structure is stored in ZooKeeper with client-side sharding. Because our internal code knows the cluster structure, the client can throw an exception before calling the actual Lettuce/Jedis methods when the target node is marked dead in the client's in-memory data.
However, with Redis "Cluster" mode, user code can know which key maps to which slot, but it cannot know which slot lives on which node, even though that mapping is in Lettuce's memory. So it would be nice if user code could frequently read Lettuce's internal slot-to-node mapping as read-only state, or if you would allow adding this feature so the per-node approach can be used with a circuit breaker for Redis I/O. What do you think?
Well, actually I think it is better to hide Redis node information from the user. Exposing only the concept of a slot, without requiring knowledge of the nodes, suits the purpose of Redis Cluster. To implement a per-node circuit breaker without exposing node information to users, it would be better to implement the circuit breaker feature inside the client, that is, in Lettuce.
Isn't reject commands on disconnect what you're looking for? I agree that queuing and timeout may hurt performance. With reject on disconnect, you can fail fast and catch `RedisException`. If that helps, we could even introduce a more specific exception type.
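For readers landing here later, the "reject on disconnect" behavior mentioned above is a client option. A minimal sketch, assuming Lettuce's public `ClientOptions` API (the URI is illustrative):

```java
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;

public class RejectOnDisconnect {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost");
        // Fail fast with a RedisException instead of buffering commands
        // while the connection is down.
        client.setOptions(ClientOptions.builder()
                .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                .build());
    }
}
```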
Thanks for the reply.
As you explained, with reject on disconnect, a request may fail fast.
However, circuit state should exist per Redis node, so to manage it, an application would need to know which Redis node is in an unavailable state from the `RedisException`.
I wonder whether it is worth exposing the Redis node to the application.
I will reopen the ticket to foster discussion around that topic. It has a certain aspect of usefulness but I think it requires a few more iterations before we arrive at a point that would be actionable in some way.
I think the benefit of supporting a circuit breaker is quite clear. As you already mentioned, the queuing that makes a bad situation worse under intensive request load can be alleviated by a circuit breaker.
And it is difficult to get the same effect as a circuit breaker from the disconnect exception alone: you may want to fail fast when the timeout is set long but the Redis node is not in a good state. A circuit breaker lets you decide how long the fail-fast state is kept, allowing finer configuration of Lettuce.
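resilience4j, mentioned earlier in the thread, already exposes exactly these knobs. A sketch, assuming its public `CircuitBreakerConfig` API (the thresholds, durations, and breaker name are illustrative):

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class BreakerTuning {
    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                          // % of failed calls before opening
                .slowCallDurationThreshold(Duration.ofMillis(100)) // count slow calls as failures
                .slowCallRateThreshold(50)
                .waitDurationInOpenState(Duration.ofSeconds(10))   // how long the fail-fast state is kept
                .build();
        CircuitBreaker breaker = CircuitBreaker.of("redis-node-1", config);
    }
}
```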
Generally speaking, you can already obtain flags (e.g. failed node) from `Partitions` and verify whether a node connection is established by checking `StatefulConnection.isOpen()`. That should allow you to orchestrate Lettuce externally.
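To make that concrete, a sketch of the external orchestration, assuming Lettuce's public cluster APIs (`SlotHash`, `Partitions`, `StatefulRedisClusterConnection`); the key and URI are illustrative:

```java
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.SlotHash;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

public class ClusterInspection {
    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient.create("redis://localhost");
        StatefulRedisClusterConnection<String, String> connection = client.connect();

        // key -> slot -> node: the slot-to-node mapping discussed above
        int slot = SlotHash.getSlot("user:42");
        RedisClusterNode owner = client.getPartitions().getPartitionBySlot(slot);

        // topology flags and connection state, per node
        for (RedisClusterNode node : client.getPartitions()) {
            boolean flaggedFailed = node.getFlags().contains(RedisClusterNode.NodeFlag.FAIL)
                    || node.getFlags().contains(RedisClusterNode.NodeFlag.EVENTUAL_FAIL);
            boolean open = connection.getConnection(node.getNodeId()).isOpen();
            // feed flaggedFailed / open into a per-node breaker decision here
        }
    }
}
```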
Thanks for the guidance. Please give me a couple of days to check whether the circuit breaker is implementable with the existing interfaces.
I also have this problem: when a node goes down, the application blocks for some time until Lettuce triggers a topology refresh.
I recognize this is an old issue, but I'd like to add a little more information and respond to a couple points from earlier:
Isn't reject commands on disconnected what you're looking for? I agree that queuing and timeout may hurt performance. With reject on disconnect, you can fail fast and catch RedisException. If that helps, we could even introduce a more specific exception type.
Generally speaking, you can already obtain flags (e.g. failed node) from Partitions and verify whether a node connection is established by checking StatefulConnection.isOpen().
I think these are both valid points, but lately we've been dealing with irritating cases where our upstream provider has some kind of network hiccup and we see a sharp spike in latency when communicating with a single shard in a larger cluster.
In our setup, we have a number of application servers hosted by a major cloud provider, and that cloud provider also provides a managed Redis service through which we have created a medium-to-large-sized cluster. When we see latency spikes to a single shard in that managed cluster, the connection remains open and appears healthy except for the latency spikes. When this happens, requests to that shard begin timing out for some amount of time—perhaps on the order of a minute.
Using an external circuit breaker allows us to fail fast in the case where the connection is ~healthy~ open, but requests are timing out regardless. The trouble is that we currently need to apply the breaker to the whole cluster even if only a small part of it is experiencing some kind of distress. That means that all requests to the cluster fail for some time when really only a small percentage ought to fail. Having some mechanism to put a breaker on each shard would make an enormous difference in our case.
My main point here: failing fast on connection state is good in many cases, but there are also (very real, non-hypothetical, happening-right-now!) cases that connection state won't address.
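To make the per-shard idea concrete, here is a minimal, dependency-free sketch of what we'd like to be able to do: one breaker per node ID, so a distressed shard fails fast while the rest of the cluster stays available. All names here are hypothetical; nothing in this sketch is a Lettuce API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-node breaker registry: each Redis node ID gets its own
// failure counter, so an unhealthy shard trips only its own breaker.
final class PerNodeBreaker {

    enum State { CLOSED, OPEN }

    private static final int FAILURE_THRESHOLD = 3;
    private static final long OPEN_DURATION_MS = 10_000;

    private static final class NodeState {
        int consecutiveFailures;
        long openedAtMs;
        State state = State.CLOSED;
    }

    private final Map<String, NodeState> nodes = new ConcurrentHashMap<>();

    /** Returns false while this node's breaker is open (fail fast, no I/O). */
    synchronized boolean allowRequest(String nodeId, long nowMs) {
        NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        if (s.state == State.OPEN) {
            if (nowMs - s.openedAtMs >= OPEN_DURATION_MS) {
                s.state = State.CLOSED;      // half-open simplified to closed
                s.consecutiveFailures = 0;
                return true;
            }
            return false;
        }
        return true;
    }

    synchronized void recordSuccess(String nodeId) {
        nodes.computeIfAbsent(nodeId, k -> new NodeState()).consecutiveFailures = 0;
    }

    synchronized void recordFailure(String nodeId, long nowMs) {
        NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        if (++s.consecutiveFailures >= FAILURE_THRESHOLD) {
            s.state = State.OPEN;
            s.openedAtMs = nowMs;
        }
    }
}
```

With this shape, requests to a failing shard are rejected locally for `OPEN_DURATION_MS` while the other shards keep their normal latency.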
Thanks for your consideration!
For the time being, a workaround can be adding a `ChannelHandler` in front of `CommandHandler` via `NettyCustomizer`. The main advantage is that a `ChannelHandler` doesn't depend on the Redis setup (Standalone vs. Cluster), and you can wrap `RedisCommand`s and also reject these instead of writing commands to the transport.
Probably that would be a neat hook for any kind of circuit breakers as the underlying channel is isolated from all other cluster members. However, the disconnected command buffer of the endpoint needs to be considered, ideally by rejecting commands if the connection is disconnected.
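A sketch of that hook, assuming Lettuce's public `NettyCustomizer` and `ClientResources` APIs; `breakerIsOpenFor` is a hypothetical predicate standing in for real breaker state:

```java
import java.net.SocketAddress;

import io.lettuce.core.RedisClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.channel.Channel;
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;

public class BreakerHandlerSketch {

    // Hypothetical: look up breaker state for the node behind this address.
    static boolean breakerIsOpenFor(SocketAddress address) {
        return false;
    }

    public static void main(String[] args) {
        ClientResources resources = ClientResources.builder()
                .nettyCustomizer(new NettyCustomizer() {
                    @Override
                    public void afterChannelInitialized(Channel channel) {
                        // addLast places the handler nearest the pipeline tail, so
                        // outbound writes pass through it before reaching
                        // CommandHandler and the transport.
                        channel.pipeline().addLast("perNodeBreaker", new ChannelDuplexHandler() {
                            @Override
                            public void write(ChannelHandlerContext ctx, Object msg,
                                              ChannelPromise promise) throws Exception {
                                if (breakerIsOpenFor(ctx.channel().remoteAddress())) {
                                    promise.setFailure(new IllegalStateException("circuit open"));
                                    return; // command is rejected, never written
                                }
                                super.write(ctx, msg, promise);
                            }
                        });
                    }
                })
                .build();
        RedisClient client = RedisClient.create(resources, "redis://localhost");
    }
}
```

Since each channel maps to a single cluster member, the breaker state attached this way is naturally isolated per node.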
> For the time being, a workaround can be adding a `ChannelHandler` in front of `CommandHandler` via `NettyCustomizer`.
Interesting! We'll look into this—thank you for the suggestion!
For the curious, we’ve implemented a circuit breaker using resilience4j, following @mp911de’s suggestion.
There are two things to note:

- The breaker is independent of the `isConnected()` check—we didn't want to close the channel for only transient disruptions.
- The exception is not expected by `DefaultEndpoint`, and this results in a lot of logs when the breaker is open. We considered wrapping it in `ClosedChannelException`, but this seems to be a violation of the `Channel` state machine semantics, since the circuit breaker condition might not be a permanent closure.

For this, I've filed #2830.
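For context, the shape of that integration roughly looks like the following sketch (assuming resilience4j's public API; the breaker name and key are illustrative). The breaker decorates the call site, so Lettuce itself stays unaware of it:

```java
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.sync.RedisCommands;

public class GuardedCall {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost");
        RedisCommands<String, String> commands = client.connect().sync();

        CircuitBreaker breaker = CircuitBreaker.ofDefaults("redis-shard-1");
        Supplier<String> guardedGet =
                CircuitBreaker.decorateSupplier(breaker, () -> commands.get("user:42"));
        // guardedGet.get() throws CallNotPermittedException while the breaker is open
    }
}
```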
Feature Request
Hello. Lettuce doesn't have a CircuitBreaker feature, so I'm going to add it. Before I start working, I want to ask one question: have you ever considered adding a CircuitBreaker before? If so, is there a reason you didn't add it at that time? Knowing that will be helpful when implementing the CircuitBreaker feature.