redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License

CircuitBreaker #1530

Open · SeungWook-Han opened this issue 3 years ago

SeungWook-Han commented 3 years ago

Feature Request

Hello. Lettuce doesn't have a CircuitBreaker feature, so I'm going to add it. Before I start working, I want to ask one question: have you ever considered applying a CircuitBreaker before? If so, is there a reason why you didn't apply it at that time? Knowing that would be helpful when implementing the CircuitBreaker feature.

mp911de commented 3 years ago

Thanks for reaching out. I don't think there's a reason that Lettuce should implement a circuit breaker. Instead, circuit breakers should be part of the application or a composition library such as Project Reactor.

sokomishalov commented 3 years ago

Totally agree with @mp911de's point. I personally prefer to use resilience4j for that purpose in my projects, on top of the reactive or Kotlin coroutines APIs. It also has rate limiting, retries, and other useful features.
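
For illustration, a minimal sketch of that approach on top of Lettuce's sync API, assuming an already connected `RedisCommands` handle (the class name and null fallback are illustrative):

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.lettuce.core.api.sync.RedisCommands;

class GuardedRedisReads {

    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("redis");

    /** Reads a key through the breaker; falls back to null while the breaker is open. */
    String get(RedisCommands<String, String> commands, String key) {
        try {
            return breaker.executeSupplier(() -> commands.get(key));
        } catch (CallNotPermittedException e) {
            // Breaker is open: fail fast without touching Redis.
            return null;
        }
    }
}
```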

mp911de commented 3 years ago

Thanks for the insights. Closing as we don't plan to add such a feature.

jongyeol commented 3 years ago

When only one of hundreds or thousands of Redis hosts is dead for tens of seconds during its failover, or when there are hotspots in the requests, a "circuit breaker per (Redis) node" is quite useful: instead of waiting a few seconds for timeouts, you can keep a constant few-milliseconds latency for end users. In that situation, applying the circuit breaker to the whole Redis client could cause far more failures than the per-node approach when 99.9% of the Redis nodes are healthy.

We (LINE) heavily use a "circuit breaker per Redis shard" with Lettuce and Jedis for in-house Redis clusters whose cluster structure is kept in ZooKeeper with client-side sharding. Because our internal code knows the cluster structure, the client can throw an exception before calling the actual Lettuce/Jedis methods when the target node is marked dead in the client's in-memory data.

However, with Redis Cluster mode, user code can know which key is in which slot, but it doesn't know which slot is on which node, even though that information is in Lettuce's memory. So it would be nice if user code could frequently read Lettuce's internal slot-to-node mapping as read-only state, or if you would allow adding this feature so the per-node approach can be used with a circuit breaker for Redis I/O. What do you think?
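
For reference, a minimal sketch of such a read-only lookup against the public cluster API, assuming an already connected `StatefulRedisClusterConnection` (the helper class and method name are illustrative):

```java
import java.nio.charset.StandardCharsets;

import io.lettuce.core.cluster.SlotHash;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

class SlotToNodeLookup {

    /** Resolves the cluster node currently serving the slot that the given key hashes to. */
    RedisClusterNode nodeForKey(StatefulRedisClusterConnection<String, String> connection, String key) {
        int slot = SlotHash.getSlot(key.getBytes(StandardCharsets.UTF_8));
        return connection.getPartitions().getPartitionBySlot(slot);
    }
}
```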

SeungWook-Han commented 3 years ago

Well, actually I think it is better to hide Redis node information from the user, because letting the user work only with the concept of a slot, without needing to know about nodes, fits the purpose of Redis Cluster. To avoid exposing node information to users while still providing a per-node circuit breaker, it is better to implement the CircuitBreaker feature inside the client, that is, Lettuce.

mp911de commented 3 years ago

Isn't rejecting commands when disconnected what you're looking for? I agree that queuing and timeouts may hurt performance. With reject-on-disconnect, you can fail fast and catch RedisException. If that helps, we could even introduce a more specific exception type.
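
For context, a minimal sketch of enabling that behavior through ClientOptions (the connection URI is a placeholder):

```java
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;

class RejectOnDisconnect {

    static RedisClient createClient() {
        RedisClient client = RedisClient.create("redis://localhost");
        // Fail commands immediately while the connection is down instead of buffering them.
        client.setOptions(ClientOptions.builder()
                .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                .build());
        return client;
    }
}
```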

SeungWook-Han commented 3 years ago

Thanks for the reply. As you explained, with reject-on-disconnect the request can fail fast. However, the circuit state should exist per Redis node, so in order to manage it the application would need to know, from the RedisException, which Redis node is in an unavailable state. I wonder whether it is worth exposing the Redis node to the application.

mp911de commented 3 years ago

I will reopen the ticket to foster discussion around that topic. It has a certain aspect of usefulness but I think it requires a few more iterations before we arrive at a point that would be actionable in some way.

SeungWook-Han commented 3 years ago

I think the benefit of supporting a circuit breaker is quite clear. As you already mentioned, the queuing through which intensive request load makes the situation worse can be alleviated by a circuit breaker.

And it is difficult to get the same effect as a circuit breaker from the disconnect exception alone, because you may want to fail quickly when the Redis node is not in a good state even though you have configured a long timeout. A circuit breaker lets you determine how long the fail-fast state is kept, allowing finer-grained configuration of Lettuce.
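
For illustration, that kind of tuning in an application-level breaker (resilience4j here; the thresholds and durations are arbitrary):

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

class BreakerTuning {

    static CircuitBreaker redisBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open once half of the recorded calls fail
                .slidingWindowSize(20)                           // judge over the last 20 calls
                .waitDurationInOpenState(Duration.ofSeconds(5))  // how long to stay fail-fast before probing again
                .build();
        return CircuitBreaker.of("redis", config);
    }
}
```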

mp911de commented 3 years ago

Generally speaking, you can already obtain flags (e.g. failed node) from Partitions and verify whether a node connection is established by checking StatefulConnection.isOpen(). That should allow you to orchestrate Lettuce externally.
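
A small sketch of that external check, assuming a connected `StatefulRedisClusterConnection` (the helper class is illustrative; the flags and methods are the existing cluster API):

```java
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

class NodeHealthCheck {

    /** True when the node carries a FAIL flag or its node connection is not open. */
    boolean isSuspect(StatefulRedisClusterConnection<String, String> connection, RedisClusterNode node) {
        boolean flaggedAsFailed = node.is(RedisClusterNode.NodeFlag.FAIL)
                || node.is(RedisClusterNode.NodeFlag.EVENTUAL_FAIL);
        // getConnection() returns the dedicated connection for that node id.
        return flaggedAsFailed || !connection.getConnection(node.getNodeId()).isOpen();
    }
}
```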

SeungWook-Han commented 3 years ago

Thanks for the guidance. Please give me a couple of days to check whether the circuit breaker is implementable with the existing interfaces.

liuluo129 commented 1 year ago

I also have this problem: when a node goes down, the application blocks for some time until Lettuce triggers a topology refresh.

jon-signal commented 7 months ago

I recognize this is an old issue, but I'd like to add a little more information and respond to a couple points from earlier:

Isn't reject commands on disconnected what you're looking for? I agree that queuing and timeout may hurt performance. With reject on disconnect, you can fail fast and catch RedisException. If that helps, we could even introduce a more specific exception type.

Generally speaking, you can already obtain flags (e.g. failed node) from Partitions and verify whether a node connection is established by checking StatefulConnection.isOpen().

I think these are both valid points, but lately we've been dealing with irritating cases where our upstream provider has some kind of network hiccup and we see a sharp spike in latency when communicating with a single shard in a larger cluster.

In our setup, we have a number of application servers hosted by a major cloud provider, and that cloud provider also provides a managed Redis service through which we have created a medium-to-large-sized cluster. When we see latency spikes to a single shard in that managed cluster, the connection remains open and appears healthy except for the latency spikes. When this happens, requests to that shard begin timing out for some amount of time—perhaps on the order of a minute.

Using an external circuit breaker allows us to fail fast in the case where the connection is ~healthy~ open, but requests are timing out regardless. The trouble is that we currently need to apply the breaker to the whole cluster even if only a small part of it is experiencing some kind of distress. That means that all requests to the cluster fail for some time when really only a small percentage ought to fail. Having some mechanism to put a breaker on each shard would make an enormous difference in our case.
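
To make the per-shard idea concrete, a rough sketch of what we mean (not an existing Lettuce feature): keep one resilience4j breaker per cluster node, keyed by node id, and resolve the node for a key via the slot-to-node lookup shown earlier in the thread:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

class PerShardBreakers {

    private final CircuitBreakerRegistry registry = CircuitBreakerRegistry.ofDefaults();

    /** One breaker per cluster node, keyed by node id. */
    CircuitBreaker breakerFor(RedisClusterNode node) {
        return registry.circuitBreaker(node.getNodeId());
    }
}
```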

My main point here: failing fast on connection state is good in many cases, but there are also (very real, non-hypothetical, happening-right-now!) cases that connection state won't address.

Thanks for your consideration!

mp911de commented 6 months ago

For the time being, a workaround can be to add a ChannelHandler in front of CommandHandler via NettyCustomizer. The main advantage is that a ChannelHandler doesn't depend on the Redis setup (Standalone vs. Cluster); you can wrap RedisCommands and also reject them instead of writing commands to the transport.

That would probably be a neat hook for any kind of circuit breaker, as the underlying channel is isolated from all other cluster members. However, the disconnected command buffer of the endpoint needs to be considered, ideally by rejecting commands if the connection is disconnected.
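
A rough sketch of that hook, assuming a resilience4j breaker per channel; where the handler ends up relative to CommandHandler depends on how the pipeline is assembled, so treat this as a starting point rather than a drop-in implementation:

```java
import java.util.concurrent.TimeUnit;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.lettuce.core.RedisException;
import io.lettuce.core.protocol.RedisCommand;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelOutboundHandlerAdapter;
import io.netty.channel.ChannelPromise;

public class CircuitBreakerNettyCustomizer implements NettyCustomizer {

    @Override
    public void afterBootstrapInitialized(Bootstrap bootstrap) {
        // No bootstrap-level customization needed for this sketch.
    }

    @Override
    public void afterChannelInitialized(Channel channel) {
        // Each channel talks to exactly one node, so a breaker created here is per-node by construction.
        CircuitBreaker breaker = CircuitBreaker.ofDefaults("redis-channel-" + channel.id());

        channel.pipeline().addLast(new ChannelOutboundHandlerAdapter() {
            @Override
            public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
                if (msg instanceof RedisCommand) {
                    if (!breaker.tryAcquirePermission()) {
                        // Breaker is open: reject instead of writing the command to the transport.
                        RedisException open = new RedisException("Circuit breaker open");
                        ((RedisCommand<?, ?, ?>) msg).completeExceptionally(open);
                        promise.setFailure(open);
                        return;
                    }
                    // Record the write outcome. This observes the write, not the Redis response;
                    // latency-based tripping would need to hook command completion instead.
                    long start = System.nanoTime();
                    promise.addListener((ChannelFutureListener) future -> {
                        long elapsed = System.nanoTime() - start;
                        if (future.isSuccess()) {
                            breaker.onSuccess(elapsed, TimeUnit.NANOSECONDS);
                        } else {
                            breaker.onError(elapsed, TimeUnit.NANOSECONDS, future.cause());
                        }
                    });
                }
                super.write(ctx, msg, promise);
            }
        });
    }

    /** Wiring: pass the customizer through ClientResources when creating the client. */
    public static ClientResources clientResources() {
        return ClientResources.builder()
                .nettyCustomizer(new CircuitBreakerNettyCustomizer())
                .build();
    }
}
```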

jon-signal commented 6 months ago

For the time being, a workaround can be adding a ChannelHandler in front of CommandHandler via NettyCustomizer.

Interesting! We'll look into this—thank you for the suggestion!

eager-signal commented 5 months ago

For the curious, we’ve implemented a circuit breaker using resilience4j, following @mp911de’s suggestion.

There are two things to note:

eager-signal commented 5 months ago

the exception is not expected by DefaultEndpoint

For this, I’ve filed #2830.