resilience4j / resilience4j

Resilience4j is a fault tolerance library designed for Java8 and functional programming
Apache License 2.0
9.73k stars 1.33k forks source link

Circuit breaker stuck in HALF_OPEN state #935

Closed bankuruprasannakumar closed 4 years ago

bankuruprasannakumar commented 4 years ago

Resilience4j version:0.15.0

Java version:11

Circuit breaker was stuck in HALF_OPEN state on one of production node. Not able to replicate this. Is there any known scenario in which this can happen and can we replicate this? https://github.com/resilience4j/resilience4j/issues/903 seems to be speaking on the similar lines.

RobWin commented 4 years ago

Hi, you are using an old version of Resiliencej4. There have been many bug fixes and improvements in the meantime.

Can you please provide more context. How do you use Resilience4j? Do you use a reactive operator?

KrnSaurabh commented 4 years ago

@RobWin Do we need to investigate something here or any help needed ?

RobWin commented 4 years ago

You could provide me some more context. How do you use Resilience4j? Do you have some logs? Do you catch exceptions in your code instead of propagating them?

KrnSaurabh commented 4 years ago

Hi @RobWin , sorry I didn't face this issue or reported it so I have no more context. I just volunteered to help since it was tagged as help wanted.

RobWin commented 4 years ago

Ah, thank you, We can think of a better solution how to solve the challenge that only a certain amount of calls should be permitted in HALF_OPEN state. See https://github.com/resilience4j/resilience4j/blob/master/resilience4j-circuitbreaker/src/main/java/io/github/resilience4j/circuitbreaker/internal/CircuitBreakerStateMachine.java#L956-L976

The concept of using an AtomicInteger counter does not seem to work perfectly, if a call is permitted but never seems to finish either successfully or with a failure. And the permit is also not canceled/released. In that case the CircuitBreaker is stuck in HALF_OPEN state. But this points to an issue in the application itself.

KrnSaurabh commented 4 years ago

Hi @RobWin if a call never responds back with a success or failure how CircuitBreaker should treat those calls ? state transition takes place only based on aggregated outcome of the calls if completed successfully or failed or failed with ignored exception. Should CircuitBreaker ignore those calls who never responds with success or failure and release permission ?

Currently CB infinitely waits for those stuck call to complete and hence possible to get stuck in Half Open state, can we put any timeout or wait duration on Half Open state as well, similar to waitDurationInOpenState so that once the timeout is over we aggregate the calls outcome marking incomplete calls as success, failure or ignored and make a state transition to Open or Close state.

RobWin commented 4 years ago

Thats a good question. 1) We introduce a new configuration parameter that the test calls in HALF_OPEN state must happen in a certain period of time. 2) We have to introduce a timeout interval per test call and if the test calls times out, we have to increase the AtomicInteger again. But then we have to track all test calls. That sucks :(

KrnSaurabh commented 4 years ago

Yes agree. so I guess we can call off option 2. Can we think of a new configuration parameter that the test calls in HALF_OPEN state must happen in a certain period of time and a default value for it ?

RobWin commented 4 years ago

We can also think about implementing our own data structure. The data structure has initially a fixed amount of N free tokens. Every permission reduces the amount of free tokens and increases the amount of used tokens. The used tokens are stored in a hash map that is sorted by time. Tokens with timestamps beyond a threshold are freed up again. Whenever a new permission request comes in, the data structure checks if a token can be freed up again, but only if all free tokens are used.. If a token was used, that means the call was recorded as asuccess or failure, it cannot be reused again and is deleted from the hashmap of used token.

What do you think? I think it's better than restricing the time a CircuitBreaker is allowed to be in HALF_OPEN state.

KrnSaurabh commented 4 years ago

@RobWin If I understand correctly, In any case we need to define the two things :

KrnSaurabh commented 4 years ago

Hi @RobWin would like to know your suggestion on above. Thanks.

RobWin commented 4 years ago

Sry, I had vacation and had to look after my children.

RobWin commented 4 years ago

I just don't understand how calls can get lost :( I just means the application code (of the user) is not using any timeouts and some thread must get stuck.

KrnSaurabh commented 4 years ago

Yes right, application code must have not any mechanism to deal with unresponsive calls that resulted in stuck Half open state. I was wondering that the same problem could occur even when circuit is in closed state as it will allow all the calls and wait for their outcome infinitely. Let me try with an implementation as per what we have discussed and I will get back in case of further query.

hexmind commented 4 years ago

@KrnSaurabh @RobWin Could you remind me why waitDurationInHalfOpenState idea was rejected? In that simple solution each call in HalfOpenState would verify passage "waitDuration" on entry, before potentially unfinished execution.

RobWin commented 4 years ago

It just feels strange to restrict the HALF_OPEN state to a period of time. Let's say you configure waitDurationInHalfOpenState to 1 Minute. What happens to requests which are processed shortly before the period ends? Should we mark them as success or failure? It feels wrong to mark them as a failure/success without knowing the real result. This would change the behavior of the CircuitBreaker. If you set waitDurationInHalfOpenState to a larger value like 5 Minutes and a request is stuck, the CB might be stuck multiple minutes until the timeout duration is elapsed. A request timeout value per call feels "more correct", or?

hexmind commented 4 years ago

There is how I see it:

RobWin commented 4 years ago

The downside I still see is that, if you set waitDurationInHalfOpenState to a larger value like 5 Minutes and a request is stuck, the CB might be stuck multiple minutes until the timeout duration is elapsed.

hexmind commented 4 years ago

if you set waitDurationInHalfOpenState to a larger value (for some reason) it will work according to a larger value (as expected). Even getting stuck won't be a problem because a first request started after a time limit triggers a state transition (or earlier in case of permittedNumberOfCallsInHalfOpenState exhaustion).

roja-polukonda commented 2 years ago

Does this problem exists only in HALF_OPEN state? I experienced similar issue in OPEN_STATE and it is struck for hours in there.

RobWin commented 2 years ago

@rojapolukonda Nothing has been reported so far. Have you set automaticTransitionFromOpenToHalfOpenEnabled = true ? Otherwise the CircuitBreaker is by design always OPEN until the next request is processed.

roja-polukonda commented 2 years ago

@RobWin yes, automaticTransitionFromOpenToHalfOpen is enabled already configurations used: (considering waitTime in Half_open state picks priority) waitDurationInHalfOpenState( 20 sec) permittedNumberOfCallsInHalfOpenState (100) , slidingWindow(100), failureThreshold(50) fyi, I use circuit breaker on reactor operators

RobWin commented 2 years ago

I experienced similar issue in OPEN_STATE and it is struck for hours in there.

Did it change after some time or what did you do to solve it?

roja-polukonda commented 2 years ago

@RobWin Unfortunately, it didn't. Even the event publisher had some trouble and the state transition logs were missing (issue similar to this -> https://github.com/resilience4j/resilience4j/issues/1591)

//Sample Code: (kotlin)

  circuitBreaker.eventPublisher.onStateTransition { event: CircuitBreakerOnStateTransitionEvent ->
            logger.warn("CircuitBreaker StateTransition => ${event.stateTransition}")
        }

and to solve it, I had to do hard reset of circuit breaker(dirty fix though). Currently, I am not able to replicate issue either. but in future state, I do not want to do a hard reset so looking for suggestions here