resilience4j / resilience4j

Resilience4j is a fault tolerance library designed for Java8 and functional programming
Apache License 2.0
9.72k stars 1.33k forks

Resilience4j circuit breaker actuator health check metrics showing some negative numbers #1017

Open agnihp opened 4 years ago

agnihp commented 4 years ago

Resilience4j version: 1.4.0

Java version: 1.8.0_65

I am using the Resilience4j circuit breaker with Spring Boot. In the actuator health endpoint metrics, I am seeing discrepancies in the slow-call figures: the slow failed calls count is negative instead of positive, which is preventing my circuit breaker from opening. Can anyone explain what these negative values mean?

"endpoint1":{
           "status":"UP",
           "details":{
              "failureRate":"0.0%",
              "failureRateThreshold":"50.0%",
              "slowCallRate":"0.0%",
              "slowCallRateThreshold":"50.0%",
              "bufferedCalls":3500,
              "slowCalls":0,
              "slowFailedCalls":-2682,
              "failedCalls":0,
              "notPermittedCalls":0,
              "state":"CLOSED"
           }
        }

I am using Spring Boot 2.2.0 and Reactor Core 3.3.5, with an annotation-based circuit breaker. The issue appears under higher load.

bbarin commented 4 years ago

I'm also facing the same issue when exporting the metrics via Prometheus: the failure rate shows negative numbers for some of my circuit breakers.

bbarin commented 4 years ago

More info on this. It seems related to the code block below:

private float getFailureRate(Snapshot snapshot) {
    int bufferedCalls = snapshot.getTotalNumberOfCalls();
    if (bufferedCalls == 0 || bufferedCalls < minimumNumberOfCalls) {
        return -1.0f;
    }
    return snapshot.getFailureRate();
}

dc-dream11 commented 4 years ago

I am facing the same issue. Has anyone been able to solve this?

{"@timestamp":"2020-09-13T18:54:07.436Z", "log.level": "INFO", "message":"CircuitBreaker: CB1 | Successful call count: 0 | Failed call count: 4 | Failure rate %:-1.0 | Slow call count: 4 | Slow rate %:-1.0 | Slow failed call count: -206 | Slow success call count: 210 | State: CLOSED"}

dc-dream11 commented 4 years ago

Adding more details to the issue: I have added a circuit breaker around the MySQL database for slow calls. When my database connection is fine, I get the following state on a successful call.

{"@timestamp":"2020-09-13T20:38:47.572Z", "log.level": "INFO", "message":"CircuitBreaker: CB1 | Successful call count: 453 | Failed call count: 3 | Failure rate %:0.65789473 | Slow call count: 85 | Slow rate %:18.64035 | Slow failed call count: 3 | Slow success call count: 82 | State: CLOSED"}

After the above call, my database went down and my Vert.x application was unable to connect to it; the connection timeout is 5 seconds. Only one thread tries to get a connection from the pool. It timed out after 5 seconds, and I then got the following state. During these 5 seconds no other calls were executed, because they were waiting for getConnectionThread to become free.

{"@timestamp":"2020-09-13T20:38:52.573Z", "log.level": "INFO", "message":"CircuitBreaker: CB1 | Successful call count: 0 | Failed call count: 4 | Failure rate %:-1.0 | Slow call count: 4 | Slow rate %:-1.0 | Slow failed call count: 4 | Slow success call count: 0 | State: CLOSED"}

Something happened in this 5-second gap that made the metrics go negative. I have not been able to figure out the reason yet. @dlsrb6342, can you please help here?

RobWin commented 4 years ago

Failure rate and slow call rate are shown as -1.0 if the number of measured calls is below the minimum number of calls. A failure rate of 0 would be wrong in that case.
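
The sentinel behavior described here can be sketched in isolation. The following is a minimal, self-contained illustration and not Resilience4j's code: the class `RateSentinelSketch`, its method names, and the minimum of 10 calls are assumptions made for the example, and the logic only mirrors the `getFailureRate` snippet quoted earlier in this thread. It shows why a reported rate of -1.0 should be read as "not enough calls recorded yet" rather than as a real percentage.

```java
import java.util.OptionalDouble;

/** Hypothetical sketch; not part of Resilience4j. */
final class RateSentinelSketch {

    /** Sentinel returned when too few calls have been recorded. */
    static final float NOT_ENOUGH_DATA = -1.0f;

    /** Mirrors the quoted snippet: return the sentinel below the minimum number of calls. */
    static float failureRate(int failedCalls, int bufferedCalls, int minimumNumberOfCalls) {
        if (bufferedCalls == 0 || bufferedCalls < minimumNumberOfCalls) {
            return NOT_ENOUGH_DATA;
        }
        return failedCalls * 100.0f / bufferedCalls;
    }

    /** Maps the sentinel to an empty value so callers cannot mistake it for a rate. */
    static OptionalDouble asOptional(float rate) {
        return rate < 0 ? OptionalDouble.empty() : OptionalDouble.of(rate);
    }

    public static void main(String[] args) {
        // 4 failed calls out of 4, with an assumed minimum of 10 calls: no rate is available.
        System.out.println(asOptional(failureRate(4, 4, 10)));
        // 3 failed calls out of 456 recorded calls: a real percentage is returned.
        System.out.println(asOptional(failureRate(3, 456, 10)));
    }
}
```

A dashboard or alert rule that consumes these rates would likely want to skip negative values in the same way, instead of plotting them as percentages.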

RobWin commented 4 years ago

https://resilience4j.readme.io/docs/circuitbreaker#failure-rate-and-slow-call-rate-thresholds

selly-selly commented 3 years ago

I am facing the same issue.

"details": {
"failureRate": "0.0%",
"failureRateThreshold": "10.0%",
"slowCallRate": "0.0%",
"slowCallRateThreshold": "10.0%",
"bufferedCalls": 150,
"slowCalls": 0,
"slowFailedCalls": -134570,
"failedCalls": 0,
"notPermittedCalls": 0,
"state": "CLOSED"
}

My concern is slowFailedCalls, which is negative. The circuit breaker switched to OPEN as expected during the incident. However, the metrics are reset after the state changes to HALF_OPEN and then CLOSED, and some slow or failed calls are recorded after that. Later, as the slow and failed call counts approach 0, slowFailedCalls starts to go negative.

It seems uncertain when slowFailedCalls will next reset to 0. I am also using an annotation-based circuit breaker. Is this behavior expected?

This is in production now, and I would like to roll back if this is not expected behavior.

RobWin commented 3 years ago

@selly-selly Which version are you using?

selly-selly commented 3 years ago

@RobWin, thanks for the quick response. I'm using resilience4j-spring-boot2 v1.7.0 and Spring Boot v2.2.1.

It seems to happen when the slowFailedCalls count is lower than slowCalls or failedCalls:

"slowCalls": 6,
"slowFailedCalls": 3,
"failedCalls": 6,

Starting from the metrics above, as successful calls come in:

"slowCalls": 6 → 5 → 4 → 3 → 2 → 1 → 0 → 0
"slowFailedCalls": 3 → 2 → 1 → 0 → -1 → -2 → -3 → -4
"failedCalls": 6 → 5 → 4 → 3 → 2 → 1 → 0 → 0
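
Assuming that slowFailedCalls counts calls that were both slow and failed, a consistent snapshot should satisfy 0 <= slowFailedCalls <= min(slowCalls, failedCalls). The sketch below is a hypothetical helper (the class and method names are not Resilience4j API) that applies this check to values reported in this thread. It does not explain the cause, but it may be useful as a guard when exporting metrics.

```java
/** Hypothetical consistency check; not part of Resilience4j. */
final class SnapshotInvariantSketch {

    /**
     * Returns true when the counters are mutually consistent, assuming
     * slowFailedCalls counts calls that were both slow and failed.
     */
    static boolean isConsistent(int slowCalls, int failedCalls, int slowFailedCalls) {
        return slowFailedCalls >= 0
                && slowFailedCalls <= Math.min(slowCalls, failedCalls);
    }

    public static void main(String[] args) {
        // Values taken from the reports in this thread.
        System.out.println(isConsistent(6, 6, 3));       // start of the sequence: true
        System.out.println(isConsistent(2, 2, -1));      // fifth step of the sequence: false
        System.out.println(isConsistent(0, 0, -134570)); // health endpoint output: false
    }
}
```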

selly-selly commented 3 years ago

Hello~ Any updates about this?