spring-cloud / spring-cloud-circuitbreaker

Spring Cloud Circuit Breaker API and Implementations
Apache License 2.0
329 stars 110 forks source link

Resilience4J bulkhead metrics missing at /actutor/prometheus #121

Closed tjuchniewicz closed 2 years ago

tjuchniewicz commented 2 years ago

How to reproduce

  1. use two different circuit breaker (CB) instances (cb1, cb2)
  2. define resilience4j.thread-pool-bulkhead.configs.default.coreThreadPoolSiz: 5
  3. define custom config for only one CB instance resilience4j.thread-pool-bulkhead.instances.cb2.coreThreadPoolSize: 10
  4. you will not see metrics for cb1 (e.g. resilience4j.bulkhead.max.thread.pool.size) at /prometheus but /metrics is fine

Explanation Prometheus requires that all meters with the same name have the same set of tag keys. See: https://github.com/micrometer-metrics/micrometer/issues/877 When we define custom config for circuit breaker, Spring Cloud Circuit Breaker (SCCB) will initialize CB instance at startup and this automatically register metric in Micrometer without any tags. At runtime, for all CB instances, SCCB executes Resilience4jBulkheadProvider.run() and initializes other CB using group tag. See: https://github.com/spring-cloud/spring-cloud-circuitbreaker/blob/72868b72448d8bf072f168093b6bdadec609d9c7/spring-cloud-circuitbreaker-resilience4j/src/main/java/org/springframework/cloud/circuitbreaker/resilience4j/Resilience4JCircuitBreaker.java#L107 The result is that Prometheus will ignore metrics for cb1.

Workaround Remove group tag in your code:

@Bean
public MeterFilter removeGroupTagFromResilience4jBulkheadMeterFilter() {
    return new MeterFilter() {
        @Override
        public Meter.Id map(Meter.Id id) {
            if(id.getName().startsWith("resilience4j.bulkhead")) {
                List<Tag> tags = id.getTags().stream().filter(t -> !"group".equals(t.getKey())).collect(Collectors.toList());
                return id.replaceTags(tags);
            }
           return id;
        }
    };
}

Proposed fix Do not pass group tag to Resilience4jBulkheadProvider. Do we really need group as tag? Looks like the same value is used as name tag (created from id).

ryanjbaxter commented 2 years ago

When we define custom config for circuit breaker, Spring Cloud Circuit Breaker (SCCB) will initialize CB instance at startup and this automatically register metric in Micrometer without any tags

Where does this happen? I am not seeing it

Also if we removed the group tag wouldn't that result in the CB not being part of the group?

Wonder if @Ferioney might care to comment as he originally wrote this code.

tjuchniewicz commented 2 years ago

Where does this happen? I am not seeing it

See: ThreadPoolBulkheadConfiguration.threadPoolBulkheadRegistry(): https://github.com/resilience4j/resilience4j/blob/809080dfcdf2a72efd27c5b1ad0e78d8c4c00176/resilience4j-spring/src/main/java/io/github/resilience4j/bulkhead/configure/threadpool/ThreadPoolBulkheadConfiguration.java#L72

Also if we removed the group tag wouldn't that result in the CB not being part of the group?

In my opinion tags sent to Resilience4j are only used for metrics. Grouping is implemented in Spring Cloud and is not directly related to any of Resilience4j feature. Will not break anything in my opinion.

ryanjbaxter commented 2 years ago

@tjuchniewicz would it be possible to provide a sample that reproduces the problem so I can dig into it further?

tjuchniewicz commented 2 years ago

@ryanjbaxter here you go: https://github.com/tjuchniewicz/r4j-metrics-issue

tjuchniewicz commented 2 years ago

@ryanjbaxter is this really fixed? I'm using:

<dependency>
   <groupId>org.springframework.cloud</groupId>
   <artifactId>spring-cloud-starter-circuitbreaker-resilience4j</artifactId>
   <version>2.1.1</version>
</dependency>   

and issue still exists?

ryanjbaxter commented 2 years ago

It should be yes, I tested it with the sample you provided in the issue

tjuchniewicz commented 2 years ago

It should be yes, I tested it with the sample you provided in the issue

Can you show what exactly you change in the sample? I double checked using spring-cloud-starter-circuitbreaker-resilience4j 2.1.1 and issue still exists.

ryanjbaxter commented 2 years ago

The only change I made was switch the boot version to 2.6.3 and the cloud version to 2021.0.1.

I also fixed the property names in application properties

resilience4j.thread-pool-bulkhead.configs.default.coreThreadPoolSize=10
resilience4j.timelimiter.configs.default.coreThreadPoolSize=10s

If I go to http://localhost:8080/actuator/prometheus using 2021.0.0 I get the following...store-one missing.

# HELP resilience4j_bulkhead_thread_pool_size The thread pool size
# TYPE resilience4j_bulkhead_thread_pool_size gauge
resilience4j_bulkhead_thread_pool_size{name="store-two",} 1.0

If I go to http://localhost:8080/actuator/prometheus using 2021.0.1 I get the following...

# HELP resilience4j_bulkhead_core_thread_pool_size The core thread pool size
# TYPE resilience4j_bulkhead_core_thread_pool_size gauge
resilience4j_bulkhead_core_thread_pool_size{group="none",name="store-two",} 9.0
resilience4j_bulkhead_core_thread_pool_size{group="store-one",name="store-one",} 10.0
tjuchniewicz commented 2 years ago

@ryanjbaxter See https://github.com/tjuchniewicz/r4j-metrics-issue/commit/69bb56db2d62e8eccddd58f394db370530b478a7. I've checked several times.. For me issue still exists... Could you double check on your side again?

ryanjbaxter commented 2 years ago

Can you add feign.circuitbreaker.enabled=true to your application.properties?

tjuchniewicz commented 2 years ago

Works with feign.circuitbreaker.enabled=true. For my use-case this is fine. I use Resilience4J only with Feign. Not sure about other use cases.