stargate / data-api

JSON document API for Apache Cassandra (formerly known as JSON API)
https://stargate.io
Apache License 2.0

Add metric for command processor latency #1501

Open jeffreyscarpenter opened 1 month ago

jeffreyscarpenter commented 1 month ago

In PR #1029 we removed the histograms from the command_processor_process metrics because of the extremely high cardinality that was causing problems for Grafana dashboards and metrics forwarding.

The high cardinality was caused by the multiplicity of tags we supported, including command, tenant, sort type, error, error class, error_code, and vector_enabled. Adding histogram buckets on top of that tag combination can produce a huge number of series.
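To make the multiplicative effect concrete (purely illustrative numbers, not measured from the system): with, say, 20 commands, 100 tenants, 3 sort types, 2 error values, 10 error classes/codes, and 2 vector_enabled values, the tag combinations alone come to 20 × 100 × 3 × 2 × 10 × 2 = 240,000 series, and multiplying that by a few dozen histogram buckets (plus the +Inf bucket, sum, and count series) pushes it into the millions.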

However, it would still be useful to track latency by command for debugging purposes. We should add a new metric called command_processor_latency: a histogram metric tagged by command and tenant only.
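A minimal sketch of how that could be recorded with Micrometer (the class and method names below are placeholders, not existing code in this repo):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;

// Hypothetical sketch: a latency histogram tagged only by command and tenant,
// so the series count stays bounded by (#commands x #tenants x #buckets).
public class CommandLatencyMetrics {

  private final MeterRegistry registry;

  public CommandLatencyMetrics(MeterRegistry registry) {
    this.registry = registry;
  }

  public void recordLatency(String command, String tenant, Duration elapsed) {
    Timer.builder("command_processor_latency")
        .tag("command", command)
        .tag("tenant", tenant)
        .publishPercentileHistogram() // emit bucket series for Prometheus-style histograms
        .register(registry)
        .record(elapsed);
  }
}
```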

amorton commented 1 month ago

Confirmed: when requests to /metrics indicate that they accept compression, the response is still returned without compression.

[Screenshot attached: 2024-10-23, 9:10 AM]
Hazel-Datastax commented 1 month ago

This metric may cause a high cardinality problem, so I got two suggestions from Frank Rosner:

le="+Inf" That can be problematic. Prom histograms are known to be inefficient. You can read about it here: https://github.com/riptano/cndb/issues/8466 We're trying to use the victoriametrics specific histogram implementation which looks promising in terms of reducing storage cost and other things.

Please consider supporting gzip encoding for larger scrapes (>2 MB) in your endpoint. vmagent will send the corresponding Accept-Encoding header automatically. Maybe it's as simple as quarkus.http.enable-compression=true?

After investigating these two suggestions, I found:

  1. Regarding VictoriaMetrics histograms, Micrometer doesn't support that format. Implementing them could be costly, as it would require customizing Micrometer or switching to the VictoriaMetrics client library.
  2. Enabling gzip is not as simple as quarkus.http.enable-compression=true:

    2.1 While Quarkus provides configuration options such as quarkus.http.enable-compression=true to support response body compression, this setting only applies to application endpoints; it doesn't seem to work with the Micrometer-registered custom route /metrics. I've tried various related configurations with no success. I also found a relevant discussion in the Quarkus repo (https://github.com/quarkusio/quarkus/issues/26112), where someone noted: "In Quarkus 2.9.2.Final, I also observed that the /q/metrics endpoint (from quarkus-smallrye-metrics) response is no longer compressed." It seems that at some point, compression for the /metrics endpoint stopped working after an update.

    2.2 Micrometer itself doesn’t provide built-in support for compressing metrics responses, so there’s no direct configuration available to handle this.

    2.3 I also attempted to use HTTP filters to intercept requests to the /metrics endpoint, but they didn't work either. Another discussion I found (https://github.com/quarkusio/quarkus/discussions/32671) mentioned that filters can't intercept non-application endpoints. One proposed solution is to bypass the built-in /metrics endpoint and handle it manually: scrape the metrics, compress them, and return the compressed response (a rough sketch of that approach follows below).
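For reference, here is a minimal sketch of that manual workaround, assuming the PrometheusMeterRegistry can be constructor-injected; the /app-metrics path and the class/method names are made up, and a real setup would also need the scraper (Prometheus/vmagent) pointed at the new path:

```java
import io.micrometer.prometheus.PrometheusMeterRegistry;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.HeaderParam;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.Response;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: serve the Prometheus scrape from an application endpoint
// and gzip it ourselves, bypassing the built-in non-application /metrics route.
@Path("/app-metrics")
public class CompressedMetricsResource {

  private final PrometheusMeterRegistry registry;

  public CompressedMetricsResource(PrometheusMeterRegistry registry) {
    this.registry = registry;
  }

  @GET
  @Produces("text/plain")
  public Response scrape(@HeaderParam("Accept-Encoding") String acceptEncoding) throws IOException {
    String body = registry.scrape();
    if (acceptEncoding == null || !acceptEncoding.contains("gzip")) {
      // Caller did not ask for compression; return the plain scrape.
      return Response.ok(body).build();
    }
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
      gzip.write(body.getBytes(StandardCharsets.UTF_8));
    }
    return Response.ok(buffer.toByteArray())
        .header("Content-Encoding", "gzip")
        .build();
  }
}
```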