Open scholzj opened 3 months ago
Note: The two metrics above were produced by our own examples -> the first one is from the plain Java example and the other from the Vert.x example. So those can be used to reproduce this.
I had a little bit of investigation and it seems that this metrics part is not working as expected since ... 0.20.0 release (using Vert.x 4.1.0) :-o.
The last working for me providing the right path
tag is with 0.19.0 (using Vert.x 3.9.3).
I went through the Vert.x micrometer implementation and more specifically the part which provides the HTTP server request ones (because these are not metrics generated by us) in the VertxHttpServerMetrics
class and currently they are using the request.uri()
for the path
tag:
Up to the latest 3.9.x (so 3.9.16) they were using the request.path()
instead:
This explain the breaking point between bridge 0.19.0 and 0.20.0.
So I guess they had a reason for making the switch even if the right path
meaning in HTTP terms should be request.path()
(just the request path, no server host, no query string) and not request.uri()
(which contains everything up to the query string).
I am going to ask on the Vert.x micrometer GitHub repo about this. Meanwhile checking if there is a way to "override" the tag on our side.
So other than the solution I have in my current open PR I have two more:
MeterFilter
(as pointed by the Vert.x guy) in the bridge to override the path valueThe two solutions are pretty similar but done at two different levels. The Grafana one for example means changing from:
sum(rate(strimzi_bridge_http_server_requests_total{container=~"^.+-bridge",method="GET",path=~"/consumers/.+/instances/.+/records"}[5m])) by (path)
to:
sum(label_replace(rate(strimzi_bridge_http_server_requests_total{container=~"^.+-bridge",method="GET",path=~".*/consumers/.+/instances/.+/records.*"}[5m]), "resource_path", "$1", "path", ".*(/consumers/.+/instances/.+/records).*")) by (resource_path)
It's actually doing a label_replace
so catching the overall new path (notice the added .*
to catch with host and query string) but then replacing with the value of the group specified in the last parameter so just getting what is in the parenthesis here .*(/consumers/.+/instances/.+/records).*
.
Of course, it has to be applied to all the other metrics on the dashboard.
The MeterFilter
solution is about doing actually almost the same but at the bridge level code.
Following a snippet to show how it should work:
Pattern pattern = Pattern.compile(".*(/consumers/.+/instances/.+/records).*");
if (this.metricsReporter.getMeterRegistry() != null) {
// SOME OTHER UNRELATED CODE
this.metricsReporter.getMeterRegistry().config().meterFilter(
MeterFilter.replaceTagValues(Label.HTTP_PATH.toString(), path -> {
Matcher m = pattern.matcher(path);
if (m.matches()) {
return m.group(1);
}
return path;
}, "")
);
}
So we need anyway a pattern for each metric we expose on the dashboard, and than we are matching the pattern and getting the group (as for the Grafana solution).
I am open to discuss what we prefer. Anyway the discussion with Vert.x folks is still open. any thoughts @strimzi/maintainers ? Personally I am for changing the bridge code by using a MeterFilter to leave the dashboards simpler.
After some offline discussion with @scholzj we think that it could be better to leave the bridge as it is while waiting for an answer from Vert.x folks on the issue I raised on the Vert.x micrometer repo. If no fix from their side or something coming later with a new release, we can fix the issue at dashboard level.
Triaged on 03.10.2024: Paolo is going to ping Vert.x folks again on the issue he raised. If no answer in 1-2 weeks, we are going to fix this on our side at dashboard level.
Vert.x folks replied on the issue I opened and they scoped the fix for 4.5.11 release. When it is released we should come back and check that metrics are correct and dashboard finally works again on our side.
Triaged on 17/10/2024: we'll wait for Vert.x 4.5.11 release taking the fix and check that everything is finally fine on our metrics and dashboard side.
The
path
label in the Bridge metrics seems to contain very different values. That breaks dashbords and makes it hard to work with the metrics. For example, for the poll metrics, you can get the metrics like this:As you can notice:
?timeout=100
)http://my-bridge-bridge-service:8080
) as well as the path parameters (?timeout=100
)Both of these breaks the Strimzi Bridge Grafana Dashabord that expects the path to be always only
/consumers/<NAME>/instances/<NAME>/records
. So the dashboard does not show this metric. I noticed it at the Poll chart but this likely impacts also other metrics / charts.