Open nerdondon opened 1 year ago
The `http.server.duration` metric is a histogram, and histograms capture the total count of observations. Does that not work for you?
A `count` aggregator would be useful for instances like this.
I've been meaning to send a PR to suggest removing this from the metrics SDK spec:
Customize the aggregation - if the default aggregation associated with the Instrument does not meet the needs of the user. For example, an HTTP client library might expose HTTP client request duration as Histogram by default, but the application developer might only want the total count of outgoing requests.
That's because aggregating to just a count is not currently supported, except, I guess, with a 1-bucket histogram, which isn't really intuitive and isn't what this passage is really saying.
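To make the histogram-versus-count point concrete, here is a minimal sketch in plain Python. It is not the OTel SDK: `ToyHistogram`, the bucket boundaries, and the sample durations are all made up for illustration. It only shows that any explicit-bucket histogram already carries the total count, and that a histogram with no boundaries (the "1 bucket" case) degenerates to a count plus a sum.

```python
from bisect import bisect_left


class ToyHistogram:
    """Toy explicit-bucket histogram; a stand-in, not the OTel SDK aggregation."""

    def __init__(self, boundaries=()):
        self.boundaries = sorted(boundaries)
        # N boundaries give N + 1 buckets; the last bucket is unbounded above.
        self.bucket_counts = [0] * (len(self.boundaries) + 1)
        self.count = 0
        self.sum = 0.0

    def record(self, value):
        # A value equal to a boundary lands in that boundary's bucket.
        self.bucket_counts[bisect_left(self.boundaries, value)] += 1
        self.count += 1
        self.sum += value


# Hypothetical request durations in seconds.
durations = [0.012, 0.250, 0.031, 1.700]

default_like = ToyHistogram(boundaries=[0.05, 0.5, 1.0])
single_bucket = ToyHistogram()  # no boundaries: the "1 bucket histogram"
for d in durations:
    default_like.record(d)
    single_bucket.record(d)

print(default_like.count, default_like.bucket_counts)
print(single_bucket.count, single_bucket.bucket_counts)
```

In both cases `count` equals the number of recorded requests, so the open question is less whether the count exists and more whether a given backend or exporter makes it easy to query.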
> The `http.server.duration` metric is a histogram, and histograms capture the total count of observations. Does that not work for you?
@mateuszrzeszutek that seems to be fine, if somewhat non-ergonomic, as @tsloughter is pointing out. A developer in this case would have to either report histogram measurements exactly or override the aggregation and use a 1-bucket histogram.
@tsloughter I'm not sure what the exact issue is regarding supporting aggregation to a count, though. Can you elaborate? I thought the histogram should just expose a `count` property that represents the total population of points.
Edit: Also, I wanted to ask what it would look like for a client that wanted to report different sets of attributes for duration and count. Would they have to keep two different histogram measures of duration? It still seems more ergonomic to have a dedicated `count` instrument.
Additionally, from the telemetry consumption side, it seems that there should be a defined count instrument. That way, clients claiming compliance with HTTP semantic conventions would need to provide a request count instrument.
cc: @jsuereth
@nerdondon no issue, I think a `count` aggregator should be proposed. I guess my side note about the spec muddied that :)
@SergeyKanzhelev or @jsuereth sorry for the bump, but I just wanted to ask if there is something I could do to get more movement on this?
My preference would be to name that metric as a rate: `http.server.rate`, `http.client.rate`.
As a user, I'm interested in the current success rate and response rate. That means every metric should send both values, one with tag `status=success` and one with tag `status=fail`, corresponding to the Success Rate and Error Rate metrics I can visualize with tools like Grafana.
As an alternative, the naming can follow `http.{server|client}.rate.{success|error}`.
However, the definition of an error may vary (timeouts, 5xx errors, or status >= 400) and must be specified as well.
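As a rough illustration of why the error definition has to be pinned down, the sketch below classifies the same responses under two hypothetical policies. The `classify` helper, its `client_errors_count` flag, and the sample status codes are assumptions for this example only; nothing here comes from the semantic conventions.

```python
from collections import Counter


def classify(status_code, timed_out=False, client_errors_count=False):
    """Return "fail" or "success" under a hypothetical error definition."""
    if timed_out:
        return "fail"
    # Either every status >= 400 is an error, or only 5xx responses are.
    threshold = 400 if client_errors_count else 500
    return "fail" if status_code >= threshold else "success"


# Hypothetical response status codes observed in one interval.
responses = [200, 201, 404, 500, 503, 302]

strict = Counter(classify(code, client_errors_count=True) for code in responses)
lenient = Counter(classify(code) for code in responses)

strict_error_rate = strict["fail"] / len(responses)
lenient_error_rate = lenient["fail"] / len(responses)
print(strict, strict_error_rate)
print(lenient, lenient_error_rate)
```

The same traffic yields a different error rate depending on whether the 404 counts, which is why a `status` tag would need a precise definition before it could be standardized.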
Example of metrics with Grafana tooling: https://grafana.com/grafana/plugins/novatec-sdg-panel/
Response Time | in_timesum
Request Rate | in_count
Error Rate | error_in
Response Time (Outgoing) | out_timesum
Request Rate (Outgoing) | out_count
Error Rate (Outgoing) | error_out
In my understanding of the current structure of semantic conventions, a proposal would involve the standardization of a particular instrument. In the case of my proposed count, this is just a counter from the OTel API. I'm not sure what that would look like with a rate. In any case, a rate is an aggregation over the instrument. Having the base instrument would allow other aggregations as desired. Also, note the prior art that I linked with regard to this proposal. Using a count would enable an easier path to adoption of the conventions.
I don't want to muddy the waters here by bringing in a discussion on an attribute for status class (there's already another issue regarding that IIRC). This is specifically about an operations count instrument that can be used to fulfill part of the use case you mentioned and others.
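To illustrate the point that a rate is an aggregation over a base counter, here is a small sketch that derives a per-second rate from cumulative counter samples. The `rate_per_second` helper and the sample values are hypothetical, and counter resets are deliberately not handled.

```python
def rate_per_second(samples):
    """Average per-second increase between the first and last sample.

    `samples` is a list of (unix_seconds, cumulative_count) pairs, oldest first.
    Counter resets are not handled in this sketch.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) / (t1 - t0)


# Hypothetical exports of a cumulative request counter.
samples = [(0.0, 100), (15.0, 130), (30.0, 190), (60.0, 400)]

overall = rate_per_second(samples)      # whole 60 s window
recent = rate_per_second(samples[-2:])  # last 30 s only
print(overall, recent)
```

Because the window is chosen at query time, the same counter can serve a 30-second rate, a 5-minute rate, or a plain total, which a pre-aggregated rate instrument could not.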
I don't think an instrument is needed, only a Count aggregation.
+1 for this, I'm using Datadog and I'm seeing no way of extracting the count of the duration to do this metric.
It may be worth asking Datadog about this, since I believe other backends are getting the request count from the `http.server.duration` metric.
https://github.com/open-telemetry/semantic-conventions/issues/1362
With current conventions, most of our concerns are solved with:

| Metric | PromQL example |
|---|---|
| Latency, P99 | `histogram_quantile(0.99, http.server.request.duration{})` |
| Request Rate | `count(http.server.request.duration{})` |
| Rate Increase | `rate(http.server.request.duration{})` |
| Error Rate | `(count(http.server.request.duration{ http.response.status_code =~ "5.*"}) or vector(0)) / count(http.server.request.duration{})` |
| Inflight requests | `http.server.active_requests{}` |
I think it would be worth having some documentation on how OTel metrics can be converted to operational metrics.
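As one example of what such documentation might walk through, the sketch below estimates a latency quantile from explicit-bucket histogram data by linear interpolation inside the bucket that contains the target rank. This is similar in spirit to what a `histogram_quantile` query does, but it is only an approximation written for illustration; the function name, boundaries, and counts are assumptions.

```python
def estimate_quantile(q, boundaries, bucket_counts):
    """Estimate the q-quantile (0 < q <= 1) from explicit-bucket counts.

    `boundaries` are sorted upper bounds; `bucket_counts` has one extra
    trailing entry for the unbounded overflow bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = sum(bucket_counts)
    if total == 0 or not boundaries:
        return float("nan")
    rank = q * total
    cumulative = 0
    for i, c in enumerate(bucket_counts):
        prev = cumulative
        cumulative += c
        if cumulative >= rank:
            if i == len(boundaries):
                # Overflow bucket has no upper bound; fall back to the
                # highest finite boundary.
                return boundaries[-1]
            lower = boundaries[i - 1] if i > 0 else 0.0
            upper = boundaries[i]
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - prev) / c
    return boundaries[-1]


# Hypothetical request-duration histogram, in seconds.
boundaries = [0.05, 0.5, 1.0]
bucket_counts = [90, 8, 1, 1]

p50 = estimate_quantile(0.5, boundaries, bucket_counts)
p95 = estimate_quantile(0.95, boundaries, bucket_counts)
print(sum(bucket_counts), p50, p95)
```

The estimate is only as precise as the bucket layout allows, so any documentation on this would probably also need to cover how bucket boundaries are chosen.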
What are you trying to achieve?
Add instruments for HTTP client and server request counts.
Additional context.
The current HTTP semantic conventions only have an instrument for active requests (`http.server.active_requests`). This proposal is to add counters like `http.server.request.count` and `http.client.request.count`. This seemed to be part of the original PR for HTTP semantic conventions as `http.{type}.requests` but was lost somehow. I think this is a prime candidate to be codified as a semantic convention because of very similar instrumentation across service meshes.

This metric is also important for capturing QPS and deriving error rate.
If this is desirable, I think I would be willing to take a crack at adding it.