open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Recommended histogram bucket sizes for HTTP connection duration #336

Open JamesNK opened 1 year ago

JamesNK commented 1 year ago

The http.server.request.duration histogram has recommended bucket boundaries: https://github.com/open-telemetry/semantic-conventions/blob/203691d99612452df0c951640b04521e34969628/docs/http/http-metrics.md?plain=1#L67-L68

A server library has a histogram to track HTTP connection duration. It should have defined bucket boundaries, but I'm unsure what values to set. The buckets recommended for HTTP request duration are too short for connections (a connection could last minutes, hours, or even days).

Is there any agreement in the OTEL ecosystem about what good histogram buckets are for HTTP connection duration? (or longer-running tasks in general)

trask commented 1 year ago

related: #316

samsp-msft commented 1 year ago

I am suggesting: [0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]

in https://github.com/open-telemetry/opentelemetry-dotnet/issues/4922

It uses an approximately 2x escalation for each bucket, with alignment to minutes at the end.

It doesn't go up to hours, but the main benefit of longer connections is that you don't have to pay the connection setup cost on each request. Once the connection duration is on the order of minutes, the incremental benefit of keeping the connection open longer rapidly diminishes. This should be a good balance.
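
For illustration, here is a minimal sketch of how these boundaries could be applied with an OpenTelemetry .NET view. The instrument name `http.server.connection.duration` and meter name `MyServer` are placeholders, not names defined by the conventions, and the exporter is whatever you already use:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Sketch: apply the proposed boundaries (in seconds) to a connection-duration
// histogram through an SDK view. Instrument and meter names are placeholders.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyServer")
    .AddView(
        instrumentName: "http.server.connection.duration",
        new ExplicitBucketHistogramConfiguration
        {
            Boundaries = new double[]
            {
                0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5,
                1, 2, 5, 10, 30, 60, 120, 300
            }
        })
    .AddConsoleExporter() // or whatever exporter is already configured
    .Build();
```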

JamesNK commented 1 year ago

Up to 300 seconds is much better than 10 seconds. I think there are situations where a connection could live quite a long time. For example, web sockets in the browser (e.g. SignalR) and server-to-server scenarios where a client is reused for a long time.

I removed some of the smaller values and added capacity for up to an hour.

Before: [0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]

After: [0, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 60, 120, 300, 600, 1200, 3600]

TBH I'm not sure exactly where most connection lifetimes end up. I would be ok with tracking up to 300 seconds and then adjusting if needed.

Update: ASP.NET Core Kestrel is using: [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]
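
As a rough sketch of the library side with System.Diagnostics.Metrics (the meter and instrument names below are placeholders, not Kestrel's actual ones), the instrument only reports raw values in seconds; the bucket boundaries come from the SDK defaults or from a view like the one above:

```csharp
using System;
using System.Diagnostics.Metrics;

// Sketch: report raw connection durations; buckets are chosen by SDK/view config.
// Meter and instrument names are placeholders, not Kestrel's actual ones.
public sealed class ConnectionMetrics
{
    private static readonly Meter ServerMeter = new("MyServer");

    private static readonly Histogram<double> ConnectionDuration =
        ServerMeter.CreateHistogram<double>(
            "myserver.connection.duration",
            unit: "s",
            description: "Duration of HTTP server connections.");

    public static void RecordConnectionClosed(TimeSpan duration)
    {
        ConnectionDuration.Record(duration.TotalSeconds);
    }
}
```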

francoposa commented 6 days ago

Is there any update on this? In particular, when these get converted to Prometheus histograms, which do not track min and max, as far as I can tell we can never query the metrics to show request durations longer than 10 seconds.

10 seconds is a shockingly low amount of time for the highest request duration bucket to record.

trask commented 5 days ago

hi @francoposa, this issue is about connection duration (across multiple requests) as opposed to request duration.

you may want to consider using a metric view to configure longer request duration buckets for your use case
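
For illustration, here is roughly what such a view looks like with the OpenTelemetry .NET SDK (other language SDKs expose an equivalent view/aggregation API); the boundary values and exporter are just an example, not a recommendation:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Sketch: override the http.server.request.duration buckets with a view that
// extends past 10 seconds. The boundary values here are illustrative only.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddAspNetCoreInstrumentation()
    .AddView(
        instrumentName: "http.server.request.duration",
        new ExplicitBucketHistogramConfiguration
        {
            Boundaries = new double[]
            {
                0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5,
                0.75, 1, 2.5, 5, 7.5, 10, 30, 60, 120, 300
            }
        })
    .AddPrometheusExporter() // or the exporter you already use
    .Build();
```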