open-telemetry / opentelemetry-python-contrib

OpenTelemetry instrumentation for Python modules
https://opentelemetry.io
Apache License 2.0
739 stars 612 forks

OTEL_SEMCONV_STABILITY_OPT_IN latency buckets too big #3011

Open bergur88 opened 1 week ago

bergur88 commented 1 week ago

Describe your environment

Services are built from the docker image python:3.10.15-slim and run on Kubernetes. The services use:

- opentelemetry-api==1.27.0
- opentelemetry-sdk==1.27.0
- opentelemetry-propagator-b3==1.27.0
- opentelemetry-exporter-otlp-proto-grpc==1.27.0
- opentelemetry-instrumentation-fastapi==0.48b0
- opentelemetry-instrumentation-aiohttp-client==0.48b0
- opentelemetry-instrumentation-asyncpg==0.48b0
- opentelemetry-instrumentation-psycopg==0.48b0
- opentelemetry-instrumentation-psycopg2==0.48b0
- opentelemetry-instrumentation-requests==0.48b0
- opentelemetry-instrumentation-logging==0.48b0
- opentelemetry-instrumentation-system-metrics==0.48b0
- opentelemetry-instrumentation-grpc==0.48b0

What happened?

I'm using the OTEL_SEMCONV_STABILITY_OPT_IN feature (currently running with http/dup) and am seeing some odd results for HTTP latencies. The new metric appears to use the same bucket boundaries as the old one. Shouldn't the buckets be smaller now that the unit has changed from milliseconds to seconds? With the lowest bucket being 5 seconds it is not particularly useful: the p99 computed from these metrics comes out at 5 seconds for most of my services/paths, which is not accurate.

Node.js and .NET overwrite the default buckets with saner values.

(two screenshots attached)

The images show the same metric over the same time range for the same label set as a histogram; the older one is more granular and useful. The queries used:

sum(rate(http_server_duration_milliseconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

sum(rate(http_server_request_duration_seconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

Steps to Reproduce

set OTEL_SEMCONV_STABILITY_OPT_IN="http/dup"

It can then be visualized in Grafana similarly to this:

sum(rate(http_server_duration_milliseconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

sum(rate(http_server_request_duration_seconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

Expected Result

I expected the semantic-convention metrics to yield the same percentiles for my services/paths as the old metrics.

Actual Result

The new metrics are skewed towards 5 seconds because of the bucket sizes.

Additional context

No response

Would you like to implement a fix?

None