open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0

Histogram metrics are much larger in v1.23.0 #3959

Open colincadams opened 3 weeks ago

colincadams commented 3 weeks ago

Describe your environment

We noticed a very large increase in our GCM cost due to an increase in metrics bytes ingested for our base histogram metrics (e.g. http.client.duration). This coincided with an upgrade to v1.23.0. A subsequent downgrade to v1.22.0 brought both the bytes ingested and the cost back down to their prior levels.

[Screenshot, 2024-06-06 5:49:44 PM]

This commit is the revert: https://github.com/Recidiviz/pulse-data/commit/d321a4e30f612e9964f18106ded28d6a0fce250e

Steps to reproduce

Upgrade to v1.23.0 or later (only tested up to v1.24.0, so it is possible it has been fixed)

What is the expected behavior?

No increase in bytes ingested by GCM for histogram metrics.

What is the actual behavior?

Order of magnitude increase in cost.

Additional context

I haven't taken the time to fully understand the changes here, but if this PR led to all of the buckets always being created, and before that was not the case, this could be the culprit: https://github.com/open-telemetry/opentelemetry-python/pull/3429

aabmass commented 2 weeks ago

The fix in #3429 might be the culprit. IIRC the previous behavior (see https://github.com/open-telemetry/opentelemetry-python/issues/3407) was that histograms would not be sent from the SDK to the exporter if there had been no observations since the last export.

Does your app have low QPS or low QPS for certain routes?
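To make the effect of that change concrete, here is a minimal sketch in plain Python. It is a simplified model, not the SDK's code: the function `exported_points` and the traffic pattern are hypothetical, and it assumes a series starts reporting at its first observation. It counts how many histogram data points one time series would send when idle export cycles are skipped versus when a point is sent on every cycle.

```python
def exported_points(observations_per_interval, skip_idle):
    """Count data points sent for one time series over a run of export cycles.

    observations_per_interval: one entry per export cycle, giving the number
        of measurements recorded during that cycle.
    skip_idle: True models the earlier behavior described above (no point when
        nothing was recorded since the last export); False models a point on
        every cycle once the series exists.
    """
    points = 0
    seen = False
    for count in observations_per_interval:
        seen = seen or count > 0
        if not seen:
            continue  # assumption: nothing is exported before the first observation
        if count > 0 or not skip_idle:
            points += 1
    return points


# Hypothetical low-traffic route: one request every 20 minutes for an hour,
# with a 60 s export interval (so 60 export cycles).
traffic = [1 if minute % 20 == 0 else 0 for minute in range(60)]

old_points = exported_points(traffic, skip_idle=True)
new_points = exported_points(traffic, skip_idle=False)
print(old_points, new_points)
```

Under these assumptions the idle-skipping model sends a point only in the cycles that saw a request, while the always-export model sends one per cycle, so the gap grows as traffic gets sparser.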

colincadams commented 2 weeks ago

@aabmass Yes, this is a fairly low-traffic application, so that does seem likely to be the root cause.

aabmass commented 2 weeks ago

@colincadams what is your export interval? You may be able to achieve similar cost savings by exporting less often
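As a rough illustration of why the export interval matters, the sketch below estimates data points per day when every series reports on every export cycle. The helper `points_per_day` and the series count are hypothetical. In the Python SDK the interval is normally set through the `export_interval_millis` argument of `PeriodicExportingMetricReader` or the `OTEL_METRIC_EXPORT_INTERVAL` environment variable; the code here is only arithmetic and does not call the SDK.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(export_interval_s, active_series):
    """Data points exported per day if every series reports on every cycle."""
    exports = SECONDS_PER_DAY // export_interval_s
    return exports * active_series


series = 50  # hypothetical number of histogram time series

# Compare a 60 s interval with 5 and 10 minute intervals.
for interval in (60, 300, 600):
    print(interval, points_per_day(interval, series))
```

In this model the point count scales inversely with the interval, so moving from 60 s to 300 s cuts the exported points by a factor of five. Actual billed bytes depend on the exporter and backend, which this sketch does not model.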

colincadams commented 1 week ago

Our export interval is 60s. We could certainly export less often, and that would help with cost savings. Did anything about bucket creation change? It seems like a pretty large increase just for reporting frequency, especially given that the cardinality of these metrics should be relatively low, but it's possible that's it.