open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0

Disable Cardinality limit / pre-aggregation? #5618

Open johanstenberg92 opened 6 months ago

johanstenberg92 commented 6 months ago

What is the question?

Hello,

Is there a way to disable the pre-aggregation and cardinality limit, and let the system that receives the metrics handle the throttling problem?

An alternative would be to heavily overestimate the cardinality limit, but then there's a concern about the initial memory allocation (which could also be better explained in the docs).

Our product has a backing system that can handle the cardinality we need, but we are wary of hard-coding fixed limits into the apps reporting to it, and of the memory consumption, since we have some very high cardinalities.

The documentation doesn't cover this scenario; do you have any advice? Thanks

Additional context

No response

cijothomas commented 6 months ago

is there a way to disable the pre-aggregation and cardinality limit

No. A MeasurementProcessor, as a concept, would technically allow one to bypass all in-memory aggregation and export raw measurements directly, but no such thing exists in the spec.

The cardinality limit docs are here, and they cover an experimental feature to reclaim unused points: https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits

Yes, the doc is not very explicit about the upfront memory allocation, good callout. Feel free to send a PR if you are up for it; otherwise we'll do it.

(Note: one metric point is less than 100 bytes, so even with 100,000 extra metric points that's only ~10 MB of extra memory. Do your own benchmarks and see if this is acceptable.)

We don't yet have a mechanism to monitor utilization. Once that lands, it'll be easy to see how much capacity is actually used vs. wasted.
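In the meantime, the default limit can at least be raised at SDK setup. A minimal sketch, assuming the `SetMaxMetricPointsPerMetricStream` builder extension available in SDK versions current at the time of this thread (the meter name, limit, and console exporter are illustrative, not prescriptive):

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Raise the default per-instrument cardinality limit (2000) at SDK setup.
// Note: this many MetricPoints are reserved per instrument up front.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyProduct")            // illustrative meter name
    .SetMaxMetricPointsPerMetricStream(10_000)  // applies to every metric stream
    .AddConsoleExporter()
    .Build();
```

This trades memory for headroom uniformly across all instruments, which is exactly the concern raised later in the thread.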

johanstenberg92 commented 6 months ago

Thank you for your response. Just to expand:

I previously used Datadog's "metrics without limits", where you essentially let the apps send whatever they can, then configure in the central system which dimensions you care about, and don't aggregate on those you don't. I feel a bit constrained by this solution, and I'm concerned about the burden of maintaining max-cardinality figures in the app and the potential memory risk.

That being said, we'll start experimenting. Thanks again.

hugo-brito commented 6 months ago

From what I read in https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits it seems like the cardinality limit is tweakable but uniformly enforced across all metrics.

How would the SDK know about and pre-allocate all the needed objects for all the metrics, if these are unknown at the beginning of the program?

If we are to estimate a worst-case for the most complex metric, which will then dictate the memory allocation for all the other metrics, wouldn't it be more prudent to consider metric-specific cardinality? The "one size fits all" approach feels a bit lacking...

Furthermore, with the current approach, we now must maintain this cardinality limit... Code changes will be needed if suddenly your cluster can fit double or triple the users.

So in summary, it would be great to either set the cardinality per metric and/or emit the metrics raw.

cijothomas commented 6 months ago

How would the SDK know about and pre-allocate all the needed objects for all the metrics, if these are unknown at the beginning of the program?

Not at the beginning of the program; rather, whenever an instrument is created, the SDK pre-allocates 2000 MetricPoints for it by default.
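That pre-allocation is easy to estimate up front. A back-of-envelope sketch using the ~100-bytes-per-point figure quoted earlier in the thread (the instrument count is a made-up example):

```csharp
using System;

// Rough estimate of the upfront MetricPoint reservation, using the
// ~100-bytes-per-point figure quoted above. Numbers are illustrative.
const int bytesPerMetricPoint = 100;  // approximate MetricPoint struct size
const int pointsPerInstrument = 2000; // SDK default pre-allocation
const int instrumentCount = 50;       // hypothetical app with 50 instruments

double totalMB = (double)bytesPerMetricPoint * pointsPerInstrument * instrumentCount
                 / (1024 * 1024);

Console.WriteLine($"~{totalMB:F1} MB reserved for metric points"); // ~9.5 MB
```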

cijothomas commented 6 months ago

dictate the memory allocation for all the other metrics, wouldn't it be more prudent to consider metric-specific cardinality? The "one size fits all" approach feels a bit lacking

So in summary, it would be great to either set the cardinality per metric and/or emit the metrics raw.

You are right! The ability to set the cardinality limit per metric is already supported as an experimental feature, available in pre-release builds.
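For reference, the per-metric limit goes through the View API. A sketch assuming the experimental `MetricStreamConfiguration.CardinalityLimit` property described in the SDK docs, which may require a pre-release package (the instrument name and limit are illustrative):

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Give one high-cardinality instrument a larger limit via a View,
// while every other instrument keeps the default (2000).
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyProduct")  // illustrative meter name
    .AddView("user.requests",          // hypothetical instrument name
        new MetricStreamConfiguration { CardinalityLimit = 50_000 })
    .AddConsoleExporter()
    .Build();
```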

or, emit the metrics raw.

This is not something we plan to offer until the spec allows it!

hugo-brito commented 5 months ago

With the current implementation, shouldn't there at least be a mechanism for us to know when metrics are being silently dropped (due to a low cardinality limit)?

cijothomas commented 5 months ago

With the current implementation, shouldn't there at least be a mechanism for us to know when metrics are being silently dropped (due to a low cardinality limit)?

There is an internal log emitted when the limit is hit for the first time; that is the current state. (It is not ideal. The overflow attribute will go a long way toward making this experience smoother, and once we expose a utilization metric, things will be much better than today.)

hugo-brito commented 5 months ago

Is there any guidance on how to expose such a metric? That way we could at least know if we're lowballing the max cardinality.

cijothomas commented 5 months ago

Is there any guidance on how to expose such a metric? That way we could at least know if we're lowballing the max cardinality.

https://github.com/open-telemetry/opentelemetry-dotnet/issues/3880 is the tracking issue! There were a few attempts in the past, but nothing shipped. If you are passionate about this space, consider contributing and we can guide you through the process! The linked issue points to the previous PRs attempting this, in case you want to pick one up.

reyang commented 5 months ago

Is there any guidance on how to expose such a metric? That way we could at least know if we're lowballing the max cardinality.

@hugo-brito https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits captures some useful links. Note that there are lots of moving pieces, and that the specification is still experimental: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#cardinality-limits

okonaraddi-msft commented 5 months ago

(Note: one metric point is less than 100 bytes, so even with 100,000 extra metric points that's only ~10 MB of extra memory. Do your own benchmarks and see if this is acceptable.)

Is there more info on where the 100 bytes comes from?

I'm wondering if a metric point could be >100 bytes, for example if many large key-value pairs (say, 50 keys, each with a 50-character name and a 50-character string value) were stored in the MetricPoint's Tags.

cijothomas commented 5 months ago

Is there more info on where the 100 bytes comes from?

It's the size of the MetricPoint struct. (Of course, the data it points to could be much larger, since that depends on the size of the keys/values etc., but the MetricPoint struct itself is a fixed size.)
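If you'd rather check the figure on your own runtime and SDK version than take ~100 bytes on faith, `Unsafe.SizeOf<T>()` reports a struct's size. A sketch, assuming `MetricPoint` is publicly visible in the SDK build you reference:

```csharp
using System;
using System.Runtime.CompilerServices;
using OpenTelemetry.Metrics;

// Print the fixed size of the MetricPoint struct on this runtime.
// (Referenced heap data, e.g. tag strings, is not included.)
Console.WriteLine($"sizeof(MetricPoint) = {Unsafe.SizeOf<MetricPoint>()} bytes");
```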

clupo commented 1 month ago

@hugo-brito

Is there any guidance on how to expose such metric? That way we could at least know if we're lowballing the max cardinality.

It appears there's an environment variable, OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE, that you can flip on to help with that:

https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits

In my testing, the tag that shows up on the offending metrics is otel.metric.overflow: true.
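Putting that together, a sketch of opting in programmatically, as an alternative to setting the variable in the process environment (the meter name and console exporter are illustrative):

```csharp
using System;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Opt in to the experimental overflow attribute BEFORE building the SDK,
// so over-limit measurements land on a single overflow series
// (tagged otel.metric.overflow = true) instead of vanishing silently.
Environment.SetEnvironmentVariable(
    "OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE", "true");

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyProduct")  // illustrative meter name
    .AddConsoleExporter()
    .Build();
```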