open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0

Ability to forget and reclaim MetricPoint (Delta) #2360

Closed cijothomas closed 8 months ago

cijothomas commented 3 years ago

Each Metric keeps track of its MetricPoints. These points are never removed, even if they receive no new updates. For delta aggregation: we should reclaim any unused MetricPoints when we hit the MetricPoint limit. Due to the pre-allocation of MetricPoints in the current implementation, this won't produce any memory savings, but we'll be able to accommodate more MetricPoints by reclaiming unused ones. This would mean the MetricPoint limit becomes a "limit of active points", rather than a "limit of points seen since the beginning of the process".

For cumulative aggregation: it's possible to "forget" an unused MetricPoint. However, if the forgotten point is reported again, we won't be able to calculate the cumulative value since process start. This would result in a reset (i.e. the start time will be reset).

EDIT by utpilla:

This issue has been updated to track MetricPoint reclaim for Delta aggregation temporality only.
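
To illustrate the current behavior this issue wants to change, here is a rough sketch (an illustration only, not the SDK's actual implementation) of a fixed pool of MetricPoints that is never reclaimed: once every slot has been handed out, new tag combinations are dropped for the remainder of the process, even if the earlier combinations never report again.

```csharp
// Rough sketch (not the SDK's implementation) of the current behavior:
// MetricPoint slots are pre-allocated and never reclaimed, so the limit is
// "points seen since process start" rather than "active points".
using System.Collections.Concurrent;
using System.Threading;

class MetricPointPoolSketch
{
    private readonly long[] deltaSums;                           // one slot per MetricPoint
    private readonly ConcurrentDictionary<string, int> tagsToIndex = new();
    private int lastIndex = -1;

    public MetricPointPoolSketch(int maxMetricPoints) => deltaSums = new long[maxMetricPoints];

    public bool TryRecord(string serializedTags, long value)
    {
        // Lock-free hot path: this only works because a slot, once assigned to
        // a tag set, is never taken away. (Under contention the factory below
        // may run more than once and waste an index; acceptable for a sketch.)
        int index = tagsToIndex.GetOrAdd(serializedTags, _ => Interlocked.Increment(ref lastIndex));
        if (index >= deltaSums.Length)
        {
            return false; // limit reached: this tag set is dropped for the rest of the process
        }

        Interlocked.Add(ref deltaSums[index], value);
        return true;
    }
}
```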

cijothomas commented 2 years ago

This issue is slightly related to https://github.com/open-telemetry/opentelemetry-dotnet/issues/2524, but more complex. A few attempts were made to achieve this; see PRs: https://github.com/open-telemetry/opentelemetry-dotnet/pull/2466 https://github.com/open-telemetry/opentelemetry-dotnet/pull/2600 https://github.com/open-telemetry/opentelemetry-dotnet/pull/2598

Some observations:

  1. To account for the possibility of a MetricPoint being reclaimed by someone else, the hot path has to use some synchronization mechanism. This significantly affected throughput in all of the above PRs (see the sketch at the end of this comment).
  2. At the root, the issue is "atomically look up which MetricPoint should be used for the given tags, and do the update, without affecting performance".
  3. As shown by the PR attempts above, plain locks or even ReaderWriterLockSlim cannot be used as-is without sacrificing performance.

This requires more investigation and time to address. For the 1.2 release, the focus will be on https://github.com/open-telemetry/opentelemetry-dotnet/issues/2524 and https://github.com/open-telemetry/opentelemetry-dotnet/issues/2358; this issue will be re-investigated post 1.2.
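
For context on observation 1, a minimal sketch (an assumption for illustration, not code from the PRs above) of why reclaim pushes synchronization onto the hot path: measurement updates must be sure a MetricPoint is not reclaimed mid-update, so even the common case pays for a lock acquisition per measurement, while the reclaimer blocks all updates.

```csharp
// Sketch only: a ReaderWriterLockSlim-guarded pool where unused points can be
// returned to a free list. The per-measurement read-lock acquisition is the
// kind of overhead that degraded throughput in the linked PRs.
using System.Collections.Generic;
using System.Threading;

class ReclaimableMetricPointsSketch
{
    private readonly ReaderWriterLockSlim rwLock = new();
    private readonly Dictionary<string, int> tagsToIndex = new();
    private readonly Stack<int> freeIndices = new();
    private readonly long[] deltaSums;

    public ReclaimableMetricPointsSketch(int maxMetricPoints)
    {
        deltaSums = new long[maxMetricPoints];
        for (int i = maxMetricPoints - 1; i >= 0; i--) freeIndices.Push(i);
    }

    // Hot path: even when the tag set is already known, every call pays for a
    // read-lock acquisition to keep the slot from being reclaimed mid-update.
    public bool TryRecord(string serializedTags, long value)
    {
        rwLock.EnterReadLock();
        try
        {
            if (tagsToIndex.TryGetValue(serializedTags, out int index))
            {
                Interlocked.Add(ref deltaSums[index], value);
                return true;
            }
        }
        finally { rwLock.ExitReadLock(); }

        // Slow path: new tag set, allocate a slot under the write lock.
        rwLock.EnterWriteLock();
        try
        {
            if (!tagsToIndex.TryGetValue(serializedTags, out int index))
            {
                if (freeIndices.Count == 0) return false; // no slot available, drop
                index = freeIndices.Pop();
                tagsToIndex[serializedTags] = index;
            }

            Interlocked.Add(ref deltaSums[index], value);
            return true;
        }
        finally { rwLock.ExitWriteLock(); }
    }

    // Collect path: after export, unused points go back to the free list,
    // blocking every concurrent update while the write lock is held.
    public void ReclaimUnused(IEnumerable<string> unusedTagSets)
    {
        rwLock.EnterWriteLock();
        try
        {
            foreach (string tags in unusedTagSets)
            {
                if (tagsToIndex.Remove(tags, out int index))
                {
                    deltaSums[index] = 0;
                    freeIndices.Push(index);
                }
            }
        }
        finally { rwLock.ExitWriteLock(); }
    }
}
```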

reyang commented 2 years ago

Throwing a couple of ideas here:

  1. The synchronization (lock) approach: break down the lock granularity - e.g. instead of having a single R/W lock for each point, have a per-CPU-core/group lock, then decide if there is a need to escalate to a bigger lock (sketched below).
  2. A non-synchronization approach (even better): leverage page faults or segment access from the memory manager to prevent stale writes (and the stale writes should result in a recoverable error + retry).
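
A rough sketch of idea 1 (names here are hypothetical; this is just one way to stripe the lock): updates hash to one of N stripes, roughly one per core, so concurrent updates to different tag sets rarely contend, while the reclaim path "escalates" by acquiring every stripe.

```csharp
// Sketch only: lock striping for metric updates, with an "escalate to a
// bigger lock" path for reclaim that acquires every stripe.
using System;
using System.Threading;

class StripedLockSketch
{
    private readonly object[] stripes;

    public StripedLockSketch(int stripeCount = 0)
    {
        int count = stripeCount > 0 ? stripeCount : Environment.ProcessorCount;
        stripes = new object[count];
        for (int i = 0; i < stripes.Length; i++) stripes[i] = new object();
    }

    // Hot path: only the stripe owning this tag set is locked, so updates to
    // different tag sets usually proceed in parallel.
    public void WithStripe(string serializedTags, Action criticalSection)
    {
        int stripe = (serializedTags.GetHashCode() & 0x7FFFFFFF) % stripes.Length;
        lock (stripes[stripe])
        {
            criticalSection();
        }
    }

    // Reclaim path: take every stripe (in a fixed order to avoid deadlocks)
    // before touching shared MetricPoint state.
    public void WithAllStripes(Action criticalSection) => AcquireFrom(0, criticalSection);

    private void AcquireFrom(int i, Action criticalSection)
    {
        if (i == stripes.Length)
        {
            criticalSection();
            return;
        }

        lock (stripes[i])
        {
            AcquireFrom(i + 1, criticalSection);
        }
    }
}
```
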
tgrieger-sf commented 1 year ago

We are currently hitting an issue where we happen to have a metric with an ever-increasing number of unique tags, which causes us to hit the max metric points per stream very quickly (even after increasing the max to 1,000,000). Is there any planned work around this, or anything we can do to mitigate the issue until something permanent is in place?
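
For reference, the per-stream limit mentioned here is configured on the MeterProviderBuilder. A minimal sketch, assuming the SetMaxMetricPointsPerMetricStream extension available in recent SDK versions (meter name and exporter are placeholders); raising the limit only delays the problem if tag cardinality keeps growing.

```csharp
// Sketch: raise the per-stream MetricPoint limit. Assumes the
// SetMaxMetricPointsPerMetricStream extension; meter name and exporter
// choice are placeholders.
using OpenTelemetry;
using OpenTelemetry.Metrics;

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyService")
    .SetMaxMetricPointsPerMetricStream(1_000_000)
    .AddConsoleExporter()
    .Build();
```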

alanwest commented 1 year ago

Cardinality issues can be tricky. Depending on your scenario, the ability for the SDK to reclaim metric points may or may not help resolve your issue.

If there are known keys that have a large or infinite set of values, one solution may be to use the View API to filter to a specific set of keys, thereby excluding the problematic ones.

We have an example here:

https://github.com/open-telemetry/opentelemetry-dotnet/blob/9c05eeaf9b2f6f19df9924f0ae9b64d870756cb6/docs/metrics/customizing-the-sdk/Program.cs#L42-L43
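
Inline, a minimal sketch of that View-based filtering (instrument and tag names are placeholders mirroring the linked example):

```csharp
// Sketch: keep only a bounded tag key via the View API so unbounded keys
// never create MetricPoints. Instrument and tag names are placeholders.
using OpenTelemetry;
using OpenTelemetry.Metrics;

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyService")
    .AddView("MyFruitCounter", new MetricStreamConfiguration
    {
        // Tags other than "name" are dropped before aggregation.
        TagKeys = new string[] { "name" },
    })
    .AddConsoleExporter()
    .Build();
```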

reyang commented 1 year ago

> We are currently hitting an issue where we happen to have a metric with an ever-increasing number of unique tags, which causes us to hit the max metric points per stream very quickly (even after increasing the max to 1,000,000). Is there any planned work around this, or anything we can do to mitigate the issue until something permanent is in place?

@tgrieger-sf could you share more about the scenario? (e.g. why do you need to put "an ever-increasing amount of unique tags" as metrics dimensions)

tgrieger-sf commented 1 year ago

@alanwest Unfortunately, for our scenario, we need all of the tags that are coming through (we're the ones creating the data).

@reyang In our scenario, we are using an observable gauge to report the log consumption of various tables in our SQL instances. We don't directly control these tables, and there are thousands that get created and dropped in any given hour, so we rack up many unique tags in no time.

reyang commented 1 year ago

@tgrieger-sf if these tables are created/dropped quickly, how do you use them as a metrics dimension? (e.g. is the goal to get the unique count of tables?)

tgrieger-sf commented 1 year ago

@reyang Stats are collected every few minutes. Tables live long enough to be caught during the aggregation in our service. The goal isn't to get a unique count of the tables, but rather to determine which parts of the application (by looking at the tables) are generating the most log. We can't get any less granular than this, unfortunately.

reyang commented 1 year ago

@tgrieger-sf I wonder if "which parts of the application" has a limited/finite number of parts that can be derived from the ever-increasing names; if yes, could that be used as the actual dimension?

tgrieger-sf commented 1 year ago

@reyang You're not wrong, but the problem is that mapping isn't defined (it's being worked on but isn't a priority and won't be ready anytime soon).

reyang commented 1 year ago

@tgrieger-sf based on the discussion, I feel that the "right" approach seems to be using exemplars to sample the table names, and avoid using the table names as a dimension.

tgrieger-sf commented 1 year ago

@reyang I appreciate the conversation and I'll look into exemplars. Thanks!

idoublefirstt commented 1 year ago

@reyang Any update on this? I'm an internal customer. We are trying to migrate to OpenTelemetry to emit SLI/SLO data which contains ArmId as part of the metrics dimensions. Resources get created and deleted daily. It would be great to be able to reclaim metric points for deleted ArmIds to save some memory consumption. Our service is quite large. With the current design, it seems like we'd need to restart our monitoring service once a week to clean up memory usage.

reyang commented 1 year ago

@idoublefirstt this is a top item we plan to address before the next release https://github.com/open-telemetry/opentelemetry-dotnet/milestone/36. The actual time might vary depending on the prioritization and bandwidth.

yoziv commented 1 year ago

I would also like to strengthen the case for this. In another internal group, we have a big security-related service that greatly needs this ability, as we also need to support a kind of GUID as a dimension for our purposes, so this feature is very much needed.

cijothomas commented 1 year ago

Please follow this issue to be updated about this. Though it's treated as high priority, it's not really in the 1.5 or 1.6 milestone. It's still high priority (no doubt!), but it is unlikely to land in the 1.5 milestone. I'll propose making this part of the 1.6 milestone (which releases later this year).

utpilla commented 1 year ago

@tgrieger-sf @idoublefirstt @yoziv and to anyone else who has expressed interest in this feature, please try out the 1.7.0-alpha.1 version of OpenTelemetry SDK which offers this feature. I'd really appreciate any feedback.

You could check PR #4486 to know more about this feature.

utpilla commented 11 months ago

There is an update in the Metrics SDK behavior starting with the latest stable 1.7.0 release. Instead of making this the default behavior of the SDK, we have offered it as an experimental, opt-in-only feature. You can opt in to enable this behavior by setting OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS to true, either using an environment variable or through IConfiguration. Check #5052 for more details.
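
A minimal sketch of opting in via the environment-variable route (the IConfiguration route follows the SDK's usual configuration binding); meter name and exporter are placeholders:

```csharp
// Sketch: enable the experimental reclaim behavior (1.7.0) by setting the
// feature flag before the MeterProvider is built.
using System;
using OpenTelemetry;
using OpenTelemetry.Metrics;

Environment.SetEnvironmentVariable(
    "OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS", "true");

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyService")   // placeholder meter name
    .AddConsoleExporter()
    .Build();
```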

CodeBlanch commented 8 months ago

I'm going to close this because the feature work is done. I opened #5443 for future work to make it stable.