This issue is slightly related to https://github.com/open-telemetry/opentelemetry-dotnet/issues/2524, but more complex. A few attempts were made to achieve this; see PRs: https://github.com/open-telemetry/opentelemetry-dotnet/pull/2466 https://github.com/open-telemetry/opentelemetry-dotnet/pull/2600 https://github.com/open-telemetry/opentelemetry-dotnet/pull/2598
Some observations:
ReaderWriterLockSlim cannot be used as-is without sacrificing performance. This requires more investigation and time to address. For the 1.2 release, https://github.com/open-telemetry/opentelemetry-dotnet/issues/2524 and https://github.com/open-telemetry/opentelemetry-dotnet/issues/2358 will be the focus, and this issue will be re-investigated post-1.2.
Throwing out a couple of ideas here:
We are currently hitting an issue where we happen to have a metric with an ever-increasing number of unique tags, which causes us to hit the max metric points per stream very quickly (even after increasing the max to 1,000,000). Is there any planned work around this, or anything we can do to mitigate the issue until something permanent is in place?
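For reference, the per-stream limit mentioned above is set when building the MeterProvider. Below is a minimal sketch, assuming the SetMaxMetricPointsPerMetricStream extension the SDK offered at the time and a hypothetical meter name:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Raise the per-stream MetricPoint limit (hypothetical meter name).
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyService")
    .SetMaxMetricPointsPerMetricStream(1_000_000)
    .Build();
```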
Cardinality issues can be tricky. Depending on your scenario, the ability for the SDK to reclaim metric points may or may not help resolve your issue.
If there are known keys that have a large or infinite set of values, one solution may be to use the View API to filter to a specific set of keys, thereby excluding the problematic ones.
We have an example here:
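The linked example aside, here is a minimal sketch of a View that restricts tag keys, assuming hypothetical meter, instrument, and tag-key names:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Keep only the listed tag keys for this instrument; any other keys
// (e.g. a high-cardinality "table.name") are dropped from the metric stream.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.SqlLogMonitor")
    .AddView(
        "sql.table.log_consumption_bytes",
        new MetricStreamConfiguration
        {
            TagKeys = new[] { "database.name", "region" },
        })
    .Build();
```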
@tgrieger-sf could you share more about the scenario? (e.g. why do you need to put "an ever-increasing amount of unique tags" as metrics dimensions)
@alanwest Unfortunately, for our scenario, we need all of the tags that are coming through (we're the ones creating the data).
@reyang In our scenario, we are using an observable gauge to report the log consumption of various tables in our SQL instances. We don't directly control these tables, and there are thousands that get created and dropped in any given hour, so we rack up many unique tags in no time.
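For illustration only, the pattern described above might look roughly like this; the meter, instrument, tag, and helper names below are hypothetical:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// An observable gauge reporting per-table log consumption. The "table.name"
// tag is what grows without bound as tables are created and dropped.
var meter = new Meter("MyCompany.SqlLogMonitor");

meter.CreateObservableGauge<long>(
    "sql.table.log_consumption_bytes",
    () =>
    {
        var measurements = new List<Measurement<long>>();
        foreach (var (tableName, bytes) in GetTableLogStats()) // hypothetical data source
        {
            measurements.Add(new Measurement<long>(
                bytes,
                new KeyValuePair<string, object?>("table.name", tableName)));
        }

        return measurements;
    },
    unit: "By");

// Hypothetical stand-in for querying SQL log usage per table.
static IEnumerable<(string TableName, long Bytes)> GetTableLogStats() =>
    new[] { ("dbo.Orders_2024_01_15", 1024L), ("dbo.Orders_2024_01_16", 2048L) };
```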
@tgrieger-sf if these tables are created/dropped quickly, how do you use them as a metrics dimension? (e.g. is the goal to get the unique count of tables?)
@reyang Stats are collected every few minutes. Tables live long enough to be caught during aggregation in our service. The goal isn't to get a unique count of the tables, but rather to determine which parts of the application (by looking at the tables) are generating the most log. We can't get any less granular than this, unfortunately.
@tgrieger-sf I wonder whether "which parts of the application" maps to a limited/finite number of parts that can be derived from the ever-increasing names; if so, could that be used as the actual dimension?
@reyang You're not wrong but the problem is that's not defined (it's being worked on but isn't a priority and won't be ready anytime soon).
@tgrieger-sf based on the discussion, I feel that the "right" approach is to use exemplars to sample the table names and to avoid using the table names as a dimension.
@reyang I appreciate the conversation and I'll look into exemplars. Thanks!
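For anyone landing here later, a minimal sketch of the exemplar direction, assuming the exemplar support that later stabilized in the SDK (1.9+) and a hypothetical meter name:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Trace-based exemplars capture sampled raw measurements (including their tags,
// such as a table name) without turning those values into metric dimensions.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.SqlLogMonitor")
    .SetExemplarFilter(ExemplarFilterType.TraceBased)
    .Build();
```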
@reyang Any update on this? I'm an internal customer. We are trying to migrate to OpenTelemetry to emit SLI/SLO data which contains ArmId as part of the metrics dimensions. Resources get created and deleted daily. It would be great to be able to reclaim metric points for deleted ArmIds to save some memory. Our service is quite large. With the current design, it seems like we'd need to restart our monitoring service once a week to clean up memory usage.
@idoublefirstt this is a top item we plan to address before the next release https://github.com/open-telemetry/opentelemetry-dotnet/milestone/36. The actual time might vary depending on the prioritization and bandwidth.
I would also like to strengthen the case for this. In another internal group, we have a large security-related service that greatly needs this ability, as we also need to support a GUID-like value for our purposes, so this feature is greatly needed.
Please follow this issue to be updated about this. Though it's treated as high priority, it's not really in the 1.5 or 1.6 milestone. It's still high priority (no doubt!), but it is unlikely to land in the 1.5 milestone. I'll propose making this part of the 1.6 milestone (which releases later this year).
@tgrieger-sf @idoublefirstt @yoziv and to anyone else who has expressed interest in this feature, please try out the 1.7.0-alpha.1 version of the OpenTelemetry SDK, which offers this feature. I'd really appreciate any feedback. You can check PR #4486 to learn more about this feature.
There is an update to the Metrics SDK behavior starting with the latest stable 1.7.0 release. Instead of making this the default behavior of the SDK, we have offered it as an experimental, opt-in-only feature. You can opt in to this behavior by setting OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS to true, either using an environment variable or through IConfiguration. Check #5052 for more details.
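A minimal sketch of the two opt-in routes, assuming a hypothetical meter name (the IConfiguration route relies on registering the configuration with the SDK's service collection):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Option 1: environment variable, set before the MeterProvider is built.
Environment.SetEnvironmentVariable(
    "OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS", "true");

// Option 2: IConfiguration registered with the SDK's service collection.
var configuration = new ConfigurationBuilder()
    .AddInMemoryCollection(new Dictionary<string, string?>
    {
        ["OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS"] = "true",
    })
    .Build();

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .ConfigureServices(services => services.AddSingleton<IConfiguration>(configuration))
    .AddMeter("MyCompany.MyService") // hypothetical meter name
    .Build();
```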
I'm going to close this because we have the feature work done. I opened #5443 for future work to make it stable.
Each Metric keeps track of its MetricPoints. These points are never removed, even if there are no new updates to them. For delta aggregation: we should reclaim any unused MetricPoints when we hit the MetricPoint limit. Due to the pre-allocation of MetricPoints in the current implementation, there won't be any memory savings from this, but we'll be able to accommodate more MetricPoints by reclaiming unused ones. This would make the MetricPoint limit a "limit of active points" rather than a "limit of points seen since the beginning of the process".
For cumulative aggregation: it's possible to "forget" an unused MetricPoint. However, if the forgotten point is reported again, we won't be able to calculate the cumulative value since process start. This would result in a reset (i.e., the start time will be reset).
EDIT by utpilla:
This issue has been updated to track MetricPoint reclaim for Delta aggregation temporality only.
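Since the reclaim behavior described above applies to delta temporality, here is a minimal sketch of configuring a reader for delta, assuming the OTLP exporter package and a hypothetical meter name:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Configure the metric reader for delta aggregation temporality so that
// unused MetricPoints become eligible for reclaim (when the opt-in is enabled).
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyService")
    .AddOtlpExporter((exporterOptions, readerOptions) =>
    {
        readerOptions.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
    })
    .Build();
```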