Open swar8080 opened 1 month ago
Pinging code owners:
exporter/prometheus: @Aneurysm9 @dashpole
I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so a single failed scrape doesn't trigger counter resets.
If you are using the prometheus receiver to scrape your targets, this shouldn't matter. Staleness markers from the target disappearing in the receiver should cause the series to be removed from the cache. E.g. https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/c6cda872d340228092a3fb717081e5525dd0f1e8/exporter/prometheusexporter/accumulator.go#L106
> I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so a single failed scrape doesn't trigger counter resets.
>
> If you are using the prometheus receiver to scrape your targets, this shouldn't matter. Staleness markers from the target disappearing in the receiver should cause the series to be removed from the cache.
That suggestion helps a lot. We're using OTLP to push cumulative metrics to the collector. I didn't realize the OTel SDK sends a data point even if it hasn't been updated, so we don't have to worry about excessive counter resets. Combining that with the SDK cardinality limit should protect against label cardinality explosions.

The use case I'm not sure about is span metrics sent to prometheusexporter. We've had a few cardinality explosions caused by unique span names. The span metrics connector can either emit span counts as deltas as spans come in, or keep track of cumulative counts that are re-emitted on an interval, like an app SDK would. If emitting delta temporality, there are a lot of counter resets for infrequent series when the prometheusexporter TTL is low. We're currently using a TTL of 24h. Using cumulative span metrics solves the counter reset problem and allows a low prometheus TTL, but it shifts the unbounded memory problem to the span metrics cache, since that also only supports a TTL. So adding a series limit in the span metrics connector could be another solution.
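A collector config sketch of the two knobs being traded off here, assuming the current spanmetrics connector and prometheus exporter options; all durations are illustrative values, not recommendations:

```yaml
connectors:
  spanmetrics:
    # Cumulative counts are re-emitted each interval, like an app SDK would,
    # avoiding counter resets for infrequent series in the exporter cache.
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    # The connector's own cache is also only TTL-bounded.
    metrics_expiration: 15m

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    # With cumulative span metrics, a lower TTL becomes viable.
    metric_expiration: 10m
```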
Component(s)
exporter/prometheus
Is your feature request related to a problem? Please describe.
There is currently no limit on the number of metric series cached by this component. During a cardinality explosion, or even a gradual increase in cardinality, the cache's memory usage can grow enough to destabilize collectors.
The cache's `metric_expiration`/TTL option is useful, but it's hard to tune correctly: if the expiration is too short you get a lot of counter resets, which hurts usability with PromQL; if it's too long you increase the risk of high memory usage.

In our prometheus set-up, we configure a `sample_limit` on collector scrape targets to reduce the impact of a cardinality explosion on prometheus health and on cost. Ideally we could set the collector cache's max size slightly higher than the `sample_limit`.
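For context, `sample_limit` is set per scrape job in the Prometheus scrape configuration, and a scrape that exceeds it is failed entirely. A sketch, with an illustrative target address and limit:

```yaml
scrape_configs:
  - job_name: "otel-collector"
    sample_limit: 10000            # the whole scrape fails if exceeded
    static_configs:
      - targets: ["collector:8889"]
```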
Describe the solution you'd like
An optional configuration to limit the size of this component's metric cache. When the capacity is exceeded, it could prioritize discarding the data points with the oldest timestamps. Temporarily exceeding the cache size is probably acceptable if that makes the implementation easier or more performant, similar to how expired items aren't purged until the next scrape.
Describe alternatives you've considered
No response
Additional context
I can take a stab at implementing this. At first glance, golang-lru might support the "last updated" purging strategy, but I'd need to dig deeper into its thread safety and performance.