open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[exporter/prometheus] Limit prometheusexporter metric cache size to protect against destabilizing memory use #34938

Open swar8080 opened 1 month ago

swar8080 commented 1 month ago

Component(s)

exporter/prometheus

Is your feature request related to a problem? Please describe.

There's currently no limit on the number of metric series cached by this component. This increases the risk of destabilizing collectors through high memory use during a cardinality explosion or a gradual increase in cardinality.

The cache's metric_expiration/TTL option is useful, but it's hard to tune correctly. If the expiration is too short you get a lot of counter resets, which hurts usability with PromQL; if it's too long you increase the risk of high memory use.

In our Prometheus setup, we configure a sample_limit on the scrape targets for our collectors to reduce the impact of a cardinality explosion on Prometheus health and on cost. Ideally we could set the collector cache's max size slightly higher than the sample_limit.
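For context, a rough sketch of the setup described above; endpoints and numbers are illustrative, not our real values.

```yaml
# prometheus.yml on the Prometheus server: scrape the collector's
# prometheusexporter endpoint with a cap on samples per scrape
scrape_configs:
  - job_name: otel-collector
    scrape_interval: 30s
    sample_limit: 50000
    static_configs:
      - targets: ["otel-collector:9464"]
```

```yaml
# Collector config: today the only way to bound the exporter's series
# cache is the expiration TTL
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
    metric_expiration: 60m
```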

Describe the solution you'd like

An optional configuration to limit the size of this component's metric cache. When the capacity is exceeded, it could prioritize discarding the datapoints with the oldest timestamps. Temporarily exceeding the cache size is probably okay if it makes the implementation easier or more performant, similar to how expired items aren't purged until the next scrape.
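As a rough sketch of what that could look like (the max_cached_metrics name below is hypothetical, not an existing option):

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
    metric_expiration: 60m
    # Hypothetical option: when the cache grows past this many series, evict
    # the ones whose datapoints have the oldest timestamps until it fits again
    max_cached_metrics: 60000
```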

Describe alternatives you've considered

No response

Additional context

I can take a stab at implementing this. At first glance, golang-lru might allow for the "last updated" purging strategy, but I'd need to dig deeper into its thread safety and performance.
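To illustrate the idea, a minimal standalone sketch (not the exporter's actual accumulator types) assuming hashicorp/golang-lru/v2: updates go through Add so recency tracks the last write, while scrapes read with Peek so merely being collected doesn't keep a stale series alive.

```go
package main

import (
	"fmt"
	"time"

	lru "github.com/hashicorp/golang-lru/v2"
)

// series is a stand-in for the exporter's accumulated value; the real
// accumulator stores pmetric data keyed by a metric/label signature.
type series struct {
	value     float64
	updatedAt time.Time
}

func main() {
	const maxSeries = 3 // hypothetical cache limit, tiny for demonstration

	// golang-lru/v2 is size-bounded and thread-safe out of the box;
	// New only errors on a non-positive size.
	cache, err := lru.New[string, series](maxSeries)
	if err != nil {
		panic(err)
	}

	// Updates call Add, which refreshes recency, so eviction approximates
	// "discard the series that was updated least recently".
	for i, key := range []string{"a", "b", "c", "a", "d"} {
		evicted := cache.Add(key, series{value: float64(i), updatedAt: time.Now()})
		if evicted {
			fmt.Printf("adding %q evicted the least recently updated series\n", key)
		}
	}

	// Scrapes read with Peek so that being collected does not protect a
	// stale series from eviction.
	for _, key := range cache.Keys() {
		if s, ok := cache.Peek(key); ok {
			fmt.Printf("%s=%v\n", key, s.value)
		}
	}
}
```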

github-actions[bot] commented 1 month ago

Pinging code owners:

dashpole commented 1 month ago

I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so a single failed scrape doesn't trigger counter resets.
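In config terms, assuming Prometheus scrapes the exporter every 30s, that rule of thumb would look roughly like:

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
    # ~2x the 30s scrape interval: one missed scrape doesn't expire series
    # and cause counter resets, but stale series still age out quickly
    metric_expiration: 1m
```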

If you are using the prometheus receiver to scrape your targets, this shouldn't matter. Staleness markers from the target disappearing in the receiver should cause the series to be removed from the cache. E.g. https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/c6cda872d340228092a3fb717081e5525dd0f1e8/exporter/prometheusexporter/accumulator.go#L106

swar8080 commented 3 weeks ago

> I think the ideal TTL would be 2x the interval at which you are scraping the prometheus exporter, so a single failed scrape doesn't trigger counter resets.
>
> If you are using the prometheus receiver to scrape your targets, this shouldn't matter. Staleness markers from the target disappearing in the receiver should cause the series to be removed from the cache. E.g. https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/c6cda872d340228092a3fb717081e5525dd0f1e8/exporter/prometheusexporter/accumulator.go#L106

That suggestion helps a lot. We're using OTLP to push cumulative metrics to the collector. I didn't realize the OTel SDK sends a data point even if it hasn't been updated, so we don't have to worry about excessive counter resets. Combining that with the SDK cardinality limit should protect against label cardinality explosions.

The use case I'm not sure about is span metrics sent to the prometheusexporter. We've had a few cardinality explosions caused by unique span names. The span metrics connector can either emit span counts as deltas as spans come in, or keep track of cumulative counts that are re-emitted on an interval like an app SDK would. With delta temporality there are a lot of counter resets for infrequent series if the prometheusexporter TTL is low, so we're currently using a TTL of 24h. Cumulative span metrics solve the counter reset problem and allow a low prometheusexporter TTL, but they shift the unbounded memory problem to the span metrics connector's own cache, since it also only supports expiration by TTL. So adding a series limit to the span metrics connector could be another solution.
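To make the trade-off concrete, a hedged sketch of the two spanmetrics connector modes being compared; option names are from my reading of the connector's README and the values are illustrative.

```yaml
connectors:
  # Delta temporality: counts are emitted as spans arrive, so a low
  # prometheusexporter TTL causes counter resets for infrequent series
  spanmetrics/delta:
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"

  # Cumulative temporality: counts are re-emitted on an interval like an app
  # SDK, but the connector's own cache is then bounded only by its TTL
  spanmetrics/cumulative:
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_expiration: 24h
```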