gbxavier opened 2 months ago
Pinging code owners:
receiver/prometheus: @Aneurysm9 @dashpole
See Adding Labels via Comments if you do not have permissions to add labels yourself.
This doesn't look like an issue with the receiver. Does increasing the number of consumers help?
No. It does not help. I think it's a performance issue with the exporter. After further experimentation, increasing the batch size to 50000 stopped me from having problems with the queue.
Looks like the number of consumers is hard-coded to 1, which would explain why bumping that up doesn't help...
We should probably emit a warning for people who try to configure this.
So increasing the batch size is probably the only resolution we can offer, which you found.
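The workaround above can be sketched as a collector config. This assumes the "batch size" mentioned in the thread refers to the batch processor's send_batch_size (the thread does not name the exact setting); 50000 is the value reported to have resolved the queue growth, and the pipeline component names are illustrative:

```yaml
# Sketch only: assumes the batch processor is the batching knob being tuned.
processors:
  batch:
    send_batch_size: 50000   # value reported in this thread to stop queue growth

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```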
Action items:
@ArthurSens
I'm struggling a bit to understand how to log a warning. Which object has a logger object during the exporter creation? I can't find any 🤔
You can use exporter.Settings: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/c6cda872d340228092a3fb717081e5525dd0f1e8/exporter/prometheusremotewriteexporter/factory.go#L38
That is https://pkg.go.dev/go.opentelemetry.io/collector/exporter@v0.108.1/internal#Settings
It includes a Logger: https://pkg.go.dev/go.opentelemetry.io/collector/component#TelemetrySettings
So you can do set.Logger.Info(...).
We started getting this warning after upgrading to v0.111, and found it and the README slightly misleading. The value of num_consumers is still used to control concurrency after PRWE's own batching:
Thanks @ubcharron. @ArthurSens we should revert https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34993
Whoops, my bad. Here is the PR: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35845
Component(s)
exporter/prometheusremotewrite, receiver/prometheus
What happened?
Description
I have a dedicated instance that scrapes metrics from pods exposing Prometheus endpoints. The instance has more than enough resources to process a large volume of metrics (resource utilization never exceeds 60%). In this setup, most namespaces have Network Policies that prevent this scraper from reaching the pods found via Kubernetes autodiscovery.
At first, the metrics reach the target (Grafana Cloud) as expected, but memory consumption immediately begins to grow, and the queue size grows slowly until it reaches capacity and enqueuing starts failing.
The amount of metrics received and sent remains constant, but over time the delay between a metric being "seen" by the collector and sent to the backend slowly grows, to the point that the last observed data point is hours late (though still received by the backend). This behavior is observed by all receivers configured in the instance, including the prometheus/self instance, which has no problem scraping its metrics. It only happens when workload_prometheus is enabled; no other instance suffers from this problem or from any performance/limit issues.
Steps to Reproduce
Expected Result
The receiver scrapes the metrics from the endpoints it can reach and those metrics are correctly sent through the Prometheus Remote Write Exporter reasonably fast.
Actual Result
Memory consumption increases over time, and the delay between a metric being "seen" by the collector and its arrival at the backend slowly grows until the last observed data point is hours late.
Collector version
0.105.0
Environment information
Environment
OS: AKSUbuntu-2204gen2containerd-202407.03.0
Kubernetes: Azure AKS v1.29.2
OpenTelemetry Collector configuration
Log output
Additional context
The resource utilization is low, but memory grows over time up to 50% of the configured limit, specified below.
Screenshot with metrics from this scraper instance.