tqi-raurora opened this issue 13 hours ago
Pinging code owners:
exporter/prometheus: @Aneurysm9 @dashpole
dashpole: How often is the metric being collected? There is an expiration time in the exporter for old data points.
Hi @dashpole, thanks for your response.
Metrics are scraped every minute.
The expiration time on the Prometheus exporter is set to 10 minutes.
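For context, the relevant exporter block looks roughly like this (a trimmed sketch: the port matches the curl below, the expiration matches the 10 minutes above; the bind address is assumed and the rest of the config is omitted):

exporters:
  prometheus:
    endpoint: "0.0.0.0:19130"   # assumed bind address, port as used in the curl below
    metric_expiration: 10m      # expiration for stale points, as mentioned above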
However, when I was testing I actually didn't use the Prometheus scrape: instead I logged into the host where the OTel Collector runs and used curl against localhost to check the Prometheus exporter endpoint directly, like:
curl -s http://localhost:19130/metrics
I ran this curl multiple times within the same minute. I did this to simulate a scrape and to rule out possible problems with the Prometheus scraping process.
When I filter out the metric below, the problem goes away:
http.server.duration{service.name=ps-sac-fe}
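For reference, one way to express that drop is a filter processor rule along these lines (a minimal sketch; the processor name is illustrative):

processors:
  filter/drop-suspect:
    error_mode: ignore
    metrics:
      metric:
        # drop the histogram only for this service
        - 'name == "http.server.duration" and resource.attributes["service.name"] == "ps-sac-fe"'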
I am trying to simulate a payload equal to the one that seems to be causing the issue; if I'm successful, I will report it here.
Component(s)
exporter/prometheus
What happened?
Description
On OpenTelemetry Collector Contrib 0.111.0 running as a systemd service, when exposing metrics with the Prometheus exporter, I can see that some metrics are sometimes missing.
For example, http_server_duration_milliseconds_count would most times return 44 samples, but sometimes return only 6, even when running the test within the same second.
I tested this with curl to rule out possible scraping errors.
This causes the scrape to sometimes have missing data seemingly at random, resulting in "gaps" in the data.
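For completeness, the scrape side is nothing special; assuming a plain static scrape config against the exporter endpoint, it looks roughly like this (job name and target are illustrative, the interval matches the one-minute scrape mentioned above):

scrape_configs:
  - job_name: otel-collector        # illustrative name
    scrape_interval: 1m             # metrics are scraped every minute
    static_configs:
      - targets: ["localhost:19130"]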
After digging around, I isolated the issue to at least one histogram metric. When I filter out this metric, the issue goes away:
http.server.duration{service.name=ps-sac-fe}
In other words, it seems this histogram is somehow breaking the Prometheus exporter.
Steps to Reproduce
This issue happened in a production collector. I'm still not sure why it's happening, but I exported the metric that seems to be the culprit for debugging and added it to the "Log output" section. I am not sure what is wrong with the metric, but when I filter it out, the issue does go away.
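For reference, capturing just that metric can be done with a pipeline along these lines (a sketch; receiver and processor names are assumptions, the idea is only to keep the suspect histogram and dump it with the debug exporter):

processors:
  filter/keep-suspect:
    error_mode: ignore
    metrics:
      metric:
        # drop everything except the suspect histogram
        - 'name != "http.server.duration"'
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics/debug:
      receivers: [otlp]             # assumed receiver
      processors: [filter/keep-suspect]
      exporters: [debug]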
Expected Result
Multiple curls to localhost/metrics should consistently return the same number of time series.
Actual Result
Multiple curls to localhost/metrics return a different number of time series: some time series are missing seemingly at random, even when the requests are made within the same minute or even the same second.
If any other test is needed, please let me know.
Collector version
v0.111.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
Additional context
No response