Thanks for the detailed report. I have spent some time over the weekend attempting to reproduce this and was partially successful. I set up a single prometheus server behind promxy with the following recording rule:
groups:
  - name: test
    rules:
      - record: test_record_metric
        expr: bottomk(1, prometheus_engine_query_duration_seconds_sum)
and the following prometheus config:
global:
  scrape_interval: "20s"
  evaluation_interval: "5s"
  scrape_timeout: "5s"

scrape_configs:
  - job_name: "prom"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - 127.0.0.1:9090
  - job_name: "remote_write_exporter"
    metrics_path: "/metrics"
    scrape_interval: 1m
    static_configs:
      - targets:
          - 127.0.0.1:8083
And with this configuration I was unable to reproduce the behavior. However, once I set --query.lookback-delta to a much shorter value (the default is 5m), I was able to reproduce the behavior you were describing. Unfortunately I'm not sure whether this is the same situation you are in, but the lookback-delta explanation makes sense: Prometheus has a single lookback-delta configured, and in this case the "source" data arrives every 20s whereas the recorded data arrives every 1m, so if the lookback-delta is shorter than a minute you can get holes in the graphs.
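For reference, reproducing it locally only required starting the prometheus under test with a shortened lookback window, roughly like this (30s is just an illustrative value; anything shorter than the 1m interval of the recorded data should show the gaps):

  # run prometheus with a lookback window shorter than the 1m interval
  # at which the recorded metric is scraped; 30s is only an example value
  prometheus \
    --config.file=prometheus.yaml \
    --query.lookback-delta=30s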
@glightfoot Do you have --query.lookback-delta set on either prometheus or promxy? If not, any ideas on the delta between our setups?
Thanks for taking a look! No, I don't have lookback-delta configured anywhere, but perhaps this has to do with metric-ttl? Here are the configs for our Prometheus, Promxy, and Remote-write-exporter containers:
Prometheus
- name: prometheus
  image: {{ $.Values.prometheusImage }}
  imagePullPolicy: IfNotPresent
  command:
    - prometheus
  args:
    - --config.file=/etc/prometheus/prometheus.yaml
    - --storage.tsdb.path=/prometheus
    - --storage.tsdb.retention.time=90d
    - --storage.tsdb.wal-compression
    - --web.external-url={{ $.Values.externalUrl }}
    - --web.enable-lifecycle
    - --web.enable-admin-api
    - --log.format=json
Promxy
- name: promxy
  image: {{ $.Values.promxyImage }}
  imagePullPolicy: IfNotPresent
  command:
    - /usr/bin/promxy
  args:
    - --config=/etc/promxy/config.yaml
    - --bind-addr=":{{ $.Values.promxyPort }}"
    - --web.external-url={{ $.Values.externalUrl }}
    - --web.enable-lifecycle
    - --query.max-samples=50000000
    - --query.timeout=3m
    - --access-log-destination=stdout
    - --log-level=info
    - --http.shutdown-timeout=60s
RWE
- name: remote-write-exporter
  image: {{ $.Values.promxyImage }}
  imagePullPolicy: IfNotPresent
  command:
    - /usr/bin/remote_write_exporter_mc
  args:
    - --bind-addr=":{{ $.Values.remoteWriteExporterPort }}"
    - --metric-ttl=5m
The TTL there should be fine as long as it's longer than the expected send interval. In this case 5m should be sufficient, as it's much longer than the expected 1m.
Unfortunately I'm not seeing any differences in our configs (other than metric retention). Are you able to repro locally with containers or something?
I'm going to close this one out as it's been silent for a while. If this is still an issue, please feel free to re-open.
We recently enabled some global recording rules to generate cross-dc metrics, and are noticing some inconsistencies in the values written to the remote-write-exporter. I'll try to illustrate the issues. The rules in question query metrics recorded on the individual prometheus HA pairs and then do topk and bottomk on them. The rules probably don't need to exist this way, but the tool consuming them is simple and expects them this way. Either way, the problem seems to be legit. I've edited the text metric names to remove some internal information, and blocked them out in the screenshots.
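Shape-wise, the rules look roughly like the sketch below (rule and metric names are placeholders standing in for the redacted ones):

  groups:
    - name: global-rules
      rules:
        # placeholder names; the real rule records a cross-dc bottomk
        # over a metric already recorded on the individual prometheus HA pairs
        - record: crossdc:example_utilization:bottomk5
          expr: bottomk(5, dc:example_utilization:ratio)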
Here is one of the recording rules:
When run from the graph page on promxy, we get the expected results, with no gaps or NaNs:
However, when looking at the recorded metric (scraped by one of the prom HA pairs every 60s, the same as the recording rule interval), we see gaps and NaNs (returned for series that should evaluate to the lowest utilization):
I added a debug endpoint to the remote write exporter to get a look at the samples in memory, and this is what we see (edited to only show the pertinent metrics):
Promxy Version: 0.0.54
Remote_write_exporter Version: 0.0.54 (modified to add a debug endpoint)
Prometheus Version: 2.13.0
Promxy Config:
Additional Info:
I'm not sure where to go from here, since the query page seems to always return the correct results with no gaps, and debug logging doesn't seem to produce anything useful.
Thanks again for this incredible project! Let me know if you need any more debugging info