jacksontj / promxy

An aggregating proxy to enable HA prometheus
MIT License

Inconsistency and Gaps in Recording Rules #235

Closed · glightfoot closed this 4 years ago

glightfoot commented 4 years ago

We recently enabled some global recording rules to generate cross-DC metrics, and we're noticing inconsistencies in the values written to the remote-write-exporter. I'll try to illustrate the issues. The rules in question query metrics recorded on the individual Prometheus HA pairs and then run topk and bottomk on them. The rules probably don't need to exist in this form, but the tool consuming them is simple and expects them this way. Either way, the problem seems to be legit. I've edited the metric names in the text to remove some internal information, and blocked them out in the screenshots.

Here is one of the recording rules:

  - record: <redacted>:shard:lowest:three:memory:utilization
    expr: |
      bottomk(3,
        (
          <redacted>:batch:memory:utilization:7d
          + ignoring(subproject) <redacted>:app:memory:utilization:7d
        ) / 2
      )

When run from the graph page on promxy, we get the expected results, with no gaps or NaNs:

[Screenshot: Screen Shot 2019-11-07 at 12 43 56 PM]

However, when looking at the recorded metric (scraped by one of the prom HA pairs every 60s, the same as the recording rule interval), we see gaps and NaNs returned for series that should evaluate to the lowest utilization:

[Screenshot: Screen Shot 2019-11-07 at 12 49 22 PM]

I added a debug endpoint to the remote write exporter to get a look at the samples in memory, and this is what we see (edited to show only the pertinent metrics):

curl localhost:8083/debug
struct { BindAddr string "long:\"bind-addr\" description:\"address to listen on\" default:\":8083\""; WritePath string "long:\"write-path\" description:\"url path\" default:\"/receive\""; WriteTextPath string "long:\"write-text-path\" description:\"url path\" default:\"/receive_text\""; MetricsPath string "long:\"metrics-path\" description:\"url path\" default:\"/metrics\""; DebugPath string "long:\"debug-path\" description:\"debug url path\" default:\"/debug\""; TTL time.Duration "long:\"metric-ttl\" description:\"how long until we TTL things out of the map\" required:\"true\"" }{
  BindAddr: ":8083",
  WritePath: "/receive",
  WriteTextPath: "/receive_text",
  MetricsPath: "/metrics",
  DebugPath: "/debug",
  TTL: 300000000000,
}map[string]*prompb.Sample{
  "<redacted>:shard:lowest:three:memory:utilization{sg=\"prod-cluster1\", shard=\"us1\", source=\"promxy-prod-local\"}": &prompb.Sample{
    Value: 18.754261104268178,
    Timestamp: 1573149095294,
  },
  "<redacted>:shard:lowest:three:memory:utilization{sg=\"prod-cluster1\", shard=\"us4\", source=\"promxy-prod-local\"}": &prompb.Sample{
    Value: 19.128453196520148,
    Timestamp: 1573149095294,
  },
  "<redacted>:shard:lowest:three:memory:utilization{sg=\"prod-cluster1\", shard=\"us5\", source=\"promxy-prod-local\"}": &prompb.Sample{
    Value: 18.18628483908949,
    Timestamp: 1573149095294,
  },
  "<redacted>:shard:lowest:three:memory:utilization{sg=\"prod-cluster2\", shard=\"us11\", source=\"promxy-prod-local\"}": &prompb.Sample{
    Value: NaN,
    Timestamp: 1573149095294,
  },
  "<redacted>:shard:lowest:three:memory:utilization{sg=\"prod-cluster2\", shard=\"us14\", source=\"promxy-prod-local\"}": &prompb.Sample{
    Value: NaN,
    Timestamp: 1573149095294,
  },
  "<redacted>:shard:lowest:three:memory:utilization{sg=\"prod-cluster2\", shard=\"us18\", source=\"promxy-prod-local\"}": &prompb.Sample{
    Value: NaN,
    Timestamp: 1573149095294,
  },
}
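
For reference, the handler is roughly this shape (a minimal sketch with illustrative names, not the exporter's actual code; it assumes the exporter keeps its latest samples in a mutex-guarded map[string]*prompb.Sample, which matches the %#v-style dump above):

package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"

	"github.com/prometheus/prometheus/prompb"
)

// exporter loosely mirrors the state the exporter keeps: the parsed CLI
// options plus the latest sample seen for each series.
type exporter struct {
	mu      sync.Mutex
	opts    interface{}               // parsed CLI options (bind-addr, metric-ttl, ...)
	samples map[string]*prompb.Sample // latest sample per series string
}

// debugHandler dumps the options and the in-memory sample map in Go syntax
// (%#v), which is the format of the curl output above.
func (e *exporter) debugHandler(w http.ResponseWriter, r *http.Request) {
	e.mu.Lock()
	defer e.mu.Unlock()
	fmt.Fprintf(w, "%#v", e.opts)
	fmt.Fprintf(w, "%#v\n", e.samples)
}

func main() {
	e := &exporter{samples: map[string]*prompb.Sample{}}
	http.HandleFunc("/debug", e.debugHandler)
	log.Fatal(http.ListenAndServe(":8083", nil))
}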

Promxy Version: 0.0.54
Remote_write_exporter Version: 0.0.54 (modified to add a debug endpoint)
Prometheus Version: 2.13.0

Promxy Config:

##
## Regular prometheus configuration
##
global:
  evaluation_interval: 60s
  external_labels:
    source: promxy-prod-local
rule_files:
  - /etc/prometheus-rules/*.yaml
alerting:
  alert_relabel_configs:
  - source_labels: [alertroute]
    regex: noalert
    action: drop
  - regex: '__replica__'
    action: labeldrop
  alertmanagers:
  - dns_sd_configs:
    - names:
      - alertmanager.monitoring
      type: A
      port: 9093
# remote_write configuration is used by promxy as its local Appender, meaning all
# metrics promxy would "write" (not export) would be sent to this. Examples
# of this include: recording rules, metrics on alerting rules, etc.
remote_write:
  - url: http://localhost:8083/receive
##
### Promxy configuration
##
promxy:
  server_groups:
    # All upstream prometheus service discovery mechanisms are supported with the same
    # markup, all defined in https://github.com/prometheus/prometheus/blob/master/discovery/config/config.go#L33
    - static_configs:
        - targets:
          - prometheus-pod-0-cluster1.platform.<redacted>.com:443
          - prometheus-pod-1-cluster1.platform.<redacted>.com:443
      # labels to be added to metrics retrieved from this server_group
      labels:
        sg: prod-cluster1
      # anti-affinity for merging values in timeseries between hosts in the server_group
      anti_affinity: 20s
      # Controls whether to use remote_read or the prom HTTP API for fetching remote raw data
      remote_read: false
      # ignore_error will make the given server group's response "optional",
      # meaning if this servergroup returns an error and others don't, the
      # overall query can still succeed
      ignore_error: false
      # path_prefix defines a prefix to prepend to all queries to hosts in this servergroup
      # path_prefix: /example/prefix
      # options for promxy's HTTP client when talking to hosts in server_groups
      scheme: https
      http_client:
        # dial_timeout controls how long promxy will wait for a connection to the downstream
        # the default is 200ms.
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
    - static_configs:
        - targets:
          - prometheus-pod-0-cluster2.platform.<redacted>.com:443
          - prometheus-pod-1-cluster2.platform.<redacted>.com:443
      # labels to be added to metrics retrieved from this server_group
      labels:
        sg: prod-cluster2
      # anti-affinity for merging values in timeseries between hosts in the server_group
      anti_affinity: 20s
      # Controls whether to use remote_read or the prom HTTP API for fetching remote raw data
      remote_read: false
      # ignore_error will make the given server group's response "optional",
      # meaning if this servergroup returns an error and others don't, the
      # overall query can still succeed
      ignore_error: false
      # path_prefix defines a prefix to prepend to all queries to hosts in this servergroup
      # path_prefix: /example/prefix
      # options for promxy's HTTP client when talking to hosts in server_groups
      scheme: https
      http_client:
        # dial_timeout controls how long promxy will wait for a connection to the downstream
        # the default is 200ms.
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
    - static_configs:
        - targets:
          - prometheus-pod-0-cluster3.platform.<redacted>.com:443
          - prometheus-pod-1-cluster3.platform.<redacted>.com:443
      # labels to be added to metrics retrieved from this server_group
      labels:
        sg: prod-cluster3
      # anti-affinity for merging values in timeseries between hosts in the server_group
      anti_affinity: 20s
      # Controls whether to use remote_read or the prom HTTP API for fetching remote raw data
      remote_read: false
      # ignore_error will make the given server group's response "optional",
      # meaning if this servergroup returns an error and others don't, the
      # overall query can still succeed
      ignore_error: false
      # path_prefix defines a prefix to prepend to all queries to hosts in this servergroup
      # path_prefix: /example/prefix
      # options for promxy's HTTP client when talking to hosts in server_groups
      scheme: https
      http_client:
        # dial_timeout controls how long promxy will wait for a connection to the downstream
        # the default is 200ms.
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
    - static_configs:
        - targets:
          - prometheus-2-pod-0-cluster1.platform.<redacted>.com:443
          - prometheus-2-pod-1-cluster1.platform.<redacted>.com:443
      # labels to be added to metrics retrieved from this server_group
      labels:
        sg: prod-2-cluster1
      # anti-affinity for merging values in timeseries between hosts in the server_group
      anti_affinity: 20s
      # Controls whether to use remote_read or the prom HTTP API for fetching remote raw data
      remote_read: false
      # ignore_error will make the given server group's response "optional",
      # meaning if this servergroup returns an error and others don't, the
      # overall query can still succeed
      ignore_error: false
      # path_prefix defines a prefix to prepend to all queries to hosts in this servergroup
      # path_prefix: /example/prefix
      # options for promxy's HTTP client when talking to hosts in server_groups
      scheme: https
      http_client:
        # dial_timeout controls how long promxy will wait for a connection to the downstream
        # the default is 200ms.
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
    - static_configs:
        - targets:
          - prometheus-2-pod-0-cluster2.platform.<redacted>.com:443
          - prometheus-2-pod-1-cluster2.platform.<redacted>.com:443
      # labels to be added to metrics retrieved from this server_group
      labels:
        sg: prod-2-cluster2
      # anti-affinity for merging values in timeseries between hosts in the server_group
      anti_affinity: 20s
      # Controls whether to use remote_read or the prom HTTP API for fetching remote raw data
      remote_read: false
      # ignore_error will make the given server group's response "optional",
      # meaning if this servergroup returns an error and others don't, the
      # overall query can still succeed
      ignore_error: false
      # path_prefix defines a prefix to prepend to all queries to hosts in this servergroup
      # path_prefix: /example/prefix
      # options for promxy's HTTP client when talking to hosts in server_groups
      scheme: https
      http_client:
        # dial_timeout controls how long promxy will wait for a connection to the downstream
        # the default is 200ms.
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
    - static_configs:
        - targets:
          - prometheus-search-pod-0-cluster1.platform.<redacted>.com:443
          - prometheus-search-pod-1-cluster1.platform.<redacted>.com:443
      # labels to be added to metrics retrieved from this server_group
      labels:
        sg: prod-search-cluster1
      # anti-affinity for merging values in timeseries between hosts in the server_group
      anti_affinity: 20s
      # Controls whether to use remote_read or the prom HTTP API for fetching remote raw data
      remote_read: false
      # ignore_error will make the given server group's response "optional",
      # meaning if this servergroup returns an error and others don't, the
      # overall query can still succeed
      ignore_error: false
      # path_prefix defines a prefix to prepend to all queries to hosts in this servergroup
      # path_prefix: /example/prefix
      # options for promxy's HTTP client when talking to hosts in server_groups
      scheme: https
      http_client:
        # dial_timeout controls how long promxy will wait for a connection to the downstream
        # the default is 200ms.
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
    - static_configs:
        - targets:
          - prometheus-search-pod-0-cluster2.platform.<redacted>.com:443
          - prometheus-search-pod-1-cluster2.platform.<redacted>.com:443
      # labels to be added to metrics retrieved from this server_group
      labels:
        sg: prod-search-cluster2
      # anti-affinity for merging values in timeseries between hosts in the server_group
      anti_affinity: 20s
      # Controls whether to use remote_read or the prom HTTP API for fetching remote raw data
      remote_read: false
      # ignore_error will make the given server group's response "optional",
      # meaning if this servergroup returns an error and others don't, the
      # overall query can still succeed
      ignore_error: false
      # path_prefix defines a prefix to prepend to all queries to hosts in this servergroup
      # path_prefix: /example/prefix
      # options for promxy's HTTP client when talking to hosts in server_groups
      scheme: https
      http_client:
        # dial_timeout controls how long promxy will wait for a connection to the downstream
        # the default is 200ms.
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
    - static_configs:
        - targets:
          - prometheus-search-pod-0-cluster3.platform.<redacted>.com:443
          - prometheus-search-pod-1-cluster3.platform.<redacted>.com:443
      # labels to be added to metrics retrieved from this server_group
      labels:
        sg: prod-search-cluster3
      # anti-affinity for merging values in timeseries between hosts in the server_group
      anti_affinity: 20s
      # Controls whether to use remote_read or the prom HTTP API for fetching remote raw data
      remote_read: false
      # ignore_error will make the given server group's response "optional",
      # meaning if this servergroup returns an error and others don't, the
      # overall query can still succeed
      ignore_error: false
      # path_prefix defines a prefix to prepend to all queries to hosts in this servergroup
      # path_prefix: /example/prefix
      # options for promxy's HTTP client when talking to hosts in server_groups
      scheme: https
      http_client:
        # dial_timeout controls how long promxy will wait for a connection to the downstream
        # the default is 200ms.
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true

Additional Info:

I'm not sure where to go from here, since the query page always seems to return the correct results with no gaps, and debug logging doesn't produce anything useful.

Thanks again for this incredible project! Let me know if you need any more debugging info

jacksontj commented 4 years ago

Thanks for the detailed report. I have spent some time over the weekend attempting to reproduce this and was partially successful. I set up a single prometheus server behind promxy with the following recording rule:

groups:
- name: test
  rules:
  - record: test_record_metric
    expr: bottomk(1, prometheus_engine_query_duration_seconds_sum)

and the following prometheus config:

global:
  scrape_interval: "20s"
  evaluation_interval: "5s"
  scrape_timeout: "5s"

scrape_configs:
- job_name: "prom"
  metrics_path: "/metrics"
  static_configs:
    - targets:
      - 127.0.0.1:9090
- job_name: "remote_write_exporter"
  metrics_path: "/metrics"
  scrape_interval: 1m
  static_configs:
    - targets:
      - 127.0.0.1:8083

And with this configuration I was unable to reproduce the behavior. However, once I started setting --query.lookback-delta to a much shorter value (the default is 5m), I was able to reproduce the behavior you described. I'm not sure whether this is the same situation you're in, but the lookback-delta explanation makes sense: Prometheus has a single lookback-delta configured, and in this case the "source" data arrives every 20s whereas the recorded data arrives every 1m, so if the lookback-delta is shorter than a minute you can get holes in the graphs.
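
To make the mechanics concrete, here is a simplified sketch of the lookback rule (not the actual engine code): at evaluation time t, a series only contributes a value if its most recent sample is no older than t minus the lookback delta.

package main

import (
	"fmt"
	"time"
)

type sample struct {
	ts    time.Time
	value float64
}

// valueAt mimics, in simplified form, an instant-vector lookup: return the
// newest sample at or before t, but only if it is within the lookback window.
func valueAt(samples []sample, t time.Time, lookback time.Duration) (float64, bool) {
	for i := len(samples) - 1; i >= 0; i-- {
		if samples[i].ts.After(t) {
			continue
		}
		if t.Sub(samples[i].ts) <= lookback {
			return samples[i].value, true
		}
		return 0, false // newest sample is too old: a hole in the graph
	}
	return 0, false
}

func main() {
	start := time.Unix(0, 0)
	// A recorded series written once per minute...
	recorded := []sample{{start, 1}, {start.Add(time.Minute), 2}}
	// ...evaluated 45s after its last write: present with the default 5m
	// delta, a hole with a 30s delta.
	t := start.Add(time.Minute + 45*time.Second)
	_, okDefault := valueAt(recorded, t, 5*time.Minute)
	_, okShort := valueAt(recorded, t, 30*time.Second)
	fmt.Println(okDefault, okShort) // true false
}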

@glightfoot Do you have --query.lookback-delta set on either prometheus or promxy? If not, any ideas on the delta between our setups?

glightfoot commented 4 years ago

Thanks for taking a look! No, I don't have lookback-delta configured anywhere, but perhaps this has to do with metric-ttl? Here are the configs for our Prometheus, Promxy, and remote-write-exporter containers:

Prometheus

      - name: prometheus
        image: {{ $.Values.prometheusImage }}
        imagePullPolicy: IfNotPresent
        command:
        - prometheus
        args:
        - --config.file=/etc/prometheus/prometheus.yaml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention.time=90d
        - --storage.tsdb.wal-compression
        - --web.external-url={{ $.Values.externalUrl }}
        - --web.enable-lifecycle
        - --web.enable-admin-api
        - --log.format=json

Promxy

      - name: promxy
        image: {{ $.Values.promxyImage }}
        imagePullPolicy: IfNotPresent
        command:
        - /usr/bin/promxy
        args:
        - --config=/etc/promxy/config.yaml
        - --bind-addr=":{{ $.Values.promxyPort }}"
        - --web.external-url={{ $.Values.externalUrl }}
        - --web.enable-lifecycle
        - --query.max-samples=50000000
        - --query.timeout=3m
        - --access-log-destination=stdout
        - --log-level=info
        - --http.shutdown-timeout=60s

RWE

      - name: remote-write-exporter
        image: {{ $.Values.promxyImage }}
        imagePullPolicy: IfNotPresent
        command:
        - /usr/bin/remote_write_exporter_mc
        args:
        - --bind-addr=":{{ $.Values.remoteWriteExporterPort }}"
        - --metric-ttl=5m

jacksontj commented 4 years ago

The TTL there should be fine as long as it's longer than the expected send interval; in this case 5m is sufficient, since it's much longer than the expected 1m.
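
Conceptually, the TTL just drops map entries whose latest sample is older than now minus the TTL, something like this sketch (hypothetical names, not the actual remote_write_exporter code):

package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/prometheus/prometheus/prompb"
)

type exporter struct {
	mu      sync.Mutex
	samples map[string]*prompb.Sample
}

// expire drops series whose latest sample timestamp (milliseconds since the
// epoch, as in remote write) is older than now minus the TTL, so a series
// that stops being written disappears after at most one TTL interval.
func (e *exporter) expire(ttl time.Duration, now time.Time) {
	e.mu.Lock()
	defer e.mu.Unlock()
	cutoffMs := now.Add(-ttl).UnixNano() / int64(time.Millisecond)
	for series, s := range e.samples {
		if s.Timestamp < cutoffMs {
			delete(e.samples, series)
		}
	}
}

func main() {
	e := &exporter{samples: map[string]*prompb.Sample{
		`up{job="x"}`: {Value: 1, Timestamp: 0},
	}}
	// Ten minutes later with a 5m TTL, the stale entry is dropped.
	e.expire(5*time.Minute, time.Unix(600, 0))
	fmt.Println(len(e.samples)) // 0
}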

Unfortunately I'm not seeing any differences in our configs (other than metric retention). Are you able to repro locally with containers or something?

jacksontj commented 4 years ago

I'm going to close this one out as it's been silent for a while. If this is still an issue, please feel free to re-open.