jacksontj / promxy

An aggregating proxy to enable HA prometheus
MIT License

Grafana Explore not able to get any metric #622

Closed bygui86 closed 4 months ago

bygui86 commented 8 months ago

Hi guys,

I deployed Promxy on K8s (minikube) along with 2 Promethei (both managed by prometheus-operator) and Grafana.

I see metrics properly in both Promethei, and I added Promxy as a Grafana DataSource without issues. But when I try to fetch some metrics from Grafana Explore I get this error: Post "http://promxy:8082/api/v1/query_range": dial tcp 10.104.63.185:8082: connect: connection refused

Screenshot 2023-10-16 at 20 33 56

As you can see in the screenshot below, I'm perfectly able to fetch metrics from the Prometheus UI

Screenshot 2023-10-16 at 20 34 05

Here is the Promxy config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: promxy
data:
  config.yaml: |
    ##
    ### Regular prometheus configuration
    ##
    global:
      evaluation_interval: 5s
      external_labels:
        source: promxy

    # remote_write configuration is used by promxy as its local Appender, meaning all
    # metrics promxy would "write" (not export) would be sent to this. Examples
    # of this include: recording rules, metrics on alerting rules, etc.
    remote_write:
      - url: http://localhost:8083/receive

    ##
    ### Promxy configuration
    ##
    promxy:
      server_groups:
        - kubernetes_sd_configs:
            - role: pod

Some versions:

thanks for any help!

jacksontj commented 8 months ago

Given that the error is connect: connection refused -- I expect the problem is with routing and/or the promxy node not being up. Nothing in what you have provided seems to indicate an error in promxy itself, but there isn't a lot to go off of. If there is some error in promxy (e.g. in the log) that would be helpful, but based on what is provided it just seems like a routing/connectivity issue.

bygui86 commented 8 months ago

@jacksontj thanks for the answer. When I added promxy as a Grafana Prometheus DataSource I received a green message after clicking "Save & Test", so I don't understand why I then receive a "connection refused" error while exploring metrics. I will try again and collect some promxy and maybe Grafana logs...

Can you please confirm that promxy supports the path "/api/v1/query_range"?

As an aside, does promxy also support Grafana metrics autocompletion?

jacksontj commented 8 months ago

Can you please confirm that promxy supports the path "/api/v1/query_range"?

Yes, this is one of the primary query endpoints for the prometheus API

As an aside, does promxy also support Grafana metrics autocompletion?

Yes, as these are implemented using 'standard' prometheus API endpoints.

I received green message after clicking "Save & Test", so I don't understand why then I receive a "connection refused" error while exploring metrics.

Based on that I'm guessing there is some panic (causing the process to die); the stdout of promxy should have more information there.

bygui86 commented 8 months ago

@jacksontj I'm not sure I have found the issue, but here is some progress down the rabbit hole...

Performing a simple query from Grafana Explore for the grafana_build_info or go_info metrics, I got one successful response, but I noticed that Promxy got restarted on almost every query.

Screenshot 2023-11-02 at 21 35 29

Screenshot 2023-11-02 at 21 42 28

Screenshot 2023-11-02 at 21 42 42

and here are the K8s events:

Screenshot 2023-11-02 at 21 49 26

and I was able to collect some logs first at "info" level, then at "trace" level:

PLEASE NOTE: I had to truncate the beginning of the trace-level logs because the file was bigger than 25MB.

I thought "ok maybe it's a lack of resources!", so I increased them till 1 CPU / 1Gi but nothing changed :(

I hope everything here is helpful for debugging!

In general I noticed that queries through Promxy are very, very slow compared to regular Prometheus... is that something to be expected?

jacksontj commented 8 months ago

Since you are running in k8s: when it fails and restarts, could you look at the kubectl describe pod output for the pod? The describe output there will have details on why the pod restarted (OOM, exit code, etc.).

From looking at the logs I'm not seeing anything obvious as to why it would have crashed (no panic log messages), so maybe a resource limit? If so the describe output should have additional information.

From looking at your trace logs, here are a few things to look at (likely not causing this issue, but probably not helping):

time="2023-11-02T20:55:10Z" level=trace msg="http://10.244.120.71:9443" api=QueryRange error="bad_response: readObjectStart: expect { or n, but found C, error found in #1 byte of ...|Client sent|..., bigger context ...|Client sent an HTTP request to an HTTPS server.\n|..." query=grafana_build_info r="{2023-11-02 19:54:45 +0000 UTC 2023-11-02 20:54:45 +0000 UTC 15s}" took="537.834µs" value="" warnings="[]"

This is promxy trying to talk to an HTTPS downstream over plain HTTP (incorrect config? see the note after these log excerpts)

time="2023-11-02T20:55:05Z" level=trace msg="http://10.244.120.72:8080" api=QueryRange error="Post \"http://10.244.120.72:8080/api/v1/query_range\": dial tcp 10.244.120.72:8080: connect: connection refused" query=grafana_build_info r="{2023-11-02 19:54:45 +0000 UTC 2023-11-02 20:54:45 +0000 UTC 15s}" took="45.75µs" value="" warnings="[]"

Connection refused; seeing these a lot in the logs

time="2023-11-02T20:55:11Z" level=trace msg="http://10.244.120.75:8080" api=QueryRange error="client_error: client error: 404" query=grafana_build_info r="{2023-11-02 19:54:45 +0000 UTC 2023-11-02 20:54:45 +0000 UTC 15s}" took=2.256ms value="" warnings="[]"

a 404 from a downstream, also maybe an incorrect configuration?
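On the first excerpt (an HTTP request sent to an HTTPS server): if a downstream really does serve the Prometheus API over TLS, the server group has to be told so. A minimal, untested sketch, assuming the server_group fields shown in the promxy example config (scheme and http_client.tls_config) and a hypothetical server group; insecure_skip_verify is for testing only:

    promxy:
      server_groups:
        # hypothetical group whose Prometheus listens on HTTPS
        - kubernetes_sd_configs:
            - role: pod
          scheme: https                      # talk TLS to the downstream instead of plain HTTP
          http_client:
            tls_config:
              insecure_skip_verify: true     # testing only; configure a proper CA in real setups
          labels:
            serverGroup: httpsGroup          # hypothetical label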

In general I noticed that queries through Promxy are very, very slow compared to regular Prometheus... is that something to be expected?

In general promxy is "as fast as the slowest downstream". From the logs here it seems that there may be a number of failing or otherwise bad downstreams, which could be causing issues. Similarly, if there are resource constraints they could be causing a performance impact as well. So IMO once it's working in a stable way we can probably make a better determination around performance.

bygui86 commented 7 months ago

@jacksontj I have no special configuration in Grafana. Just regular ones like:

How should I configure promxy to properly reach the 2 promethei?

jacksontj commented 7 months ago

Server groups are sets of Prometheus endpoints. The config seems to be configuring all pods in the cluster as members of a single server group, which seems unlikely to be what you want (unless every pod in the cluster is a Prometheus server). Likely you'll want 2 server groups: one for "infra" and one for "apps".

bygui86 commented 7 months ago

The config seems to be configuring all pods in the cluster as members of a single server group, which seems unlikely to be what you want (unless every pod in the cluster is a Prometheus server). Likely you'll want 2 server groups: one for "infra" and one for "apps".

@jacksontj I used the prometheus-operator to deploy 2 different Promethei (using the Prometheus CRD), configured in 2 different ways: "infra" scrapes metrics from node-exporter, kube-state-metrics, grafana and so on; "apps" scrapes metrics from all custom applications. I have 2 different K8s Services and the Prometheus configurations are different. So I don't understand why promxy identifies only 1 server group.

Below are the Prometheus CRDs I used:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: infra
spec:
  image: quay.io/prometheus/prometheus:v2.47.2
  imagePullPolicy: IfNotPresent
  replicas: 1
  serviceAccountName: prometheus
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  resources:
    requests:
      cpu: 1
      memory: 1Gi
    limits:
      cpu: 1
      memory: 1Gi
  storage:
    volumeClaimTemplate:
      metadata:
        name: prometheus-infra
        labels:
          app: prometheus
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    name: prometheus
    labels:
      app: prometheus
      domain: monitoring
      components: infra
  containers:
    - name: config-reloader
      resources:
        requests:
          cpu: 10m
          memory: 16Mi
        limits:
          cpu: 25m
          memory: 16Mi
  logFormat: json
  logLevel: info
  enableFeatures: []
  externalLabels:
    components: infra
  retention: 3d
  retentionSize: 4Gi
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      components: infra   # PLEASE NOTE
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      components: infra   # PLEASE NOTE
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      components: infra   # PLEASE NOTE
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      components: infra   # PLEASE NOTE
  alerting:
    alertmanagers: []

---

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: apps
spec:
  image: quay.io/prometheus/prometheus:v2.47.2
  imagePullPolicy: IfNotPresent
  replicas: 1
  serviceAccountName: prometheus
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  resources:
    requests:
      cpu: 1
      memory: 1Gi
    limits:
      cpu: 1
      memory: 1Gi
  storage:
    volumeClaimTemplate:
      metadata:
        name: prometheus-apps
        labels:
          app: prometheus
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    name: prometheus
    labels:
      app: prometheus
      domain: monitoring
      components: apps
  containers:
    - name: config-reloader
      resources:
        requests:
          cpu: 10m
          memory: 16Mi
        limits:
          cpu: 25m
          memory: 16Mi
  logFormat: json
  logLevel: info
  enableFeatures: []
  externalLabels:
    components: apps
  retention: 3d
  retentionSize: 4Gi
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      components: apps   # PLEASE NOTE
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      components: apps   # PLEASE NOTE
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      components: apps   # PLEASE NOTE
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      components: apps   # PLEASE NOTE
  alerting:
    alertmanagers: []

All ServiceMonitor CRDs are then labelled with either components: apps or components: infra.
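
For context, here is a minimal sketch of how one of those ServiceMonitors looks (the name, selector and port are hypothetical; only the components label matters for the selectors above):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                 # hypothetical name
  labels:
    components: apps           # matched by the "apps" Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app              # hypothetical Service label
  endpoints:
    - port: http-metrics       # hypothetical named port on the Service
      interval: 15s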

jacksontj commented 6 months ago

So I don't understand why promxy identifies only 1 server group.

I believe I can explain this relatively simply; let's take a look at the configuration you have:

    promxy:
      server_groups:
        - kubernetes_sd_configs:
            - role: pod

In this config we have 1 servergroup configured using kubernetes_sd_configs, and it includes all pods in the cluster. This has 2 issues: (1) there is only a single servergroup configured, and (2) that single servergroup contains all pods within the k8s cluster (not just the subset of pods that are Prometheus hosts).

The kubernetes configuration within prometheus is a bit odd -- but at a high level, scoping within the role is done through relabel_configs. Some examples can be found here -- but generally you'd want to end up with something like:

    promxy:
      server_groups:
        # Servergroup 1 -- for the apps prometheus
        - kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: prometheus
            - source_labels: [__meta_kubernetes_pod_label_domain]
              action: keep
              regex: monitoring
            - source_labels: [__meta_kubernetes_pod_label_components]
              action: keep
              regex: apps
          labels:
            serverGroup: appServerGroup
        # Servergroup 2 -- for the infra prometheus
        - kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: prometheus
            - source_labels: [__meta_kubernetes_pod_label_domain]
              action: keep
              regex: monitoring
            - source_labels: [__meta_kubernetes_pod_label_components]
              action: keep
              regex: infra
          labels:
            serverGroup: infraServerGroup

To be clear, I haven't tested the above configuration, but from reading your comments and the prom docs it seems roughly correct. Hopefully that can help guide you down the correct path for configuring promxy :)
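
As an alternative: since there are already dedicated k8s Services in front of the two Promethei, you could skip pod discovery entirely and point each server group at its Service via static_configs. Another untested sketch, with hypothetical Service DNS names/namespace and the default 9090 port:

    promxy:
      server_groups:
        # "apps" Prometheus, reached through its k8s Service
        - static_configs:
            - targets:
                - prometheus-apps.monitoring.svc:9090     # hypothetical Service DNS name
          labels:
            serverGroup: appServerGroup
        # "infra" Prometheus, reached through its k8s Service
        - static_configs:
            - targets:
                - prometheus-infra.monitoring.svc:9090    # hypothetical Service DNS name
          labels:
            serverGroup: infraServerGroup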

bygui86 commented 6 months ago

@jacksontj thanks a lot for the suggestion! Tested or not, it's a good starting point!

In this config we have 1 servergroup configured using kubernetes_sd_configs, and it includes all pods in the cluster. This has 2 issues: (1) there is only a single servergroup configured, and (2) that single servergroup contains all pods within the k8s cluster (not just the subset of pods that are Prometheus hosts).

To be honest, I don't see any part of the promxy docs explaining these issues or possible configurations... it should be added, because otherwise this is pretty cumbersome :(

The kubernetes configuration within prometheus is a bit odd

What do you mean exactly here? Which config is odd? Where within Prometheus?

jacksontj commented 6 months ago

To be honest, I don't see any part of the promxy docs explaining these issues or possible configurations... it should be added, because otherwise this is pretty cumbersome :(

This is true, but I haven't covered this in the docs because these are exactly the same configuration options as prometheus' scrape configs -- so the "docs" are just a link over there :)

What do you mean exactly here? Which config is odd? Where within Prometheus?

Specifically that the various sd configs are a bit odd since they are generic, so it's all about relabel_configs for filtering (whereas k8s would generally have you use label matchers instead; see the sketch below). I specifically stuck with prometheus' config here (1) to reduce code drift and (2) to be consistent for prometheus users.
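
For illustration, here is the same "only pods labelled app=prometheus" intent expressed both ways; the k8s-style snippet is just how a generic label selector looks (e.g. in a Deployment or Service), not promxy configuration:

    # Prometheus-style filtering: keep only discovered targets whose pod label matches
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: prometheus

    # Kubernetes-style filtering: a label selector as other k8s objects express it
    selector:
      matchLabels:
        app: prometheus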