Closed: bygui86 closed this issue 4 months ago
Given that the error is connect: connection refused
-- I expect the error is with the routing and/or the promxy node not being up. Nothing in what you have provided indicates an error, but there isn't a lot to go off of. If there is some error in promxy (e.g. in the log) that would be helpful, but based on what is provided it just seems like some routing/connectivity issue.
@jacksontj thanks for the answer. When I added promxy as a Grafana Prometheus DataSource I received a green message after clicking "Save & Test", so I don't understand why I then receive a "connection refused" error while exploring metrics. I will try again and collect some promxy and maybe Grafana logs...
Can you please confirm that promxy supports the path "/api/v1/query_range"?
Aside, does promxy support also Grafana metrics autocompletion?
Can you please confirm that promxy supports the path "/api/v1/query_range"?
Yes, this is one of the primary query endpoints for the prometheus API
Aside, does promxy support also Grafana metrics autocompletion?
Yes, as these are implemented using 'standard' prometheus API endpoints.
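For a quick sanity check outside Grafana, you could hit those endpoints on promxy directly (the `promxy:8082` host/port below is assumed from the error message later in this thread; adjust to your service):

```shell
# Range query -- the same endpoint Grafana Explore uses
curl -s 'http://promxy:8082/api/v1/query_range?query=up&start=2023-11-02T20:00:00Z&end=2023-11-02T20:05:00Z&step=15s'

# Metric-name autocompletion in Grafana is backed by the label-values endpoint
curl -s 'http://promxy:8082/api/v1/label/__name__/values'
```

If these succeed from a pod inside the cluster but Grafana still reports "connection refused", the problem is between Grafana and promxy rather than in promxy itself.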
I received green message after clicking "Save & Test", so I don't understand why then I receive a "connection refused" error while exploring metrics.
Based on that I'm guessing there is some panic (causing the process to die), the stdout of promxy should have more information there.
@jacksontj I'm not sure I've found the issue, but here's some progress down the rabbit hole...
Performing a simple query from Grafana Explore for the grafana_build_info or go_info metrics, I got one successful result, but I noticed that promxy got restarted on almost every query.
and here are the K8s events
and I was able to collect some logs first at "info" level, then at "trace" level:
PLEASE NOTE: I had to truncate the beginning of the trace-level logs because the file was bigger than 25MB.
I thought "ok, maybe it's a lack of resources!", so I increased them up to 1 CPU / 1Gi, but nothing changed :(
I hope that everything here could be helpful for better debugging!
In general I noticed that queries through Promxy are very very slow compared to regular Prometheus... is that something expected?
Since you are running in k8s: when it fails and restarts, could you look at the kubectl describe pod output for the pod? The describe output will have details on why the pod restarted (OOM, exit code, etc.).
From looking at the logs I'm not seeing anything obvious as to why it would have crashed (no panic log messages), so maybe a resource limit? If so, the describe output should have additional information.
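For example (pod name and namespace below are placeholders for your actual promxy pod):

```shell
# "Last State" and "Events" show the restart reason (OOMKilled, exit code, etc.)
kubectl describe pod promxy-xxxxx -n monitoring

# If the process panicked, the previous container's stdout usually holds the traceback
kubectl logs promxy-xxxxx -n monitoring --previous
```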
From looking at your trace logs, here are a few things to look at (likely not causing this issue, but probably not helping):
time="2023-11-02T20:55:10Z" level=trace msg="http://10.244.120.71:9443" api=QueryRange error="bad_response: readObjectStart: expect { or n, but found C, error found in #1 byte of ...|Client sent|..., bigger context ...|Client sent an HTTP request to an HTTPS server.\n|..." query=grafana_build_info r="{2023-11-02 19:54:45 +0000 UTC 2023-11-02 20:54:45 +0000 UTC 15s}" took="537.834µs" value="" warnings="[]"
This is trying to talk to an https downstream over http (incorrect config?)
time="2023-11-02T20:55:05Z" level=trace msg="http://10.244.120.72:8080" api=QueryRange error="Post \"http://10.244.120.72:8080/api/v1/query_range\": dial tcp 10.244.120.72:8080: connect: connection refused" query=grafana_build_info r="{2023-11-02 19:54:45 +0000 UTC 2023-11-02 20:54:45 +0000 UTC 15s}" took="45.75µs" value="" warnings="[]"
Connection refused; seeing these a lot in the logs
time="2023-11-02T20:55:11Z" level=trace msg="http://10.244.120.75:8080" api=QueryRange error="client_error: client error: 404" query=grafana_build_info r="{2023-11-02 19:54:45 +0000 UTC 2023-11-02 20:54:45 +0000 UTC 15s}" took=2.256ms value="" warnings="[]"
a 404 on a downstream, also maybe incorrect configuration?
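If some of those downstreams really are HTTPS-only or serve the API under a path prefix, the server group can say so explicitly. This is an untested sketch; promxy's server group options largely mirror prometheus scrape configs, and `scheme`/`path_prefix` are assumptions to verify against the promxy README:

```yaml
promxy:
  server_groups:
    - kubernetes_sd_configs:
        - role: pod
      # For the downstream answering "Client sent an HTTP request to an HTTPS server"
      scheme: https
      # If the downstream serves its API under a prefix (one possible cause of 404s)
      path_prefix: /prometheus
```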
In general I noticed that queries through Promxy are very very slow compared to regular Prometheus... is that something expected?
In general promxy is "as fast as the slowest downstream". From the logs here it seems that there may be a number of failing or otherwise bad downstreams -- which could be causing issues. Similarly, if there are resource constraints they could be causing performance impact as well. So IMO once it's working in a stable way we can probably make a better determination around performance.
@jacksontj I have no special configuration in Grafana. Just regular ones like:
How should I configure promxy to properly reach the 2 promethei?
Server groups are Prometheus endpoints. The config seems to be configuring all pods as members of the server group, which seems unlikely (unless the cluster has only one server group of Prometheus). Likely you'll want 2 server groups, one for "infra" and one for "apps".
The config seems to be configuring all pods as members of the server group, which seems unlikely (unless the cluster has only one server group of Prometheus). Likely you'll want 2 server groups, one for "infra" and one for "apps".
@jacksontj I used the prometheus-operator to deploy 2 different Promethei (using the Prometheus CRD) configured in 2 different ways: "infra" to scrape metrics from node-exporter, kube-state-metrics, grafana and so on; "apps" to scrape metrics from all custom applications. I have 2 different K8s services and the Prometheus configurations are different. So I don't understand why promxy identifies only 1 ServerGroup.
Below are the Prometheus CRDs I used:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: infra
spec:
  image: quay.io/prometheus/prometheus:v2.47.2
  imagePullPolicy: IfNotPresent
  replicas: 1
  serviceAccountName: prometheus
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  resources:
    requests:
      cpu: 1
      memory: 1Gi
    limits:
      cpu: 1
      memory: 1Gi
  storage:
    volumeClaimTemplate:
      metadata:
        name: prometheus-infra
        labels:
          app: prometheus
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    name: prometheus
    labels:
      app: prometheus
      domain: monitoring
      components: infra
  containers:
    - name: config-reloader
      resources:
        requests:
          cpu: 10m
          memory: 16Mi
        limits:
          cpu: 25m
          memory: 16Mi
  logFormat: json
  logLevel: info
  enableFeatures: []
  externalLabels:
    components: infra
  retention: 3d
  retentionSize: 4Gi
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      components: infra # PLEASE NOTE
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      components: infra # PLEASE NOTE
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      components: infra # PLEASE NOTE
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      components: infra # PLEASE NOTE
  alerting:
    alertmanagers: []
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: apps
spec:
  image: quay.io/prometheus/prometheus:v2.47.2
  imagePullPolicy: IfNotPresent
  replicas: 1
  serviceAccountName: prometheus
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  resources:
    requests:
      cpu: 1
      memory: 1Gi
    limits:
      cpu: 1
      memory: 1Gi
  storage:
    volumeClaimTemplate:
      metadata:
        name: prometheus-apps
        labels:
          app: prometheus
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    name: prometheus
    labels:
      app: prometheus
      domain: monitoring
      components: apps
  containers:
    - name: config-reloader
      resources:
        requests:
          cpu: 10m
          memory: 16Mi
        limits:
          cpu: 25m
          memory: 16Mi
  logFormat: json
  logLevel: info
  enableFeatures: []
  externalLabels:
    components: apps
  retention: 3d
  retentionSize: 4Gi
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      components: apps # PLEASE NOTE
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      components: apps # PLEASE NOTE
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      components: apps # PLEASE NOTE
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      components: apps # PLEASE NOTE
  alerting:
    alertmanagers: []
All ServiceMonitor CRDs are then labelled with either components: apps or components: infra.
So I don't understand why promxy identifies only 1 ServerGroup.
I believe I can explain this relatively simply; let's take a look at the configuration you have:
promxy:
  server_groups:
    - kubernetes_sd_configs:
        - role: pod
In this config we have 1 servergroup configured using kubernetes_sd_configs, which includes all pods in the cluster. This has 2 issues: (1) we have a single servergroup configured and (2) that single servergroup contains all pods within the k8s cluster (not just the subset of prometheus hosts).
The kubernetes configuration within prometheus is a bit odd -- but at a high level, scoping within the role is done through relabel_configs. Some examples can be found here -- but generally you'd want to end up with something like:
promxy:
  server_groups:
    # Servergroup 1 -- for apps prometheus
    - kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          action: keep
          regex: prometheus
        - source_labels: [__meta_kubernetes_pod_label_domain]
          action: keep
          regex: monitoring
        - source_labels: [__meta_kubernetes_pod_label_components]
          action: keep
          regex: apps # keep only pods labelled components=apps
      labels:
        serverGroup: appServerGroup
    # Servergroup 2 -- for infra prometheus
    - kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          action: keep
          regex: prometheus
        - source_labels: [__meta_kubernetes_pod_label_domain]
          action: keep
          regex: monitoring
        - source_labels: [__meta_kubernetes_pod_label_components]
          action: keep
          regex: infra # keep only pods labelled components=infra
      labels:
        serverGroup: infraServerGroup
To be clear, I haven't tested the above configuration, but from reading your comments and the prom docs it seems roughly correct. Hopefully that can help guide you down the correct path in configuring promxy :)
@jacksontj thanks a lot for the suggestion! tested or not it's a good starting point!
In this config we have 1 servergroup configured using kubernetes_sd_configs which includes all pods in the cluster. This has 2 issues (1) we have a single servergroup configured and (2) that single servergroup contains all pods within the k8s cluster (not just a subset of prometheus hosts).
To be honest I don't see any part of the promxy doc explaining such issues or possible configurations... they should be added because it's pretty cumbersome otherwise :(
The kubernetes configuration within prometheus is a bit odd
What do you mean exactly here? Which config is odd? Where within Prometheus?
To be honest I don't see any part of the promxy doc explaining such issues or possible configurations... they should be added because it's pretty cumbersome otherwise :(
This is true, but I haven't covered this in the docs because these are exactly the same configuration options as scrape configs in prometheus -- so the "docs" are just a link over there :)
What do you mean exactly here? Which config is odd? Where within Prometheus?
Specifically, the varying sd configs are a bit odd since they are generic. So it's all about relabel_configs for filtering (whereas k8s would generally have you use label matchers instead). I specifically stuck with prometheus' config here (1) to reduce code drift and (2) to be consistent for prometheus users.
Hi guys,
I deployed Promxy on K8s (minikube) along with 2 Promethei (both managed by prometheus-operator) and Grafana.
I see metrics properly in both Promethei and I added Promxy as Grafana DataSource without issues. But when I try to fetch some metrics from Grafana Explore I get this error:
Post "http://promxy:8082/api/v1/query_range": dial tcp 10.104.63.185:8082: connect: connection refused
As you can see in the screenshot below, I'm perfectly able to fetch metrics from the Prometheus UI.
Here is the promxy config.yaml:
Some versions:
thanks for any help!