[kube-prometheus-stack] grafana: Readiness probe failed: connect: connection refused #4251

Open AndreasMurk opened 7 months ago

AndreasMurk commented 7 months ago

Describe the bug

Hi!

I have deployed the kube-prometheus-stack using FluxCD with the latest 56.6.2 version.

Prometheus along with Loki works fine. However, Grafana has some problems after a while.

It took approximately 60 minutes to start up fully, until all migrations were done. Then, whenever I make changes in the dashboard (e.g. adding a new data source), the pod fails. After inspecting the logs I found these error messages:

{"time": "2024-02-14T15:50:37.062173+00:00", "taskName": null, "msg": "Writing /tmp/dashboards/apiserver.json (ascii)", "level": "INFO"}
{"time": "2024-02-14T15:50:37.065761+00:00", "taskName": null, "msg": "Retrying (Retry(total=4, connect=9, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f8f80>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:50:39.266982+00:00", "taskName": null, "msg": "Retrying (Retry(total=3, connect=8, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f90a0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:50:43.669076+00:00", "taskName": null, "msg": "Retrying (Retry(total=2, connect=7, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f9340>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:50:52.471752+00:00", "taskName": null, "msg": "Retrying (Retry(total=1, connect=6, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f96a0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:51:10.074029+00:00", "taskName": null, "msg": "Retrying (Retry(total=0, connect=5, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f9820>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:51:10.076283+00:00", "taskName": null, "msg": "Received unknown exception: HTTPConnectionPool(host='localhost', port=3000): Max retries exceeded with url: /api/admin/provisioning/dashboards/reload (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f9a90>: Failed to establish a new connection: [Errno 111] Connection refused'))\n", "level": "ERROR"}
Traceback (most recent call last):
  File "/app/.venv/lib/python3.12/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/app/.venv/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The pod tries to restart but fails with the aforementioned error. In Lens it always says: Readiness probe failed: Get "http://192.168.1.247:3000/api/health": dial tcp 192.168.1.247:3000: connect: connection refused
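
If the long migration phase is what keeps Grafana from ever answering /api/health in time, one workaround worth trying is to relax the probe timings on the Grafana sub-chart. This is only a minimal sketch, assuming the bundled Grafana chart exposes its usual readinessProbe/livenessProbe values; the numbers are illustrative and not taken from this issue:

grafana:
  readinessProbe:
    httpGet:
      path: /api/health
      port: 3000
    # illustrative values; tune to how long your migrations actually take
    initialDelaySeconds: 60
    failureThreshold: 30
  livenessProbe:
    httpGet:
      path: /api/health
      port: 3000
    initialDelaySeconds: 120
    failureThreshold: 30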

What's your helm version?

3.14.0

What's your kubectl version?

1.29.1

Which chart?

kube-prometheus-stack

What's the chart version?

56.6.2

What happened?

Making changes in the dashboard (e.g. adding new data sources such as Loki) fails with the Python error shown above.

I have also noticed that since the newest release the dashboard seems slower than with previous releases.

What you expected to happen?

The dashboard should correctly set the data source.

How to reproduce it?

  1. Enable Grafana and Loki in values.yaml (a minimal sketch follows this list)
  2. Deploy using FluxCD or helm
  3. Add new Loki Datasource
  4. Check if Dashboard / Pod is still running
  5. Additionally check logs
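
As a rough illustration of steps 1 and 3, a minimal values.yaml sketch: grafana.additionalDataSources is a chart key, but the Loki service URL below is a hypothetical placeholder, not taken from this report.

grafana:
  enabled: true
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      # hypothetical in-cluster Loki endpoint; replace with your release/namespace
      url: http://loki-gateway.monitoring.svc.cluster.local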

Enter the changed values of values.yaml?

prometheus:
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-issuer"
      kubernetes.io/ingressClassName: nginx
      nginx.ingress.kubernetes.io/service-upstream: "true"
      # nginx-http-auth config:
      nginx.ingress.kubernetes.io/auth-type: basic
      # the name of the secret that contains the htpasswd hash (has to exist beforehand)
      nginx.ingress.kubernetes.io/auth-secret: prometheus-htpasswd
      # message to display on auth missing:
      nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required - Prometheus'
    hosts:
      - prometheus.xxx
    path: /
    service:
      name: prometheus-prometheus-kube-prometheus-prometheus
      port: 9090
    tls:
      - secretName: prometheus-prod-secret
        hosts:
          - prometheus.xxx
  prometheusSpec:
    replicas: 1
    retention: 168h
    walCompression: true
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "myBlock"
          resources:
            requests:
              storage: 50Gi
    # scrape all service monitorings without correct labeling
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false

grafana:
  admin:
    existingSecret: grafana-admin-secret
    userKey: admin-user
    passwordKey: admin-password
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-issuer"
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/service-upstream: "true"
      # nginx-http-auth config:
      nginx.ingress.kubernetes.io/auth-type: basic
      # the name of the secret that contains the htpasswd hash (has to exist beforehand)
      nginx.ingress.kubernetes.io/auth-secret: prometheus-htpasswd
      # message to display on auth missing:
      nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required - Grafana'
    hosts:
      - grafana.xxx
    path: /
    service:
      name: prometheus-grafana
      port: 3000
    tls:
      - secretName: grafana-xxx
        hosts:
          - grafana.xxx
  persistence:
    enabled: true
    type: pvc
    size: 10Gi
    storageClassName: "myStorageClass"

Enter the command that you execute that is failing/misfunctioning.

helm install prometheus prometheus-community/kube-prometheus-stack --values values.yaml

Anything else we need to know?

No response

mschaefer-gresham commented 6 months ago

I got this error because the pod couldn't write to the persistent storage location.
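
If that is the cause, the first things to check are the ownership of the Grafana data directory on the PVC and the pod's fsGroup. A minimal sketch, assuming the bundled Grafana sub-chart's standard securityContext and initChownData keys (the numbers below are the chart's usual defaults, not values confirmed in this thread):

grafana:
  securityContext:
    runAsUser: 472
    runAsGroup: 472
    fsGroup: 472
  # init container that chowns the Grafana data directory on the PVC before Grafana starts
  initChownData:
    enabled: true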

martinbe1io commented 6 months ago

same issue here

repositories:
- name: prometheus-community 
  url: https://prometheus-community.github.io/helm-charts 

releases:
- name: kube-prometheus-stack
  namespace: monitoring
  chart: prometheus-community/kube-prometheus-stack
  version: 56.20.0
  installed: true
  values:
    - values.yaml

luislhl commented 1 week ago

> I got this error because the pod couldn't write to the persistent storage location.

I can confirm this. I was setting grafana.containerSecurityContext.readOnlyRootFilesystem: true, which was causing the problem.

Removing this fixed it for me.

It seems the container only needs to write to /tmp, so a better solution could be to mount only /tmp as writable instead. But I haven't tested this yet.
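
For anyone who wants to keep the hardened setting, here is a minimal, untested sketch of that idea, assuming the Grafana sub-chart's extraEmptyDirMounts key; verify both the key and the path against your chart version:

grafana:
  containerSecurityContext:
    readOnlyRootFilesystem: true
  # writable emptyDir at /tmp while the rest of the root filesystem stays read-only
  extraEmptyDirMounts:
    - name: tmp
      mountPath: /tmp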