rancher / dashboard

The Rancher UI
https://rancher.com
Apache License 2.0

100% of the ingress-nginx/pushprox-ingress-nginx-client targets in cattle-monitoring-system namespace are down #7196

Closed ugurserhattoy closed 2 months ago

ugurserhattoy commented 2 years ago

Hi, there is an alert on the cluster dashboard: [100% of the ingress-nginx/pushprox-ingress-nginx-client targets in cattle-monitoring-system namespace are down](https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-targetdown)

I checked the logs of the pushprox-ingress-nginx-client pod:

level=info ts=2022-10-09T21:37:54.303Z caller=main.go:232 msg="Got scrape request" scrape_id=7f394d24-98f3-4eda-a795-f64967426f8a url=http://<agent-host-ip>:10254/metrics
level=error ts=2022-10-09T21:37:54.303Z caller=main.go:101 err="failed to scrape http://127.0.0.1:10254/metrics: Get \"http://127.0.0.1:10254/metrics\": dial tcp 127.0.0.1:10254: connect: connection refused"
level=info ts=2022-10-09T21:37:54.304Z caller=main.go:113 msg="Pushed failed scrape response"

And I checked the output of the Prometheus Targets page:

serviceMonitor/cattle-monitoring-system/rancher-monitoring-ingress-nginx/0 (0/1 up) Endpoint State Labels Last Scrape Scrape Duration Error
http://:10254/metrics DOWN component="ingress-nginx"endpoint="metrics"instance=":10254"job="ingress-nginx"namespace="cattle-monitoring-system"pod="pushprox-ingress-nginx-client-67fdccf9d-qxg8w"service="pushprox-ingress-nginx-client" 25.457s ago 3.319ms server returned HTTP status 500 Internal Server Error


From the Helm chart values.yaml:

rke2IngressNginx:
  clients:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - controller
            namespaces:
              - kube-system
            topologyKey: kubernetes.io/hostname
    deployment:
      enabled: true
      replicas: 1
    port: 10015
    tolerations:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
    useLocalhost: true
  component: ingress-nginx
  enabled: true
  kubeVersionOverrides:
    - constraint: <= 1.20
      values:
        clients:
          deployment:
            enabled: false
  metricsPort: 10254
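
(A note on what these values imply: with `useLocalhost: true` and `metricsPort: 10254`, the pushprox client forwards each scrape request to 127.0.0.1:10254 on the node it runs on, so the "connection refused" error means nothing on that node is serving metrics on that port. On RKE2, the bundled ingress-nginx controller does not expose its metrics endpoint unless it is enabled in the chart values. A minimal sketch of enabling it via an RKE2 HelmChartConfig manifest follows; the file path and value names are assumptions based on the upstream ingress-nginx chart, so verify them against the rke2-ingress-nginx chart shipped with your RKE2 version:)

```yaml
# /var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx-config.yaml
# Sketch only: overrides the bundled rke2-ingress-nginx chart values so the
# controller serves Prometheus metrics on port 10254 on each node.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-ingress-nginx
  namespace: kube-system
spec:
  valuesContent: |-
    controller:
      metrics:
        enabled: true
        port: 10254
```

Once RKE2 picks up the manifest and redeploys the controller, something should be listening on the node at port 10254 and the pushprox scrape should stop failing.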
<agent-node>:~# k get svc -n cattle-monitoring-system
NAME                                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                         ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   34h
prometheus-operated                           ClusterIP   None            <none>        9090/TCP                     34h
pushprox-ingress-nginx-client                 ClusterIP   10.43.150.200   <none>        10254/TCP                    34h
pushprox-ingress-nginx-proxy                  ClusterIP   10.43.179.36    <none>        8080/TCP                     34h
pushprox-kube-controller-manager-client       ClusterIP   10.43.60.93     <none>        10257/TCP                    34h
pushprox-kube-controller-manager-proxy        ClusterIP   10.43.1.192     <none>        8080/TCP                     34h
pushprox-kube-etcd-client                     ClusterIP   10.43.240.185   <none>        2381/TCP                     34h
pushprox-kube-etcd-proxy                      ClusterIP   10.43.21.180    <none>        8080/TCP                     34h
pushprox-kube-proxy-client                    ClusterIP   10.43.148.65    <none>        10249/TCP                    34h
pushprox-kube-proxy-proxy                     ClusterIP   10.43.226.62    <none>        8080/TCP                     34h
pushprox-kube-scheduler-client                ClusterIP   10.43.122.24    <none>        10259/TCP                    34h
pushprox-kube-scheduler-proxy                 ClusterIP   10.43.39.26     <none>        8080/TCP                     34h
rancher-monitoring-alertmanager               ClusterIP   10.43.20.177    <none>        9093/TCP                     34h
rancher-monitoring-grafana                    ClusterIP   10.43.84.131    <none>        80/TCP                       34h
rancher-monitoring-kube-state-metrics         ClusterIP   10.43.98.216    <none>        8080/TCP                     34h
rancher-monitoring-operator                   ClusterIP   10.43.22.230    <none>        443/TCP                      34h
rancher-monitoring-prometheus                 ClusterIP   10.43.97.193    <none>        9090/TCP                     34h
rancher-monitoring-prometheus-adapter         ClusterIP   10.43.171.251   <none>        443/TCP                      34h
rancher-monitoring-prometheus-node-exporter   ClusterIP   10.43.42.19     <none>        9796/TCP                     34h

<agent-node>:~# ss -tulpn
Netid     State      Recv-Q     Send-Q         Local Address:Port          Peer Address:Port     Process
udp       UNCONN     0          0              127.0.0.53%lo:53                 0.0.0.0:*         users:(("systemd-resolve",pid=917,fd=12))
udp       UNCONN     0          0                    0.0.0.0:111                0.0.0.0:*         users:(("rpcbind",pid=881,fd=5),("systemd",pid=1,fd=119))
udp       UNCONN     0          0                    0.0.0.0:8472               0.0.0.0:*
udp       UNCONN     0          0                  127.0.0.1:323                0.0.0.0:*         users:(("chronyd",pid=2297621,fd=5))
udp       UNCONN     0          0                       [::]:111                   [::]:*         users:(("rpcbind",pid=881,fd=7),("systemd",pid=1,fd=121))
udp       UNCONN     0          0                      [::1]:323                   [::]:*         users:(("chronyd",pid=2297621,fd=6))
tcp       LISTEN     0          4096               127.0.0.1:10248              0.0.0.0:*         users:(("kubelet",pid=2335393,fd=22))
tcp       LISTEN     0          4096               127.0.0.1:10249              0.0.0.0:*         users:(("kube-proxy",pid=2225,fd=13))
tcp       LISTEN     0          4096               127.0.0.1:9099               0.0.0.0:*         users:(("calico-node",pid=3607320,fd=9))
tcp       LISTEN     0          4096               127.0.0.1:6443               0.0.0.0:*         users:(("rke2",pid=2335348,fd=18))
tcp       LISTEN     0          4096               127.0.0.1:6444               0.0.0.0:*         users:(("rke2",pid=2335348,fd=8))
tcp       LISTEN     0          4096                 0.0.0.0:111                0.0.0.0:*         users:(("rpcbind",pid=881,fd=4),("systemd",pid=1,fd=118))
tcp       LISTEN     0          4096               127.0.0.1:10256              0.0.0.0:*         users:(("kube-proxy",pid=2225,fd=7))
tcp       LISTEN     0          4096           127.0.0.53%lo:53                 0.0.0.0:*         users:(("systemd-resolve",pid=917,fd=13))
tcp       LISTEN     0          128                  0.0.0.0:22                 0.0.0.0:*         users:(("sshd",pid=994,fd=3))
tcp       LISTEN     0          100                127.0.0.1:25                 0.0.0.0:*         users:(("master",pid=2290290,fd=13))
tcp       LISTEN     0          128                127.0.0.1:6010               0.0.0.0:*         users:(("sshd",pid=3610407,fd=10))
tcp       LISTEN     0          4096               127.0.0.1:10010              0.0.0.0:*         users:(("containerd",pid=2335363,fd=361))
tcp       LISTEN     0          4096                       *:10250                    *:*         users:(("kubelet",pid=2335393,fd=34))
tcp       LISTEN     0          4096                    [::]:111                   [::]:*         users:(("rpcbind",pid=881,fd=6),("systemd",pid=1,fd=120))
tcp       LISTEN     0          128                     [::]:22                    [::]:*         users:(("sshd",pid=994,fd=4))
tcp       LISTEN     0          100                    [::1]:25                    [::]:*         users:(("master",pid=2290290,fd=14))
tcp       LISTEN     0          4096                       *:9369                     *:*         users:(("pushprox-client",pid=2184913,fd=3))
tcp       LISTEN     0          128                    [::1]:6010                  [::]:*         users:(("sshd",pid=3610407,fd=9))
tcp       LISTEN     0          4096                       *:9091                     *:*         users:(("calico-node",pid=3607320,fd=14))
tcp       LISTEN     0          4096                       *:9796                     *:*         users:(("node_exporter",pid=2184852,fd=3))
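
(Observation on the `ss` output above: the pushprox client itself (port 9369) and node_exporter (9796) are listening, but nothing is on 10254, which matches the connection-refused error in the client logs. For completeness, a minimal Python sketch of the same check the scrape is effectively performing; the helper name is made up for illustration:)

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On the agent node this mirrors the failing scrape target:
#   port_is_listening("127.0.0.1", 10254)
# which returns False while no metrics endpoint is being served there.
```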

<agent-node>:~# k get po -n cattle-monitoring-system
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-rancher-monitoring-alertmanager-0            2/2     Running   0          34h
prometheus-rancher-monitoring-prometheus-0                3/3     Running   0          34h
pushprox-ingress-nginx-client-67fdccf9d-qxg8w             1/1     Running   0          54m
pushprox-ingress-nginx-proxy-5497b7dbd-p9mbt              1/1     Running   0          34h
pushprox-kube-controller-manager-client-6vcg6             1/1     Running   0          34h
pushprox-kube-controller-manager-proxy-64f6dc94c6-l5ml2   1/1     Running   0          34h
pushprox-kube-etcd-client-f6qx7                           1/1     Running   0          34h
pushprox-kube-etcd-proxy-55544d768d-pphxx                 1/1     Running   0          34h
pushprox-kube-proxy-client-bmxpx                          1/1     Running   0          34h
pushprox-kube-proxy-client-cx6r2                          1/1     Running   0          34h
pushprox-kube-proxy-client-m8knf                          1/1     Running   0          34h
pushprox-kube-proxy-proxy-85f89bcc4d-cp6cd                1/1     Running   0          34h
pushprox-kube-scheduler-client-wldxg                      1/1     Running   0          34h
pushprox-kube-scheduler-proxy-6cb664c86b-8c7mq            1/1     Running   0          34h
rancher-monitoring-grafana-586df56bff-nlvgz               3/3     Running   0          34h
rancher-monitoring-kube-state-metrics-77ddfd789b-tmjvn    1/1     Running   0          34h
rancher-monitoring-operator-79cdfbcf48-nh9ck              1/1     Running   0          34h
rancher-monitoring-prometheus-adapter-79d8db9697-nvsxv    1/1     Running   0          34h
rancher-monitoring-prometheus-node-exporter-jd8wx         1/1     Running   0          34h
rancher-monitoring-prometheus-node-exporter-s6b2z         1/1     Running   0          34h
rancher-monitoring-prometheus-node-exporter-vb52j         1/1     Running   0          34h

Why am I getting that alert? What am I doing wrong? I'm still trying to understand Rancher, RKE2, etc., so sorry if this is a simple question...

nwmac commented 2 months ago

Sorry this question was not responded to; it was asked quite some time ago. If it is still relevant, please re-open.