smart-edge-open / converged-edge-experience-kits

Source code for experience kits with Ansible-based deployment.
Apache License 2.0

Issue with Telemetry cAdvisor and Collectd #61

Closed amitinfo2k closed 3 years ago

amitinfo2k commented 4 years ago

We have deployed the release '20.06-ovn-fix' and are seeing an issue where the Telemetry Prometheus targets for cAdvisor and Collectd show DOWN (screenshot: Telemetry_Issues).
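
For reference, target health can also be checked straight from the Prometheus HTTP API. A minimal sketch, assuming the Prometheus server is reachable through a service named prometheus-server on port 9090 in the telemetry namespace (service name and port are assumptions):

# Forward the Prometheus server port to the local machine (service name/port are assumptions)
kubectl port-forward -n telemetry svc/prometheus-server 9090:9090 &
# List each scrape target's URL, health ("up"/"down") and last error from the targets API
curl -s http://localhost:9090/api/v1/targets | python -m json.tool | grep -E '"scrapeUrl"|"health"|"lastError"'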

Following are the cluster information and log snippets. Let me know what other information is required for debugging.

[root@openmaster cdn-transcode]# kubectl get nodes
NAME         STATUS   ROLES    AGE     VERSION
openmaster   Ready    master   3h40m   v1.18.4
openworker   Ready    worker   154m    v1.18.4
[root@openmaster cdn-transcode]# kubectl get pods -n telemetry
NAME                                          READY   STATUS      RESTARTS   AGE
cadvisor-s9w77                                2/2     Running     0          135m
collectd-9g4g2                                2/2     Running     0          135m
custom-metrics-apiserver-54699b845f-n96sh     1/1     Running     0          3h8m
grafana-6b79c984b-47snl                       2/2     Running     0          174m
otel-collector-7d5b75bbdf-5t9hb               2/2     Running     0          3h8m
prometheus-node-exporter-j2pn7                1/1     Running     0          135m
prometheus-server-76c96b9497-f48gp            3/3     Running     0          3h9m
telemetry-aware-scheduling-68467c4ccd-s24bj   2/2     Running     0          176m
telemetry-collector-certs-8d6q6               0/1     Completed   0          3h8m
telemetry-node-certs-jf2ct                    1/1     Running     0          135m

kubectl logs -f -n telemetry cadvisor-s9w77 -c cadvisor

2020/09/15 16:00:00 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:306)
W0915 16:00:04.666679       1 watcher.go:87] Error while processing event ("/sys/fs/cgroup/memory/system.slice/run-1749.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/run-1749.scope: no such file or directory
W0915 16:00:04.666804       1 watcher.go:87] Error while processing event ("/sys/fs/cgroup/devices/system.slice/run-1749.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/run-1749.scope: no such file or directory
W0915 16:00:04.666859       1 watcher.go:87] Error while processing event ("/sys/fs/cgroup/pids/system.slice/run-1749.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/system.slice/run-1749.scope: no such file or directory
2020/09/15 16:00:05 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:306)
2020/09/15 16:00:10 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:306)
2020/09/15 16:00:15 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:306)
2020/09/15 16:00:20 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:306)

kubectl logs -f -n telemetry cadvisor-s9w77 -c cadvisor-proxy

10.16.0.11 - - [15/Sep/2020:16:04:05 +0000] "GET /metrics HTTP/1.1" 200 720701 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:10 +0000] "GET /metrics HTTP/1.1" 200 245565 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:15 +0000] "GET /metrics HTTP/1.1" 200 393021 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:20 +0000] "GET /metrics HTTP/1.1" 200 491325 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:25 +0000] "GET /metrics HTTP/1.1" 200 311101 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:30 +0000] "GET /metrics HTTP/1.1" 200 458557 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:35 +0000] "GET /metrics HTTP/1.1" 499 0 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:40 +0000] "GET /metrics HTTP/1.1" 200 507709 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:45 +0000] "GET /metrics HTTP/1.1" 499 0 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:50 +0000] "GET /metrics HTTP/1.1" 200 327485 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:04:55 +0000] "GET /metrics HTTP/1.1" 499 0 "-" "Prometheus/2.16.0"
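
The 499 entries above are nginx's status for the client closing the connection before the response was finished, and the "superfluous response.WriteHeader" messages in the cadvisor container point the same way: Prometheus appears to abandon the scrape (most likely a timeout) while the large /metrics payload is still being written. With the port-forward from the earlier sketch still running, one way to see how long the scrapes actually take (the job label "cadvisor" is an assumption):

# Query recent cadvisor scrape durations; compare against the configured scrape_timeout
curl -s 'http://localhost:9090/api/v1/query?query=scrape_duration_seconds{job="cadvisor"}'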

kubectl logs -f -n telemetry collectd-9g4g2 collectd-proxy

10.16.0.11 - - [15/Sep/2020:16:03:41 +0000] "GET /metrics HTTP/1.1" 502 157 "-" "Prometheus/2.16.0"
10.16.0.11 - - [15/Sep/2020:16:03:46 +0000] "GET /metrics HTTP/1.1" 502 157 "-" "Prometheus/2.16.0"
2020/09/15 16:03:46 [error] 29#29: *1516 connect() failed (111: Connection refused) while connecting to upstream, client: 10.16.0.11, server: collectd, request: "GET /metrics HTTP/1.1", upstream: "http://[::1]:9104/metrics", host: "192.168.0.4:9103"
2020/09/15 16:03:51 [error] 29#29: *1516 connect() failed (111: Connection refused) while connecting to upstream, client: 10.16.0.11, server: collectd, request: "GET /metrics HTTP/1.1", upstream: "http://[::1]:9104/metrics", host: "192.168.0.4:9103"
10.16.0.11 - - [15/Sep/2020:16:03:51 +0000] "GET /metrics HTTP/1.1" 502 157 "-" "Prometheus/2.16.0"
2020/09/15 16:03:56 [error] 29#29: *1516 connect() failed (111: Connection refused) while connecting to upstream, client: 10.16.0.11, server: collectd, request: "GET /metrics HTTP/1.1", upstream: "http://[::1]:9104/metrics", host: "192.168.0.4:9103"
10.16.0.11 - - [15/Sep/2020:16:03:56 +0000] "GET /metrics HTTP/1.1" 502 157 "-" "Prometheus/2.16.0"
2020/09/15 16:04:01 [error] 29#29: *1516 connect() failed (111: Connection refused) while connecting to upstream, client: 10.16.0.11, server: collectd, request: "GET /metrics HTTP/1.1", upstream: "http://[::1]:9104/metrics", host: "192.168.0.4:9103"
10.16.0.11 - - [15/Sep/2020:16:04:01 +0000] "GET /metrics HTTP/1.1" 502 157 "-" "Prometheus/2.16.0"
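
The 502s here come from the nginx proxy itself: "connect() failed (111: Connection refused)" on upstream http://[::1]:9104 means nothing inside the pod is listening on the collectd exporter port, so the proxy has nothing to forward to. A rough way to confirm and recover, assuming the collectd image ships ss and that collectd runs as a DaemonSet (the pod naming suggests it; both are assumptions):

# Check whether anything is listening on :9104 inside the collectd container
kubectl exec -n telemetry collectd-9g4g2 -c collectd -- ss -tln
# If nothing is bound to 9104, deleting the pod lets the DaemonSet recreate it with a fresh collectd
kubectl delete pod -n telemetry collectd-9g4g2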

tomaszwesolowski commented 4 years ago

Hi, can you also provide logs from the collectd container inside the collectd pod?

amitinfo2k commented 4 years ago

Nothing is seen in the collectd container logs.

[root@openmaster ]# kubectl logs -f  -n telemetry collectd-9g4g2 openssl
Generating key...
Generating certificate signing request...
Signing certificate with /root/ca...
Signature ok
subject=CN = collectd
Getting CA Private Key

[root@openmaster ]# kubectl logs -f  -n telemetry collectd-9g4g2 collectd
^C
jakubrym commented 3 years ago

Hi, are you still experiencing this issue?

amitinfo2k commented 3 years ago

No, not right now. We applied the following workarounds:
-- Restarted the collectd pod and it started working.
-- For cAdvisor, we increased the scrape interval and redeployed Prometheus, which fixed the cAdvisor target.
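
For reference, a rough sketch of what the cAdvisor part of that workaround can look like, assuming the Prometheus scrape configuration lives in a ConfigMap named prometheus-server in the telemetry namespace and Prometheus runs as a Deployment of the same name (resource names and the exact values are assumptions):

# Edit the scrape config and raise the interval/timeout for the cadvisor job, for example:
#   scrape_interval: 30s
#   scrape_timeout: 25s
kubectl edit configmap -n telemetry prometheus-server
# Redeploy Prometheus so the new configuration is picked up
kubectl rollout restart -n telemetry deployment/prometheus-server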

jakubrym commented 3 years ago

Great! In that case I'm closing the issue. Please contact us if the issue appears again.