lukasmalkmus / rpi_exporter

A Raspberry Pi CPU temperature exporter.
Apache License 2.0

CPU Usage Spikes #20

Open ztnel opened 1 year ago

ztnel commented 1 year ago

Hi, I wanted to see if there is any insight into why the arm-exporter service is causing periodic spikes in CPU usage. Below is a screenshot from my Grafana instance for my k3s deployment, with the pods filtered to those running arm-exporter:

[Screenshot 2023-04-27 18:26:14: Grafana panel of CPU usage for the pods running arm-exporter, showing periodic spikes]

My Prometheus scrape interval is set to 30s, and some spikes register their peak value for two data points in a row, which means a single spike can last for over 30s:

[Screenshot 2023-04-27 18:38:53: consecutive peak samples on the CPU usage graph]
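For context, the panel is built on container_cpu_usage_seconds_total (see my follow-up below); the query is roughly of this shape, though the exact dashboard query may differ:

# CPU usage (in cores) per arm-exporter pod, averaged over 5m windows
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="monitoring", pod=~"arm-exporter.*"}[5m])
)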

Pod Details:

christiansargusingh  18:43:51 windfarm/k3s/monitoring > kubectl describe po arm-exporter-6c4jk -n monitoring
Name:         arm-exporter-6c4jk
Namespace:    monitoring
Priority:     0
Node:         node0/192.168.2.106
Start Time:   Thu, 27 Apr 2023 03:16:48 -0400
Labels:       controller-revision-hash=5c959cd5bf
              k8s-app=arm-exporter
              pod-template-generation=1
Annotations:  <none>
Status:       Running
IP:           10.42.7.3
IPs:
  IP:           10.42.7.3
Controlled By:  DaemonSet/arm-exporter
Containers:
  arm-exporter:
    Container ID:  containerd://dcee23b4abf1cb1540d097001d9de6535a1739583f2f4d4e86c8dc653225143f
    Image:         carlosedp/arm_exporter:latest
    Image ID:      docker.io/carlosedp/arm_exporter@sha256:c2510142e3824686cba8af75826737a8158b25648e29867e262d26f553de5211
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/rpi_exporter
      --web.listen-address=127.0.0.1:9243
    State:          Running
      Started:      Thu, 27 Apr 2023 03:16:54 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:        50m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q9dcb (ro)
  kube-rbac-proxy:
    Container ID:  containerd://96278b1c97bc033ac53edd7fa906b9a70dea6a93fb82062edddc4f96b89dda80
    Image:         carlosedp/kube-rbac-proxy:v0.5.0
    Image ID:      docker.io/carlosedp/kube-rbac-proxy@sha256:6716a0ee90f058b6052ca37ca8d5effebd7321c766b5de68069cd84383b85780
    Port:          9243/TCP
    Host Port:     9243/TCP
    Args:
      --secure-listen-address=$(IP):9243
      --upstream=http://127.0.0.1:9243/
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
    State:          Running
      Started:      Thu, 27 Apr 2023 03:17:02 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     20m
      memory:  40Mi
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:
      IP:   (v1:status.podIP)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q9dcb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-q9dcb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>

System Details: RPi CM3B+ Compute Module, 32-bit, Hypriot OS version 1.12.3 (Docker 19.03.12, kernel 4.19.97)

Any insight would be appreciated.

lukasmalkmus commented 1 year ago

Hm, interesting. Which features of the exporter are you using? The exporter has three collectors: cpu, gpu and textfile. Every collector exposes individual scrape metrics: rpi_scrape_{name}_collector_duration_seconds and rpi_scrape_{name}_collector_success. You could also run rpi_exporter --help and use the appropriate flags to disable some collectors and see if that solves your problem. E.g. the gpu collector needs the correct vcgencmd path set, which might not be the case by default: https://github.com/lukasmalkmus/rpi_exporter/blob/master/collector/gpu.go#L27-L31.
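For example, you can compare the per-collector scrape durations in Prometheus to find out which collector is the expensive one (metric names as above):

# time spent per scrape in each collector
rpi_scrape_cpu_collector_duration_seconds
rpi_scrape_gpu_collector_duration_seconds
rpi_scrape_textfile_collector_duration_seconds

# whether a collector is failing on every scrape
rpi_scrape_gpu_collector_success == 0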

Unfortunately, the README isn't quite up to date and this repo probably deserves a proper makeover... But time is my enemy :) I refactored another exporter of mine quite recently, so chances are not too bad that I can get to this one as well.

ztnel commented 1 year ago

I dug up the manifest for the arm-exporter daemonset. It looks like it's just running with default flags.

containers:
  - command:
      - /bin/rpi_exporter
      - '--web.listen-address=127.0.0.1:9243'
    image: 'carlosedp/arm_exporter:latest'
    name: arm-exporter
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 50m
        memory: 50Mi
    securityContext:
      privileged: true
  - args:
      - '--secure-listen-address=$(IP):9243'
      - '--upstream=http://127.0.0.1:9243/'
      - >-
        --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
    env:
      - name: IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP

I'm not super familiar with Kubernetes manifests, but how does the container access vcgencmd metrics on the Pi? I assume there is some kind of volume mount in play (sketched below)? When I look for vcgencmd on one of my nodes, it is not at the default /opt/vc/bin/vcgencmd:

HypriotOS/armv7: node@node0 in ~
$ which vcgencmd
/usr/bin/vcgencmd
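If it is a mount, I would have expected something along these lines in the DaemonSet, which I don't see in the manifest above (just a sketch of my assumption, not what is actually deployed):

# hypothetical hostPath mount exposing the host's vcgencmd to the container
# (volumeMounts goes under the arm-exporter container spec)
volumeMounts:
  - name: vcgencmd
    mountPath: /usr/bin/vcgencmd
    readOnly: true
# (volumes goes under the pod spec)
volumes:
  - name: vcgencmd
    hostPath:
      path: /usr/bin/vcgencmd
      type: File

That said, vcgencmd also talks to the VideoCore device (e.g. /dev/vchiq), which might be what the privileged: true security context is for.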
ztnel commented 1 year ago

I think I have an idea now. The pod CPU usage panel uses the container_cpu_usage_seconds_total metric, which I think is displayed relative to the resource limit set in the DaemonSet, in this case 100m (0.1 cores), which is relatively small. If I put my node CPU usage graph inline with the pod CPU usage graph, I can see that these spikes in the pods only correspond to roughly 25% CPU usage on the node:

[Screenshot 2023-04-28 11:56:43: node CPU usage graph aligned with the pod CPU usage graph]

I think it's still pretty large for an exporter service. Not sure if you have any benchmarks available to profile the container runtime.
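To sanity-check that interpretation, the same usage can be expressed both in absolute cores and as a fraction of the pod's CPU limits (a sketch; kube-state-metrics metric and label names vary between versions):

# usage in cores
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="monitoring", pod=~"arm-exporter.*"}[5m]))

# usage as a fraction of the pod's combined CPU limits (100m + 20m = 120m here)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="monitoring", pod=~"arm-exporter.*"}[5m]))
/
sum by (pod) (kube_pod_container_resource_limits{namespace="monitoring", pod=~"arm-exporter.*", resource="cpu"})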

lukasmalkmus commented 1 year ago

[...] Not sure if you have any benchmarks available to profile the container runtime.

Unfortunately, no. I think you should play around with turning the different collectors on and off and with setting the correct path for vcgencmd, and see how that affects the metrics, to pin down the possible problem. For toggling individual collectors, you don't even need to mess with the rpi_exporter config. Just like with Node Exporter, you can tweak which collectors to enable via the Prometheus config: https://github.com/prometheus/node_exporter#filtering-enabled-collectors.
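A minimal sketch of what that could look like, assuming rpi_exporter honors the collect[] URL parameter the same way Node Exporter does (target simplified; in your k3s setup the scrape actually goes through kube-rbac-proxy and service discovery):

scrape_configs:
  - job_name: 'arm-exporter'
    params:
      collect[]:
        - cpu        # keep only the cpu collector
        # - gpu      # leave gpu out to see if the spikes disappear
        # - textfile
    static_configs:
      - targets: ['192.168.2.106:9243']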