sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics
https://sustainable-computing.io
Apache License 2.0

Error Loading eBPF Objects (softirq_entry) #1765

Closed marvin-steinke closed 4 days ago

marvin-steinke commented 2 weeks ago

What happened?

Kepler seems to have problems with eBPF on my current setup. Kepler logs state:

failed to create eBPF exporter: error loading eBPF objects: field KeplerIrqTrace: program kepler_irq_trace: attach Tracing/TraceRawTp: raw_tp softirq_entry not supported

However, softirq_entry is present at /sys/kernel/debug/ on the host. I did find the similar issue #727, which points to a permission problem. Do I need to configure my host differently?

What did you expect to happen?

Installation succeeds.

How can we reproduce it (as minimally and precisely as possible)?

helm install kepler kepler/kepler --namespace kepler --create-namespace

Anything else we need to know?

OS: Ubuntu 20.04.3 LTS x86_64
Host: SYS-1019GP-TT 0123456789
Kernel: 5.4.0-192-generic
CPU: Intel Xeon Silver 4208 (16) @ 3.200GHz
GPU: NVIDIA Quadro RTX 5000
GPU: NVIDIA Quadro RTX 5000
Memory: 95208MiB

Kepler image tag

```
I0905 08:13:52.281482 1 gpu.go:38] Trying to initialize GPU collector using dcgm
W0905 08:13:52.281702 1 gpu_dcgm.go:104] There is no DCGM daemon running in the host: libdcgm.so not Found
W0905 08:13:52.281727 1 gpu_dcgm.go:108] Could not start DCGM. Error: libdcgm.so not Found
I0905 08:13:52.281733 1 gpu.go:45] Error initializing dcgm: not able to connect to DCGM: libdcgm.so not Found
I0905 08:13:52.281739 1 gpu.go:38] Trying to initialize GPU collector using nvidia-nvml
I0905 08:13:52.281789 1 gpu.go:45] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0905 08:13:52.281798 1 gpu.go:38] Trying to initialize GPU collector using dummy
I0905 08:13:52.281803 1 gpu.go:42] Using dummy to obtain gpu power
I0905 08:13:52.285110 1 exporter.go:100] Kepler running on version: v0.7.11
I0905 08:13:52.285158 1 config.go:284] using gCgroup ID in the BPF program: true
I0905 08:13:52.285182 1 config.go:286] kernel version: 5.4
I0905 08:13:52.285247 1 config.go:311] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0905 08:13:52.285302 1 power.go:53] use sysfs to obtain power
I0905 08:13:52.285315 1 redfish.go:169] failed to get redfish credential file path
I0905 08:13:52.289657 1 power.go:73] using acpi to obtain power
I0905 08:13:52.292851 1 exporter.go:89] Number of CPUs: 16
F0905 08:13:52.412014 1 exporter.go:140] failed to create eBPF exporter: error loading eBPF objects: field KeplerIrqTrace: program kepler_irq_trace: attach Tracing/TraceRawTp: raw_tp softirq_entry not supported
```

Kubernetes version

```console
$ kubectl version
Client Version: v1.30.4+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4+k3s1
```

Cloud provider or bare metal

Bare Metal

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux gpu01 5.4.0-192-generic #212-Ubuntu SMP Fri Jul 5 09:47:39 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

helm according to the docs with default values

Kepler deployment config

```console
$ kubectl describe pod -l app.kubernetes.io/name=kepler -n kepler
Name:             kepler-9qknx
Namespace:        kepler
Priority:         0
Service Account:  kepler
Node:             gpu01/130.149.248.50
Start Time:       Thu, 05 Sep 2024 08:20:19 +0000
Labels:           app.kubernetes.io/component=exporter
                  app.kubernetes.io/name=kepler
                  controller-revision-hash=5d8c546465
                  pod-template-generation=1
Annotations:
Status:           Running
IP:               130.149.248.50
IPs:
  IP:           130.149.248.50
Controlled By:  DaemonSet/kepler
Containers:
  kepler-exporter:
    Container ID:  containerd://9d1f5d14b5ee7e74dc723dc9734efdf1ad4f1d10eb548da8a2631240406107d2
    Image:         quay.io/sustainable_computing_io/kepler:release-0.7.11
    Image ID:      quay.io/sustainable_computing_io/kepler@sha256:72e7cd2e866c696900b9b9a33a72fc61a77d06e1c0300b08074784510da4013a
    Port:          9102/TCP
    Host Port:     9102/TCP
    Args:
      -v=$(KEPLER_LOG_LEVEL)
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 05 Sep 2024 08:21:57 +0000
      Finished:     Thu, 05 Sep 2024 08:21:57 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 05 Sep 2024 08:21:09 +0000
      Finished:     Thu, 05 Sep 2024 08:21:09 +0000
    Ready:          False
    Restart Count:  4
    Liveness:       http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:                      (v1:status.hostIP)
      NODE_NAME:                    (v1:spec.nodeName)
      METRIC_PATH:                 /metrics
      BIND_ADDRESS:                0.0.0.0:9102
      CGROUP_METRICS:              *
      CPU_ARCH_OVERRIDE:
      ENABLE_EBPF_CGROUPID:        true
      ENABLE_GPU:                  true
      ENABLE_PROCESS_METRICS:      false
      ENABLE_QAT:                  false
      EXPOSE_CGROUP_METRICS:       false
      EXPOSE_HW_COUNTER_METRICS:   true
      EXPOSE_IRQ_COUNTER_METRICS:  true
      KEPLER_LOG_LEVEL:            1
    Mounts:
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /usr/src from usr-src (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jvxx5 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src
    HostPathType:  Directory
  kube-api-access-jvxx5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  107s                 default-scheduler  Successfully assigned kepler/kepler-9qknx to gpu01
  Normal   Pulled     107s                 kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:release-0.7.11" in 564ms (564ms including waiting). Image size: 117793827 bytes.
  Normal   Pulled     106s                 kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:release-0.7.11" in 459ms (459ms including waiting). Image size: 117793827 bytes.
  Normal   Pulled     90s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:release-0.7.11" in 552ms (552ms including waiting). Image size: 117793827 bytes.
  Normal   Created    58s (x4 over 107s)   kubelet            Created container kepler-exporter
  Normal   Started    58s (x4 over 107s)   kubelet            Started container kepler-exporter
  Normal   Pulled     58s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:release-0.7.11" in 490ms (490ms including waiting). Image size: 117793827 bytes.
  Warning  BackOff    23s (x8 over 105s)   kubelet            Back-off restarting failed container kepler-exporter in pod kepler-9qknx_kepler(63598b16-bf4d-4d1b-af97-55672ac817b4)
  Normal   Pulling    11s (x5 over 107s)   kubelet            Pulling image "quay.io/sustainable_computing_io/kepler:release-0.7.11"
```

Container runtime (CRI) and version (if applicable)

Containerd v1.7.20-k3s1

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

marvin-steinke commented 6 days ago

So I installed a newer version of the kernel and this fixed the issue. I think the minimum kernel requirements in the docs should be updated (or maybe I overlooked something?). I'd be happy to do this. Where do you think this should be stated best and what version is the minimum based on the eBPF features used?

dave-tucker commented 6 days ago

Relates to: #1483

5.12 is the minimum supported kernel version: https://github.com/sustainable-computing-io/kepler/issues/1483#issuecomment-2144881310
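For anyone hitting the same error, a quick pre-install check of the node's kernel against that 5.12 minimum might look like the sketch below. This is not part of Kepler; the function names and the `MIN_KERNEL` constant are made up for illustration, and the script only compares the major and minor numbers of the release string (the same string `uname -r` prints).

```python
import platform
import re

# Minimum kernel version quoted in this thread (assumption: only
# major.minor matters for the comparison).
MIN_KERNEL = (5, 12)


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.4.0-192-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release: str, minimum: tuple[int, int] = MIN_KERNEL) -> bool:
    """Return True if the release is at or above the minimum (major, minor)."""
    # Compare as integer tuples: a plain string comparison would wrongly
    # rank "5.4" above "5.12".
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    release = platform.release()  # same value as `uname -r` on Linux
    if kernel_at_least(release):
        print(f"kernel {release}: meets the {MIN_KERNEL[0]}.{MIN_KERNEL[1]} minimum")
    else:
        print(f"kernel {release}: older than {MIN_KERNEL[0]}.{MIN_KERNEL[1]}")
```

With the kernel from this report, `kernel_at_least("5.4.0-192-generic")` returns `False`, which matches the failure seen above. Passing the check does not guarantee that the eBPF programs will load, since kernel configuration also matters; it only rules out the version mismatch discussed here.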