Exporter SIGSEGV on Ubuntu 20 hosts (release-0.7.10+)

Robbie558 commented 3 days ago

What happened?

Kepler fails on all Ubuntu 20 hosts in my K8s cluster, producing the following logs:

```console
$ kubectl logs -n monitoring kepler-6hrms
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I1128 17:15:12.843579       1 exporter.go:103] Kepler running on version: v0.7.12-dirty
I1128 17:15:12.844340       1 config.go:293] using gCgroup ID in the BPF program: true
I1128 17:15:12.844406       1 config.go:295] kernel version: 5.4
I1128 17:15:12.844693       1 power.go:78] Unable to obtain power, use estimate method
I1128 17:15:12.844720       1 redfish.go:169] failed to get redfish credential file path
I1128 17:15:12.853436       1 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I1128 17:15:12.853459       1 power.go:79] using none to obtain power
E1128 17:15:12.853478       1 accelerator.go:154] [DUMMY] doesn't contain GPU
E1128 17:15:12.853507       1 exporter.go:154] failed to init GPU accelerators: no devices found
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I1128 17:15:12.854860       1 exporter.go:84] Number of CPUs: 2
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x87b273]

goroutine 1 [running]:
github.com/sustainable-computing-io/kepler/pkg/bpf.(*hardwarePerfEvents).close(0x0)
    /workspace/pkg/bpf/exporter.go:274 +0x13
github.com/sustainable-computing-io/kepler/pkg/bpf.(*exporter).Detach(0xc0001a4000)
    /workspace/pkg/bpf/exporter.go:195 +0x15a
github.com/sustainable-computing-io/kepler/pkg/bpf.NewExporter()
    /workspace/pkg/bpf/exporter.go:58 +0x13e
main.main()
    /workspace/cmd/exporter/exporter.go:159 +0x86b
```

Pods run as expected on Ubuntu 22 hosts in the same cluster.
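
The trace shows `(*hardwarePerfEvents).close` being called with a nil receiver (`close(0x0)`) from `Detach`, which `NewExporter` itself calls. That suggests, though the issue does not confirm it, a cleanup path that runs before the perf events were ever opened. A minimal sketch of that pattern and a nil-receiver guard follows; the struct fields, the `fds` slice, and the return values are hypothetical stand-ins for illustration, not Kepler's actual code.

```go
package main

import "fmt"

// hardwarePerfEvents is a hypothetical stand-in for the type named in the
// stack trace; the real struct's fields are not shown in this issue.
type hardwarePerfEvents struct {
	fds []int // placeholder for open perf event file descriptors
}

// close releases the events and reports how many were released.
// The nil check makes the method safe to call on a nil receiver,
// which is the situation the trace shows (close(0x0)).
func (h *hardwarePerfEvents) close() int {
	if h == nil {
		return 0
	}
	n := len(h.fds)
	h.fds = nil
	return n
}

// exporter is likewise a hypothetical stand-in.
type exporter struct {
	hwEvents *hardwarePerfEvents
}

// Detach cleans up whatever was attached. hwEvents may still be nil if
// attaching failed before the perf events were opened.
func (e *exporter) Detach() int {
	return e.hwEvents.close()
}

func main() {
	e := &exporter{} // simulate a failed attach: hwEvents was never set
	fmt.Println("closed:", e.Detach())
}
```

Without the `h == nil` check, reading `h.fds` on a nil receiver would panic with the same "invalid memory address or nil pointer dereference" error. A guard like this would only stop the crash; it would not explain why the attach step fails on these hosts in the first place.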

What did you expect to happen?

Kepler runs on Ubuntu 20 hosts

How can we reproduce it (as minimally and precisely as possible)?

Install the latest version via Helm (chart kepler-0.5.11, image release-0.7.12) on a cluster with virtualised Ubuntu 20 nodes.

Anything else we need to know?

Virtualised hosts running on Hyper-V

Kepler image tag

```console
quay.io/sustainable_computing_io/kepler:release-0.7.12
```

Kubernetes version

```console
Server Version: v1.31.2
```

Cloud provider or bare metal

OS version

```console
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux fh1-kubet01 5.4.0-200-generic #220-Ubuntu SMP Fri Sep 27 13:19:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

helm

Kepler deployment config

On Kubernetes:

```console
$ KEPLER_NAMESPACE=monitoring
$ kubectl describe ds -n monitoring kepler
Name:           kepler
Selector:       app.kubernetes.io/component=exporter,app.kubernetes.io/name=kepler
Node-Selector:  kubernetes.io/os=linux
Labels:         app.kubernetes.io/component=exporter
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=kepler
                app.kubernetes.io/version=release-0.7.12
                helm.sh/chart=kepler-0.5.11
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: kepler
                meta.helm.sh/release-namespace: monitoring
Desired Number of Nodes Scheduled: 7
Current Number of Nodes Scheduled: 7
Number of Nodes Scheduled with Up-to-date Pods: 7
Number of Nodes Scheduled with Available Pods: 1
Number of Nodes Misscheduled: 0
Pods Status:  7 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/component=exporter
                    app.kubernetes.io/name=kepler
  Service Account:  kepler
  Containers:
   kepler-exporter:
    Image:      quay.io/sustainable_computing_io/kepler:release-0.7.12
    Port:       9102/TCP
    Host Port:  9102/TCP
    Args:
      -v=$(KEPLER_LOG_LEVEL)
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:                      (v1:status.hostIP)
      NODE_NAME:                    (v1:spec.nodeName)
      METRIC_PATH:                 /metrics
      BIND_ADDRESS:                0.0.0.0:9102
      CGROUP_METRICS:              *
      CPU_ARCH_OVERRIDE:
      ENABLE_EBPF_CGROUPID:        true
      ENABLE_GPU:                  true
      ENABLE_PROCESS_METRICS:      false
      ENABLE_QAT:                  false
      EXPOSE_CGROUP_METRICS:       false
      EXPOSE_HW_COUNTER_METRICS:   true
      EXPOSE_IRQ_COUNTER_METRICS:  true
      KEPLER_LOG_LEVEL:            1
    Mounts:
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /usr/src from usr-src (rw)
  Volumes:
   lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
   tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
   proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
   usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src
    HostPathType:  Directory
```

Container runtime (CRI) and version (if applicable)

containerd://1.7.12

Related plugins (CNI, CSI, ...) and versions (if applicable)

CNI - Flannel

Robbie558 commented 2 days ago

The issue appears to have been introduced in release-0.7.10, as I am able to work around it by downgrading to release-0.7.8 of the Kepler image.

Robbie558 commented 2 days ago

Possibly similar to issue #636
