Closed andersonandrei closed 10 months ago
@rootfs @sunya-ch do we still need to install the kernel headers?
@andersonandrei you need to install the kernel headers in your node.
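For reference, on Debian-family nodes (the 4.19.0-24-amd64 kernel seen in this thread is Debian's), the headers package matching the running kernel is usually derived from uname -r. This is a sketch, not something stated in the thread:

```shell
# Construct the headers package name from the running kernel release.
pkg="linux-headers-$(uname -r)"
echo "$pkg"
# Then, on the node (requires root, Debian/Ubuntu):
#   apt-get install -y "$pkg"
```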
@marceloamaral , as I showed above, they are installed on both nodes:
> kubectl exec -ti debug-9trwq bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
root@paravance-49:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules
4.19.0-24-amd64
# ls /usr/lib/modules
4.19.0-24-amd64
> kubectl exec -ti debug-x6wlt bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
root@paravance-40:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules
4.19.0-24-amd64
# ls /usr/lib/modules
4.19.0-24-amd64
@andersonandrei kernel headers are installed in /lib/modules/4.19.0-24-amd64/build, can you double check?
I0713 14:19:44.756672 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
is addressed in this PR. Please use the latest image.
@andersonandrei can you run this on your Linux node? The reason eBPF cannot find the tracepoint might be related to the kernel compilation. You can see how Kepler deals with different tracepoint signatures here
grep finish_task_switch /proc/kallsyms
Here is my output:
# grep finish_task_switch /proc/kallsyms
ffffffffaaf1f730 t finish_task_switch
@rootfs , please find here both verifications, for the lib modules and the tracepoint:
First node verification:
> kubectl exec -ti debug-w4hjz bash
root@paravance-59:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules/4.19.0-24-amd64
build modules.alias modules.builtin modules.dep modules.devname modules.softdep modules.symbols.bin updates
kernel modules.alias.bin modules.builtin.bin modules.dep.bin modules.order modules.symbols source
# ls /lib/modules/4.19.0-24-amd64/build
Makefile Module.symvers arch include scripts tools
# grep finish_task_switch /proc/kallsyms
ffffffffaaca2f50 t finish_task_switch
Second node:
> kubectl exec -ti debug-n6fsg bash
root@paravance-62:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules/4.19.0-24-amd64
build modules.alias modules.builtin modules.dep modules.devname modules.softdep modules.symbols.bin updates
kernel modules.alias.bin modules.builtin.bin modules.dep.bin modules.order modules.symbols source
# ls /lib/modules/4.19.0-24-amd64/build
Makefile Module.symvers arch include scripts tools
# grep finish_task_switch /proc/kallsyms
ffffffff976a2f50 t finish_task_switch
@andersonandrei can you go to the kepler pods to check the kernel source?
kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/`uname -r`/build; cd /lib/modules/`uname -r`/; ls"
@rootfs ,
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/`uname -r`/build; cd /lib/modules/`uname -r`/; ls -l /usr/lib/modules/; ls"
ls: cannot access '/lib/modules/5.10.0-23-amd64/build': No such file or directory
bash: line 1: cd: /lib/modules/5.10.0-23-amd64/: No such file or directory
total 4
drwxr-xr-x 4 root root 4096 Jul 13 18:35 4.19.0-24-amd64
NGC-DL-CONTAINER-LICENSE afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
So does that mean Kepler is looking for a different kernel version? 4.19 is there too, but it searches for 5.10?
@rootfs @sunya-ch do we still need to install the kernel headers?
with libbpf no, but the default is using bcc which is still using the kernel header in the image file.
that's odd, I am not sure why uname -r turns into 5.10.0-23-amd64.
Can you run
kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/4.19.0-24-amd64; cd /lib/modules/4.19.0-24-amd64/build; ls"
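One likely explanation for the 5.10 oddity (my assumption; the thread doesn't confirm it): inside double quotes, the backtick `uname -r` is expanded by the local shell before kubectl exec ever runs, so the command asks about the kernel of the machine running kubectl, not the node. A minimal sketch:

```shell
# Backticks inside double quotes expand in the *current* shell; inside single
# quotes they are passed through literally and would expand in the pod instead.
local_expanded="ls /lib/modules/`uname -r`"   # expanded here, locally
deferred='ls /lib/modules/$(uname -r)'        # still literal; expands in the pod
echo "$local_expanded"
echo "$deferred"
```

With kubectl that would look like bash -c 'ls /lib/modules/$(uname -r)' (single quotes), assuming the shell inside the pod supports $( ).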
@rootfs ,
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/4.19.0-24-amd64; cd /lib/modules/4.19.0-24-amd64/build; ls"
total 4484
lrwxrwxrwx 1 root root 38 Apr 29 20:07 build -> /usr/src/linux-headers-4.19.0-24-amd64
drwxr-xr-x 12 root root 4096 May 16 12:22 kernel
-rw-r--r-- 1 root root 1143894 Jul 14 10:19 modules.alias
-rw-r--r-- 1 root root 1092710 Jul 14 10:19 modules.alias.bin
-rw-r--r-- 1 root root 4683 Apr 29 20:07 modules.builtin
-rw-r--r-- 1 root root 5999 Jul 14 10:19 modules.builtin.bin
-rw-r--r-- 1 root root 436046 Jul 14 10:19 modules.dep
-rw-r--r-- 1 root root 594001 Jul 14 10:19 modules.dep.bin
-rw-r--r-- 1 root root 456 Jul 14 10:19 modules.devname
-rw-r--r-- 1 root root 140020 Apr 29 20:07 modules.order
-rw-r--r-- 1 root root 876 Jul 14 10:19 modules.softdep
-rw-r--r-- 1 root root 507648 Jul 14 10:19 modules.symbols
-rw-r--r-- 1 root root 626742 Jul 14 10:19 modules.symbols.bin
lrwxrwxrwx 1 root root 39 Apr 29 20:07 source -> /usr/src/linux-headers-4.19.0-24-common
drwxr-xr-x 3 root root 4096 Jul 14 10:19 updates
bash: line 1: cd: /lib/modules/4.19.0-24-amd64/build: No such file or directory
NGC-DL-CONTAINER-LICENSE afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
So it seems that inside the pod there is no /build, even though there is one on the node?
I don't know if it would help, but I needed to modify the DaemonSet definition, adding DirectoryOrCreate as the volume type of /lib/modules:
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kepler-exporter
      sustainable-computing.io/app: kepler
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kepler-exporter
        sustainable-computing.io/app: kepler
    spec:
      containers:
      - args:
        - /usr/bin/kepler -v=1
        command:
        - /bin/sh
        - -c
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: quay.io/sustainable_computing_io/kepler:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 9102
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 10
        name: kepler-exporter
        ports:
        - containerPort: 9102
          name: http
        resources:
          requests:
            cpu: 100m
            memory: 400Mi
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /lib/modules
          name: lib-modules
        - mountPath: /sys
          name: tracing
        - mountPath: /proc
          name: proc
        - mountPath: /etc/config
          name: cfm
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      serviceAccountName: kepler-sa
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - hostPath:
          path: /lib/modules
          type: DirectoryOrCreate
        name: lib-modules
      - hostPath:
          path: /sys
          type: Directory
        name: tracing
      - hostPath:
          path: /proc
          type: Directory
        name: proc
      - configMap:
          name: kepler-cfm
        name: cfm
---
Otherwise, the Kepler pod remains stuck in ContainerCreating:
> kubectl describe pod kepler-exporter-8vvtl -n monitoring
Name: kepler-exporter-8vvtl
Namespace: monitoring
Priority: 0
Service Account: kepler-sa
Node: parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time: Fri, 14 Jul 2023 15:13:57 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=5b4d46847b
pod-template-generation=1
sustainable-computing.io/app=kepler
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID:
Image: quay.io/sustainable_computing_io/kepler:latest
Image ID:
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/config from cfm (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hpt5q (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: Directory
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
kube-api-access-hpt5q:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m5s default-scheduler Successfully assigned monitoring/kepler-exporter-8vvtl to parasilo-14.rennes.grid5000.fr
Warning FailedMount 53s (x11 over 7m5s) kubelet MountVolume.SetUp failed for volume "lib-modules" : hostPath type check failed: /lib/modules is not a directory
Warning FailedMount 31s (x3 over 5m3s) kubelet Unable to attach or mount volumes: unmounted volumes=[lib-modules], unattached volumes=[tracing proc cfm kube-api-access-hpt5q lib-modules]: timed out waiting for the condition
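One possible reason for the "is not a directory" check failure (my guess; not verified in the thread) is that /lib/modules on the host is a symlink rather than a plain directory, which a strict hostPath Directory check can reject even though the link resolves to a directory. The distinction is easy to see, demonstrated here on a temp path:

```shell
# A symlink to a directory passes -d (stat follows the link) but is still a
# symlink at the path level, which stricter type checks can refuse.
d=$(mktemp -d)
mkdir "$d/real"
ln -s "$d/real" "$d/link"
[ -d "$d/link" ] && echo "resolves to a directory"
[ -L "$d/link" ] && echo "but the path itself is a symlink"
# On the node: stat -c %F /lib/modules   # "directory" vs "symbolic link"
```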
@andersonandrei that sounds like the problem :D
In this case, /lib/modules/4.19.0-24-amd64/build is a symlink on your host. The Kepler pod cannot see the /usr/src directory on the host, so it cannot find the kernel source. Please use this config as an example to bind mount /usr/src into your Kepler pod. Note that the example mounts /usr/src/kernels, while in your setup the host path is /usr/src.
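The dangling-link behaviour described here can be reproduced in miniature. The paths below are stand-ins created under a temp dir, not the real host paths:

```shell
# Sketch: why a symlinked build dir can "exist" on the host yet be missing in
# a container that bind-mounts only /lib/modules.
root=$(mktemp -d)
mkdir -p "$root/usr/src/linux-headers-4.19.0-24-amd64"
mkdir -p "$root/lib/modules/4.19.0-24-amd64"
ln -s "$root/usr/src/linux-headers-4.19.0-24-amd64" \
      "$root/lib/modules/4.19.0-24-amd64/build"

# On the "host" the link resolves:
readlink -e "$root/lib/modules/4.19.0-24-amd64/build" >/dev/null && echo "host: ok"

# Simulate a container without the /usr/src mount: the link target is gone,
# so the link dangles and the kernel source cannot be found.
rm -r "$root/usr/src"
readlink -e "$root/lib/modules/4.19.0-24-amd64/build" >/dev/null || echo "container: dangling"
```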
@rootfs,
I just added /usr/src/kernels and /sys/kernel/debug, but again I needed to use DirectoryOrCreate, otherwise the pod did not start:
> kubectl describe pod kepler-exporter-f2lbq -n monitoring
Name: kepler-exporter-f2lbq
Namespace: monitoring
Priority: 0
Service Account: kepler-sa
Node: parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time: Fri, 14 Jul 2023 16:00:04 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=765cd98545
pod-template-generation=3
sustainable-computing.io/app=kepler
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID:
Image: quay.io/sustainable_computing_io/kepler:latest
Image ID:
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/config from cfm (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/sys/kernel/debug from kernel-debug (rw)
/usr/src/kernels from kernel-src (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hmtdw (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kernel-debug:
Type: HostPath (bare host directory volume)
Path: /sys/kernel/debug
HostPathType: Directory
kernel-src:
Type: HostPath (bare host directory volume)
Path: /usr/src/kernels
HostPathType: Directory
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: DirectoryOrCreate
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
kube-api-access-hmtdw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 60s default-scheduler Successfully assigned monitoring/kepler-exporter-f2lbq to parasilo-14.rennes.grid5000.fr
Warning FailedMount 29s (x7 over 60s) kubelet MountVolume.SetUp failed for volume "kernel-src" : hostPath type check failed: /usr/src/kernels is not a directory
However, even when the pod is running, the logs do not look good:
> kubectl logs kepler-exporter-jdh4c -n monitoring
I0714 14:02:12.859441 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0714 14:02:12.865879 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0714 14:02:12.873527 1 exporter.go:151] Kepler running on version: eba46bc
I0714 14:02:12.873554 1 config.go:212] using gCgroup ID in the BPF program: true
I0714 14:02:12.873575 1 config.go:214] kernel version: 4.19
I0714 14:02:12.873594 1 exporter.go:171] EnabledBPFBatchDelete: true
I0714 14:02:12.873680 1 power.go:53] use sysfs to obtain power
I0714 14:02:12.982707 1 watcher.go:67] Using in cluster k8s config
W0714 14:02:12.989342 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:12.989402 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:14.275674 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:14.275742 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:17.382623 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:17.382663 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:22.712307 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:22.712355 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
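The "pods is forbidden" messages look like missing RBAC for the kepler-sa ServiceAccount rather than an eBPF problem. Below is a sketch of the kind of ClusterRole/ClusterRoleBinding that would grant the list/watch access the watcher needs; the object names are made up here, and the upstream manifests ship their own RBAC objects:

```shell
# Sketch, not the upstream manifest: grant kepler-sa cluster-scope read access
# to pods (and nodes), which the watcher log says it is missing.
f=$(mktemp)
cat > "$f" <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kepler-clusterrole
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kepler-clusterrole-binding
subjects:
- kind: ServiceAccount
  name: kepler-sa
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: kepler-clusterrole
  apiGroup: rbac.authorization.k8s.io
EOF
echo "wrote $f"
# To apply: kubectl apply -f "$f"
```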
The queries do not work:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total
command terminated with exit code 7
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics"
curl: (7) Failed to connect to localhost port 9102: Connection refused
command terminated with exit code 7
Then I checked again the commands you asked me to run earlier:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/4.19.0-24-amd64; cd /lib/modules/4.19.0-24-amd64/build; ls"
total 4484
lrwxrwxrwx 1 root root 38 Apr 29 20:07 build -> /usr/src/linux-headers-4.19.0-24-amd64
drwxr-xr-x 12 root root 4096 May 16 12:22 kernel
-rw-r--r-- 1 root root 1143894 Jul 14 10:19 modules.alias
-rw-r--r-- 1 root root 1092710 Jul 14 10:19 modules.alias.bin
-rw-r--r-- 1 root root 4683 Apr 29 20:07 modules.builtin
-rw-r--r-- 1 root root 5999 Jul 14 10:19 modules.builtin.bin
-rw-r--r-- 1 root root 436046 Jul 14 10:19 modules.dep
-rw-r--r-- 1 root root 594001 Jul 14 10:19 modules.dep.bin
-rw-r--r-- 1 root root 456 Jul 14 10:19 modules.devname
-rw-r--r-- 1 root root 140020 Apr 29 20:07 modules.order
-rw-r--r-- 1 root root 876 Jul 14 10:19 modules.softdep
-rw-r--r-- 1 root root 507648 Jul 14 10:19 modules.symbols
-rw-r--r-- 1 root root 626742 Jul 14 10:19 modules.symbols.bin
lrwxrwxrwx 1 root root 39 Apr 29 20:07 source -> /usr/src/linux-headers-4.19.0-24-common
drwxr-xr-x 3 root root 4096 Jul 14 10:19 updates
bash: line 1: cd: /lib/modules/4.19.0-24-amd64/build: No such file or directory
NGC-DL-CONTAINER-LICENSE afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/`uname -r`/build; cd /lib/modules/`uname -r`/; ls"
ls: cannot access '/lib/modules/5.10.0-23-amd64/build': No such file or directory
bash: line 1: cd: /lib/modules/5.10.0-23-amd64/: No such file or directory
NGC-DL-CONTAINER-LICENSE afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
And I can see that the pods are now in CrashLoopBackOff:
monitoring kepler-exporter-jdh4c 0/1 CrashLoopBackOff 4 (83s ago) 8m1s
monitoring kepler-exporter-rs6lg 0/1 CrashLoopBackOff 4 (79s ago) 8m1s
> kubectl describe pod kepler-exporter-rs6lg -n monitoring
Name: kepler-exporter-rs6lg
Namespace: monitoring
Priority: 0
Service Account: kepler-sa
Node: parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time: Fri, 14 Jul 2023 16:02:09 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=54cf48cf9
pod-template-generation=4
sustainable-computing.io/app=kepler
Annotations: cni.projectcalico.org/containerID: d3ffb8f7025940f7f3a06d49b68e81276c4be9cda0afa9ce67aaf6090c7eb49e
cni.projectcalico.org/podIP: 10.42.2.20/32
cni.projectcalico.org/podIPs: 10.42.2.20/32
Status: Running
IP: 10.42.2.20
IPs:
IP: 10.42.2.20
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID: docker://87c70c90e955fa59301770ef9e139aca051e7f0c6358bf87b6f98ac62fdec52f
Image: quay.io/sustainable_computing_io/kepler:latest
Image ID: docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:7a3c21442015f0ce471aefc8425268d384eae9651f9d2543a8ec4b60be59b3d6
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Fri, 14 Jul 2023 16:07:51 +0200
Finished: Fri, 14 Jul 2023 16:08:51 +0200
Ready: False
Restart Count: 4
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/config from cfm (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/sys/kernel/debug from kernel-debug (rw)
/usr/src/kernels from kernel-src (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2qns (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kernel-debug:
Type: HostPath (bare host directory volume)
Path: /sys/kernel/debug
HostPathType: DirectoryOrCreate
kernel-src:
Type: HostPath (bare host directory volume)
Path: /usr/src/kernels
HostPathType: DirectoryOrCreate
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: DirectoryOrCreate
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
kube-api-access-p2qns:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m34s default-scheduler Successfully assigned monitoring/kepler-exporter-rs6lg to parasilo-14.rennes.grid5000.fr
Normal Pulled 7m31s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.659839237s
Normal Pulled 6m28s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.648322644s
Normal Pulled 5m14s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.708725285s
Warning Unhealthy 4m34s (x3 over 6m34s) kubelet Liveness probe failed: Get "http://10.42.2.20:9102/healthz": dial tcp 10.42.2.20:9102: connect: connection refused
Normal Pulling 3m44s (x4 over 7m33s) kubelet Pulling image "quay.io/sustainable_computing_io/kepler:latest"
Normal Pulled 3m43s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.658658027s
Normal Created 3m42s (x4 over 7m31s) kubelet Created container kepler-exporter
Normal Started 3m42s (x4 over 7m31s) kubelet Started container kepler-exporter
Warning BackOff 2m9s (x7 over 5m26s) kubelet Back-off restarting failed container
Are you using the latest manifests?
I built this manifest with make build-manifest OPTS="PROMETHEUS_DEPLOY" using the latest version of the repository and the latest image. For the last logs, I just modified it to add the /usr/src/kernels and /sys/kernel/debug entries as suggested.
I'm getting a similar error. See the log below:
I0905 12:47:40.674692 1 bcc_attacher.go:253] could not delete bpf table elements, err: Table.Delete: key 0x0: no such file or directory
Can anyone help me?
And I can see that the pods now have a lot of CrashLoopBackOff:
monitoring kepler-exporter-jdh4c 0/1 CrashLoopBackOff 4 (83s ago) 8m1s monitoring kepler-exporter-rs6lg 0/1 CrashLoopBackOff 4 (79s ago) 8m1s ~/kepler main ?1 kube local adasilva@frennes 16:09:28 > kubectl describe pod kepler-exporter-rs6lg -n monitoring Name: kepler-exporter-rs6lg Namespace: monitoring Priority: 0 Service Account: kepler-sa Node: parasilo-14.rennes.grid5000.fr/172.16.97.14 Start Time: Fri, 14 Jul 2023 16:02:09 +0200 Labels: app.kubernetes.io/component=exporter app.kubernetes.io/name=kepler-exporter controller-revision-hash=54cf48cf9 pod-template-generation=4 sustainable-computing.io/app=kepler Annotations: cni.projectcalico.org/containerID: d3ffb8f7025940f7f3a06d49b68e81276c4be9cda0afa9ce67aaf6090c7eb49e cni.projectcalico.org/podIP: 10.42.2.20/32 cni.projectcalico.org/podIPs: 10.42.2.20/32 Status: Running IP: 10.42.2.20 IPs: IP: 10.42.2.20 Controlled By: DaemonSet/kepler-exporter Containers: kepler-exporter: Container ID: docker://87c70c90e955fa59301770ef9e139aca051e7f0c6358bf87b6f98ac62fdec52f Image: quay.io/sustainable_computing_io/kepler:latest Image ID: docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:7a3c21442015f0ce471aefc8425268d384eae9651f9d2543a8ec4b60be59b3d6 Port: 9102/TCP Host Port: 0/TCP Command: /bin/sh -c Args: /usr/bin/kepler -v=1 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 255 Started: Fri, 14 Jul 2023 16:07:51 +0200 Finished: Fri, 14 Jul 2023 16:08:51 +0200 Ready: False Restart Count: 4 Requests: cpu: 100m memory: 400Mi Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5 Environment: NODE_NAME: (v1:spec.nodeName) Mounts: /etc/config from cfm (ro) /lib/modules from lib-modules (rw) /proc from proc (rw) /sys from tracing (rw) /sys/kernel/debug from kernel-debug (rw) /usr/src/kernels from kernel-src (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2qns (ro) Conditions: Type Status Initialized True Ready 
False ContainersReady False PodScheduled True Volumes: kernel-debug: Type: HostPath (bare host directory volume) Path: /sys/kernel/debug HostPathType: DirectoryOrCreate kernel-src: Type: HostPath (bare host directory volume) Path: /usr/src/kernels HostPathType: DirectoryOrCreate lib-modules: Type: HostPath (bare host directory volume) Path: /lib/modules HostPathType: DirectoryOrCreate tracing: Type: HostPath (bare host directory volume) Path: /sys HostPathType: Directory proc: Type: HostPath (bare host directory volume) Path: /proc HostPathType: Directory cfm: Type: ConfigMap (a volume populated by a ConfigMap) Name: kepler-cfm Optional: false kube-api-access-p2qns: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true QoS Class: Burstable Node-Selectors: <none> Tolerations: node-role.kubernetes.io/master:NoSchedule node.kubernetes.io/disk-pressure:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists node.kubernetes.io/pid-pressure:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists node.kubernetes.io/unschedulable:NoSchedule op=Exists Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 7m34s default-scheduler Successfully assigned monitoring/kepler-exporter-rs6lg to parasilo-14.rennes.grid5000.fr Normal Pulled 7m31s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.659839237s Normal Pulled 6m28s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.648322644s Normal Pulled 5m14s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.708725285s Warning Unhealthy 4m34s (x3 over 6m34s) kubelet Liveness probe failed: Get "http://10.42.2.20:9102/healthz": dial tcp 10.42.2.20:9102: connect: connection refused Normal 
Pulling 3m44s (x4 over 7m33s) kubelet Pulling image "quay.io/sustainable_computing_io/kepler:latest" Normal Pulled 3m43s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.658658027s Normal Created 3m42s (x4 over 7m31s) kubelet Created container kepler-exporter Normal Started 3m42s (x4 over 7m31s) kubelet Started container kepler-exporter Warning BackOff 2m9s (x7 over 5m26s) kubelet Back-off restarting failed container
@andersonandrei The problem is RBAC or scc issue. Are you using openshift-based cluster? If you build manifest from the command it should have permission check clusterrole. If your cluster is based on openshift, you need to add option
OPENSHIFT_DEPLOY
to bind user to scc.
reference: https://sustainable-computing.io/installation/kepler/
@rootfs, I just added the /usr/src/kernels and the /sys/kernel/debug mounts, but again I needed to use DirectoryOrCreate, otherwise the pod did not initiate:
> kubectl describe pod kepler-exporter-f2lbq -n monitoring
Name:             kepler-exporter-f2lbq
Namespace:        monitoring
Priority:         0
Service Account:  kepler-sa
Node:             parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time:       Fri, 14 Jul 2023 16:00:04 +0200
Labels:           app.kubernetes.io/component=exporter
                  app.kubernetes.io/name=kepler-exporter
                  controller-revision-hash=765cd98545
                  pod-template-generation=3
                  sustainable-computing.io/app=kepler
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    DaemonSet/kepler-exporter
Containers:
  kepler-exporter:
    Container ID:
    Image:         quay.io/sustainable_computing_io/kepler:latest
    Image ID:
    Port:          9102/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
    Args:
      /usr/bin/kepler -v=1
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/config from cfm (ro)
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /sys/kernel/debug from kernel-debug (rw)
      /usr/src/kernels from kernel-src (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hmtdw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kernel-debug:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:  Directory
  kernel-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src/kernels
    HostPathType:  Directory
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
  kube-api-access-hmtdw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    60s                default-scheduler  Successfully assigned monitoring/kepler-exporter-f2lbq to parasilo-14.rennes.grid5000.fr
  Warning  FailedMount  29s (x7 over 60s)  kubelet            MountVolume.SetUp failed for volume "kernel-src" : hostPath type check failed: /usr/src/kernels is not a directory
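As a side note on why the hostPath type matters here: with type Directory the kubelet refuses to mount when the path is missing on the host (the FailedMount event above and the pod stays Pending), while DirectoryOrCreate silently creates an empty directory, which lets the pod start but can hide the fact that the kernel sources are actually absent. A minimal sketch of the two variants (the kernel-src volume name follows the manifest discussed here):

```yaml
volumes:
  # Fails the mount (pod stays Pending) if the path is absent on the host:
  - name: kernel-src
    hostPath:
      path: /usr/src/kernels
      type: Directory
  # Always mounts, creating an empty directory on the host if needed:
  - name: kernel-src
    hostPath:
      path: /usr/src/kernels
      type: DirectoryOrCreate
```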
However, even when the pod is running, the logs do not look good:
> kubectl logs kepler-exporter-jdh4c -n monitoring
I0714 14:02:12.859441       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0714 14:02:12.865879       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0714 14:02:12.873527       1 exporter.go:151] Kepler running on version: eba46bc
I0714 14:02:12.873554       1 config.go:212] using gCgroup ID in the BPF program: true
I0714 14:02:12.873575       1 config.go:214] kernel version: 4.19
I0714 14:02:12.873594       1 exporter.go:171] EnabledBPFBatchDelete: true
I0714 14:02:12.873680       1 power.go:53] use sysfs to obtain power
I0714 14:02:12.982707       1 watcher.go:67] Using in cluster k8s config
W0714 14:02:12.989342       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:12.989402       1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:14.275674       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:14.275742       1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:17.382623       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:17.382663       1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:22.712307       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:22.712355       1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
The queries do not work:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total
command terminated with exit code 7
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics"
curl: (7) Failed to connect to localhost port 9102: Connection refused
command terminated with exit code 7
Then I re-ran the commands you asked about earlier:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/4.19.0-24-amd64; cd /lib/modules/4.19.0-24-amd64/build; ls"
total 4484
lrwxrwxrwx  1 root root      38 Apr 29 20:07 build -> /usr/src/linux-headers-4.19.0-24-amd64
drwxr-xr-x 12 root root    4096 May 16 12:22 kernel
-rw-r--r--  1 root root 1143894 Jul 14 10:19 modules.alias
-rw-r--r--  1 root root 1092710 Jul 14 10:19 modules.alias.bin
-rw-r--r--  1 root root    4683 Apr 29 20:07 modules.builtin
-rw-r--r--  1 root root    5999 Jul 14 10:19 modules.builtin.bin
-rw-r--r--  1 root root  436046 Jul 14 10:19 modules.dep
-rw-r--r--  1 root root  594001 Jul 14 10:19 modules.dep.bin
-rw-r--r--  1 root root     456 Jul 14 10:19 modules.devname
-rw-r--r--  1 root root  140020 Apr 29 20:07 modules.order
-rw-r--r--  1 root root     876 Jul 14 10:19 modules.softdep
-rw-r--r--  1 root root  507648 Jul 14 10:19 modules.symbols
-rw-r--r--  1 root root  626742 Jul 14 10:19 modules.symbols.bin
lrwxrwxrwx  1 root root      39 Apr 29 20:07 source -> /usr/src/linux-headers-4.19.0-24-common
drwxr-xr-x  3 root root    4096 Jul 14 10:19 updates
bash: line 1: cd: /lib/modules/4.19.0-24-amd64/build: No such file or directory
NGC-DL-CONTAINER-LICENSE  afs  bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/`uname -r`/build; cd /lib/modules/`uname -r`/; ls"
ls: cannot access '/lib/modules/5.10.0-23-amd64/build': No such file or directory
bash: line 1: cd: /lib/modules/5.10.0-23-amd64/: No such file or directory
NGC-DL-CONTAINER-LICENSE  afs  bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
And I can see that the pods are now in CrashLoopBackOff:
monitoring   kepler-exporter-jdh4c   0/1   CrashLoopBackOff   4 (83s ago)   8m1s
monitoring   kepler-exporter-rs6lg   0/1   CrashLoopBackOff   4 (79s ago)   8m1s
> kubectl describe pod kepler-exporter-rs6lg -n monitoring
Name:             kepler-exporter-rs6lg
Namespace:        monitoring
Priority:         0
Service Account:  kepler-sa
Node:             parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time:       Fri, 14 Jul 2023 16:02:09 +0200
Labels:           app.kubernetes.io/component=exporter
                  app.kubernetes.io/name=kepler-exporter
                  controller-revision-hash=54cf48cf9
                  pod-template-generation=4
                  sustainable-computing.io/app=kepler
Annotations:      cni.projectcalico.org/containerID: d3ffb8f7025940f7f3a06d49b68e81276c4be9cda0afa9ce67aaf6090c7eb49e
                  cni.projectcalico.org/podIP: 10.42.2.20/32
                  cni.projectcalico.org/podIPs: 10.42.2.20/32
Status:           Running
IP:               10.42.2.20
IPs:
  IP:           10.42.2.20
Controlled By:  DaemonSet/kepler-exporter
Containers:
  kepler-exporter:
    Container ID:  docker://87c70c90e955fa59301770ef9e139aca051e7f0c6358bf87b6f98ac62fdec52f
    Image:         quay.io/sustainable_computing_io/kepler:latest
    Image ID:      docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:7a3c21442015f0ce471aefc8425268d384eae9651f9d2543a8ec4b60be59b3d6
    Port:          9102/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
    Args:
      /usr/bin/kepler -v=1
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 14 Jul 2023 16:07:51 +0200
      Finished:     Fri, 14 Jul 2023 16:08:51 +0200
    Ready:          False
    Restart Count:  4
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/config from cfm (ro)
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /sys/kernel/debug from kernel-debug (rw)
      /usr/src/kernels from kernel-src (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2qns (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kernel-debug:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:  DirectoryOrCreate
  kernel-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src/kernels
    HostPathType:  DirectoryOrCreate
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
  kube-api-access-p2qns:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m34s                  default-scheduler  Successfully assigned monitoring/kepler-exporter-rs6lg to parasilo-14.rennes.grid5000.fr
  Normal   Pulled     7m31s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.659839237s
  Normal   Pulled     6m28s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.648322644s
  Normal   Pulled     5m14s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.708725285s
  Warning  Unhealthy  4m34s (x3 over 6m34s)  kubelet            Liveness probe failed: Get "http://10.42.2.20:9102/healthz": dial tcp 10.42.2.20:9102: connect: connection refused
  Normal   Pulling    3m44s (x4 over 7m33s)  kubelet            Pulling image "quay.io/sustainable_computing_io/kepler:latest"
  Normal   Pulled     3m43s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.658658027s
  Normal   Created    3m42s (x4 over 7m31s)  kubelet            Created container kepler-exporter
  Normal   Started    3m42s (x4 over 7m31s)  kubelet            Started container kepler-exporter
  Warning  BackOff    2m9s (x7 over 5m26s)   kubelet            Back-off restarting failed container
@andersonandrei The problem is an RBAC or SCC issue. Are you using an OpenShift-based cluster? If you build the manifest from the command, it should include the permission-check clusterrole. If your cluster is based on OpenShift, you need to add the option
OPENSHIFT_DEPLOY
to bind the user to the SCC.
Reference: https://sustainable-computing.io/installation/kepler/
@sunya-ch No, I'm not using an OpenShift-based cluster. Do you have any thoughts on how I can fix this RBAC or SCC problem in this case?
Thanks!
@andersonandrei Could you share the result of
kubectl get clusterrole kepler-clusterrole -o yaml
The pods resource should have been added to the resources list by https://github.com/sustainable-computing-io/kepler/commit/bc981ede83b3fdcaf01fa745ec68e7fe6dea405c for the apiserver update.
If pods is not there, you can just manually add it to the list and restart the pod.
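In case the rule does need to be added by hand, the edited section of the clusterrole would look roughly like this (a sketch matching the rules Kepler's manifest generates; only the pods entry would be the addition):

```yaml
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/proxy
  - nodes/stats
  - pods   # without this, the watcher gets "pods is forbidden" at cluster scope
  verbs:
  - get
  - watch
  - list
```

The edit can be made in place with kubectl edit clusterrole kepler-clusterrole, then restart the exporter pods as suggested above.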
@sunya-ch, here is the output of the command:
kubectl get clusterrole kepler-clusterrole -oyaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"sustainable-computing.io/app":"kepler"},"name":"kepler-clusterrole"},"rules":[{"apiGroups":[""],"resources":["nodes/metrics","nodes/proxy","nodes/stats","pods"],"verbs":["get","watch","list"]}]}
creationTimestamp: "2023-09-14T09:37:35Z"
labels:
sustainable-computing.io/app: kepler
name: kepler-clusterrole
resourceVersion: "2314"
uid: 47353cf2-9466-457d-b4eb-71449333fe83
rules:
- apiGroups:
- ""
resources:
- nodes/metrics
- nodes/proxy
- nodes/stats
- pods
verbs:
- get
- watch
- list
I just tried again, using both the latest and latest-libbpf images, and the problem persists :(
@andersonandrei can you share your yaml? or can you try this (generated from main branch)
kubectl apply -f https://gist.githubusercontent.com/rootfs/24f30eec07d955df9da1b10f8d403d8d/raw/98ab2d2ab93e85a1ed0d3f921f40a2ef013babcd/kepler-eks-bm.yaml
@rootfs, here is the file I'm using:
apiVersion: v1
kind: Namespace
metadata:
labels:
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/warn: privileged
security.openshift.io/scc.podSecurityLabelSync: "false"
sustainable-computing.io/app: kepler
name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
sustainable-computing.io/app: kepler
name: kepler-sa
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
labels:
app.kubernetes.io/component: prometheus
app.kubernetes.io/instance: k8s
app.kubernetes.io/name: prometheus
sustainable-computing.io/app: kepler
name: prometheus-k8s
namespace: monitoring
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
- apiGroups:
- networking.k8s.io
resources:
- ingresses
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
sustainable-computing.io/app: kepler
name: kepler-clusterrole-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kepler-clusterrole
subjects:
- kind: ServiceAccount
name: kepler-sa
namespace: monitoring
---
apiVersion: v1
data:
BIND_ADDRESS: 0.0.0.0:9102
CGROUP_METRICS: '*'
CPU_ARCH_OVERRIDE: ""
ENABLE_EBPF_CGROUPID: "true"
ENABLE_GPU: "true"
ENABLE_PROCESS_METRICS: "false"
EXPOSE_CGROUP_METRICS: "true"
EXPOSE_HW_COUNTER_METRICS: "true"
EXPOSE_IRQ_COUNTER_METRICS: "true"
EXPOSE_KUBELET_METRICS: "true"
KEPLER_LOG_LEVEL: "1"
KEPLER_NAMESPACE: monitoring
METRIC_PATH: /metrics
MODEL_CONFIG: |
CONTAINER_COMPONENTS_ESTIMATOR=false
# by default we use buildin weight file
# CONTAINER_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json
kind: ConfigMap
metadata:
labels:
sustainable-computing.io/app: kepler
name: kepler-cfm
namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
name: kepler-exporter
namespace: monitoring
spec:
clusterIP: None
ports:
- name: http
port: 9102
targetPort: http
selector:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
labels:
sustainable-computing.io/app: kepler
name: kepler-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
template:
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
spec:
containers:
- args:
- /usr/bin/kepler -v=1 -kernel-source-dir=/usr/share/kepler/kernel_sources
command:
- /bin/sh
- -c
env:
- name: NODE_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
image: quay.io/sustainable_computing_io/kepler:latest
imagePullPolicy: Always
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: 9102
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 60
successThreshold: 1
timeoutSeconds: 10
name: kepler-exporter
ports:
- containerPort: 9102
name: http
resources:
requests:
cpu: 100m
memory: 400Mi
securityContext:
privileged: true
volumeMounts:
- mountPath: /lib/modules
name: lib-modules
- mountPath: /sys
name: tracing
- mountPath: /proc
name: proc
- mountPath: /etc/kepler/kepler.config
name: cfm
readOnly: true
dnsPolicy: ClusterFirstWithHostNet
serviceAccountName: kepler-sa
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
volumes:
- hostPath:
path: /lib/modules
type: DirectoryOrCreate
name: lib-modules
- hostPath:
path: /sys
type: Directory
name: tracing
- hostPath:
path: /proc
type: Directory
name: proc
- configMap:
name: kepler-cfm
name: cfm
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
name: kepler-exporter
namespace: monitoring
spec:
endpoints:
- interval: 3s
port: http
relabelings:
- action: replace
regex: (.*)
replacement: $1
sourceLabels:
- __meta_kubernetes_pod_node_name
targetLabel: instance
scheme: http
jobLabel: app.kubernetes.io/name
namespaceSelector:
matchNames:
- monitoring
selector:
matchLabels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
@andersonandrei can you share your yaml? or can you try this (generated from main branch)
kubectl apply -f https://gist.githubusercontent.com/rootfs/24f30eec07d955df9da1b10f8d403d8d/raw/98ab2d2ab93e85a1ed0d3f921f40a2ef013babcd/kepler-eks-bm.yaml
It's still the same for me :(
Describing the pod:
> kubectl describe pod kepler-exporter-fx9w8 -n kepler
Name: kepler-exporter-fx9w8
Namespace: kepler
Priority: 0
Service Account: kepler-sa
Node: troll-3.grenoble.grid5000.fr/172.16.22.3
Start Time: Thu, 14 Sep 2023 16:08:00 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=694f8b95f9
pod-template-generation=1
sustainable-computing.io/app=kepler
Annotations: cni.projectcalico.org/containerID: ee33c9db3edec2668ba14f8ec26f528465c69b0ad89b83d85bc653b14da4222f
cni.projectcalico.org/podIP: 10.42.1.34/32
cni.projectcalico.org/podIPs: 10.42.1.34/32
Status: Running
IP: 10.42.1.34
IPs:
IP: 10.42.1.34
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID: docker://5a470911f3e43b8ce3a46b565807295a2f6a9224ab97c60d301937fadde285da
Image: quay.io/sustainable_computing_io/kepler:latest
Image ID: docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:ba56b57466790a2dfb785e4017715bfc3d7c46f059afc029d3e7b03511d69eef
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1 -kernel-source-dir=/usr/share/kepler/kernel_sources -redfish-cred-file-path=/etc/redfish/redfish.csv
State: Running
Started: Thu, 14 Sep 2023 16:08:02 +0200
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_IP: (v1:status.hostIP)
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/kepler/kepler.config from cfm (ro)
/etc/redfish from redfish (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qglfh (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: Directory
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
redfish:
Type: Secret (a volume populated by a Secret)
SecretName: redfish-4kh9d7bc7m
Optional: false
kube-api-access-qglfh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 87s default-scheduler Successfully assigned kepler/kepler-exporter-fx9w8 to troll-3.grenoble.grid5000.fr
Normal Pulling 87s kubelet Pulling image "quay.io/sustainable_computing_io/kepler:latest"
Normal Pulled 85s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.768524217s
Normal Created 85s kubelet Created container kepler-exporter
Normal Started 85s kubelet Started container kepler-exporter
Logs:
> kubectl logs kepler-exporter-fx9w8 -n kepler
I0914 14:08:02.838663 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0914 14:08:02.845285 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0914 14:08:02.853165 1 exporter.go:158] Kepler running on version: 5f33240
I0914 14:08:02.853179 1 config.go:272] using gCgroup ID in the BPF program: true
I0914 14:08:02.853198 1 config.go:274] kernel version: 4.19
I0914 14:08:02.853310 1 config.go:299] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0914 14:08:02.853316 1 exporter.go:170] LibbpfBuilt: false, BccBuilt: true
I0914 14:08:02.853348 1 config.go:205] kernel source dir is set to /usr/share/kepler/kernel_sources
I0914 14:08:02.853406 1 exporter.go:189] EnabledBPFBatchDelete: true
I0914 14:08:02.853430 1 power.go:54] use sysfs to obtain power
I0914 14:08:02.853452 1 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I0914 14:08:02.856971 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0914 14:08:02.883220 1 exporter.go:204] Initializing the GPU collector
I0914 14:08:08.888921 1 watcher.go:66] Using in cluster k8s config
modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.19.0-25-amd64
chdir(/lib/modules/4.19.0-25-amd64/build): No such file or directory
I0914 14:08:09.019217 1 bcc_attacher.go:80] failed to attach the bpf program: <nil>
I0914 14:08:09.019250 1 bcc_attacher.go:159] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to attach the bpf program: <nil>, from default kernel source.
I0914 14:08:09.019289 1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64
bpf: Failed to load program: Invalid argument
I0914 14:08:09.647169 1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64"
I0914 14:08:09.647198 1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64
bpf: Failed to load program: Invalid argument
I0914 14:08:10.213073 1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64"
I0914 14:08:10.213115 1 bcc_attacher.go:174] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 14:08:10.213184 1 exporter.go:241] failed to start : failed to attach bpf assets: failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 14:08:10.213274 1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0914 14:08:10.213289 1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0914 14:08:10.213297 1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0914 14:08:10.213321 1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0914 14:08:10.213497 1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
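The `modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.19.0-25-amd64` line suggests the headers present on the node do not match the *running* kernel (a kernel upgrade can leave headers behind for an older version). A hedged sketch for checking this on the node itself (the package name is assumed for Debian; adjust for your distribution):

```shell
# Check that the headers/build dir exists for the kernel that is actually running.
KVER="$(uname -r)"
if [ -d "/lib/modules/${KVER}/build" ]; then
  echo "headers present for ${KVER}"
else
  # Debian-style package name assumed; adjust for your distro.
  echo "headers missing for ${KVER} -- try: apt-get install linux-headers-${KVER}"
fi
```

Note this must be run on the host (e.g. via `nsenter --mount=/proc/1/ns/mnt` as shown earlier in this thread), not inside the Kepler container.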
Retrieving info:
> kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total
# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{command="",container_id="1e6d5a6a71c93151b7a2e5a04a25debd10ad87bee0cc80f683ab094afdb15818",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-10-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="1e6d5a6a71c93151b7a2e5a04a25debd10ad87bee0cc80f683ab094afdb15818",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-10-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="d241eb6b91d008a94c63ef826edf23b47f60648a634e8d608eac18692fddb567",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-9-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="d241eb6b91d008a94c63ef826edf23b47f60648a634e8d608eac18692fddb567",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-9-guest-linpack"} 0
@andersonandrei can you share your yaml? or can you try this (generated from main branch)
kubectl apply -f https://gist.githubusercontent.com/rootfs/24f30eec07d955df9da1b10f8d403d8d/raw/98ab2d2ab93e85a1ed0d3f921f40a2ef013babcd/kepler-eks-bm.yaml
Agreed, I will try this yaml.
RBAC looks good to me... Before changing the yaml, could you also share the results of the following commands, to confirm the RBAC setup?
kubectl get clusterrolebinding kepler-clusterrole-binding -oyaml
kubectl auth can-i list pod --as system:serviceaccount:monitoring:kepler-sa
Is the log below still showing?
W0714 14:02:12.989342 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
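For reference, the `pods is forbidden ... at the cluster scope` warning means the `kepler-sa` service account lacks cluster-scoped list permission. A minimal RBAC sketch that would grant it (names assumed to match this thread's manifest; the real Kepler manifest grants additional resources such as nodes and metrics):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kepler-clusterrole
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kepler-clusterrole-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kepler-clusterrole
subjects:
- kind: ServiceAccount
  name: kepler-sa
  namespace: monitoring
```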
@sunya-ch , I just changed back to my original yaml to do the checks you suggested. Here are the tests:
clusterrolebinding:
> kubectl get clusterrolebinding kepler-clusterrole-binding -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"labels":{"sustainable-computing.io/app":"kepler"},"name":"kepler-clusterrole-binding"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"kepler-clusterrole"},"subjects":[{"kind":"ServiceAccount","name":"kepler-sa","namespace":"monitoring"}]}
creationTimestamp: "2023-09-14T15:40:08Z"
labels:
sustainable-computing.io/app: kepler
name: kepler-clusterrole-binding
resourceVersion: "33876"
uid: d76b8c91-277a-471a-8848-5aa69fb92cce
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kepler-clusterrole
subjects:
- kind: ServiceAccount
name: kepler-sa
namespace: monitoring
Authorization:
> kubectl auth can-i list pod --as system:serviceaccount:monitoring:kepler-sa
yes
The Kepler logs no longer show the message 'cannot list resource "pods"'. Here are the full logs:
> kubectl logs kepler-exporter-rtcb2 -n monitoring
I0914 15:40:11.639224 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0914 15:40:11.646136 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0914 15:40:11.652823 1 exporter.go:158] Kepler running on version: 5f33240
I0914 15:40:11.652836 1 config.go:272] using gCgroup ID in the BPF program: true
I0914 15:40:11.652856 1 config.go:274] kernel version: 4.19
I0914 15:40:11.652907 1 config.go:299] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0914 15:40:11.652912 1 exporter.go:170] LibbpfBuilt: false, BccBuilt: true
I0914 15:40:11.652932 1 config.go:205] kernel source dir is set to /usr/share/kepler/kernel_sources
I0914 15:40:11.652980 1 exporter.go:189] EnabledBPFBatchDelete: true
I0914 15:40:11.653014 1 power.go:54] use sysfs to obtain power
I0914 15:40:11.653022 1 redfish.go:169] failed to get redfish credential file path
I0914 15:40:11.656419 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0914 15:40:11.680453 1 exporter.go:204] Initializing the GPU collector
I0914 15:40:17.686068 1 watcher.go:66] Using in cluster k8s config
modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.19.0-25-amd64
chdir(/lib/modules/4.19.0-25-amd64/build): No such file or directory
I0914 15:40:17.814031 1 bcc_attacher.go:80] failed to attach the bpf program: <nil>
I0914 15:40:17.814043 1 bcc_attacher.go:159] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to attach the bpf program: <nil>, from default kernel source.
I0914 15:40:17.814053 1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64
bpf: Failed to load program: Invalid argument
I0914 15:40:18.419320 1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64"
I0914 15:40:18.419346 1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64
bpf: Failed to load program: Invalid argument
I0914 15:40:18.980164 1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64"
I0914 15:40:18.980215 1 bcc_attacher.go:174] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 15:40:18.980246 1 exporter.go:241] failed to start : failed to attach bpf assets: failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 15:40:18.980393 1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0914 15:40:18.980403 1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0914 15:40:18.980430 1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0914 15:40:18.980445 1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0914 15:40:18.980608 1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0914 15:40:18.980798 1 exporter.go:276] Started Kepler in 7.327992991s
@andersonandrei can you try the kepler:latest-libbpf image?
@rootfs Yes.
Describing the pod:
> kubectl describe pod kepler-exporter-wljzc -n monitoring
Name: kepler-exporter-wljzc
Namespace: monitoring
Priority: 0
Service Account: kepler-sa
Node: troll-3.grenoble.grid5000.fr/172.16.22.3
Start Time: Mon, 18 Sep 2023 14:59:24 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=7f7468c7b
pod-template-generation=1
sustainable-computing.io/app=kepler
Annotations: cni.projectcalico.org/containerID: 8f3c015bd37497a3bdf2543da39f4dfca054dc6bbca6dc88dcdde4a5108334d9
cni.projectcalico.org/podIP: 10.42.1.28/32
cni.projectcalico.org/podIPs: 10.42.1.28/32
Status: Running
IP: 10.42.1.28
IPs:
IP: 10.42.1.28
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID: docker://9313c27196247028e6e1006f21f073e965606b2f0714630cd292d676a5a4a093
Image: quay.io/sustainable_computing_io/kepler:latest-libbpf
Image ID: docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:b71cfd5f5c291dfc59566b84dda2b7160e06d6b79455842853732ef5ac0a2a2f
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1 -kernel-source-dir=/usr/share/kepler/kernel_sources
State: Running
Started: Mon, 18 Sep 2023 14:59:53 +0200
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_IP: (v1:status.hostIP)
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/kepler/kepler.config from cfm (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z8dl8 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: DirectoryOrCreate
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
kube-api-access-z8dl8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m21s default-scheduler Successfully assigned monitoring/kepler-exporter-wljzc to troll-3.grenoble.grid5000.fr
Normal Pulling 4m21s kubelet Pulling image "quay.io/sustainable_computing_io/kepler:latest-libbpf"
Normal Pulled 3m55s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest-libbpf" in 25.653990033s
Normal Created 3m53s kubelet Created container kepler-exporter
Normal Started 3m53s kubelet Started container kepler-exporter
Getting the logs:
> kubectl logs kepler-exporter-wljzc -n monitoring
I0918 12:59:53.958956 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0918 12:59:53.964905 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0918 12:59:53.972969 1 exporter.go:158] Kepler running on version: dfe0145
I0918 12:59:53.972988 1 config.go:272] using gCgroup ID in the BPF program: true
I0918 12:59:53.973005 1 config.go:274] kernel version: 4.19
I0918 12:59:53.973088 1 config.go:299] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0918 12:59:53.973095 1 exporter.go:170] LibbpfBuilt: true, BccBuilt: false
I0918 12:59:53.973102 1 exporter.go:189] EnabledBPFBatchDelete: true
I0918 12:59:53.973141 1 power.go:54] use sysfs to obtain power
I0918 12:59:53.973158 1 redfish.go:169] failed to get redfish credential file path
I0918 12:59:53.977942 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0918 12:59:54.002536 1 exporter.go:204] Initializing the GPU collector
I0918 13:00:00.007965 1 watcher.go:66] Using in cluster k8s config
libbpf: loading /var/lib/kepler/bpfassets/amd64_kepler.bpf.o
libbpf: elf: section(3) tracepoint/sched/sched_switch, size 2456, link 0, flags 6, type=1
libbpf: sec 'tracepoint/sched/sched_switch': found program 'kepler_trace' at insn offset 0 (0 bytes), code size 307 insns (2456 bytes)
libbpf: elf: section(4) .reltracepoint/sched/sched_switch, size 384, link 27, flags 40, type=9
libbpf: elf: section(5) tracepoint/irq/softirq_entry, size 144, link 0, flags 6, type=1
libbpf: sec 'tracepoint/irq/softirq_entry': found program 'kepler_irq_trace' at insn offset 0 (0 bytes), code size 18 insns (144 bytes)
libbpf: elf: section(6) .reltracepoint/irq/softirq_entry, size 16, link 27, flags 40, type=9
libbpf: elf: section(7) .data, size 8, link 0, flags 3, type=1
libbpf: elf: section(8) .maps, size 352, link 0, flags 3, type=1
libbpf: elf: section(9) license, size 4, link 0, flags 3, type=1
libbpf: license of /var/lib/kepler/bpfassets/amd64_kepler.bpf.o is GPL
libbpf: elf: section(18) .BTF, size 5979, link 0, flags 0, type=1
libbpf: elf: section(20) .BTF.ext, size 2120, link 0, flags 0, type=1
libbpf: elf: section(27) .symtab, size 1056, link 1, flags 0, type=2
libbpf: looking for externs among 44 symbols...
libbpf: collected 0 externs total
libbpf: map 'processes': at sec_idx 8, offset 0.
libbpf: map 'processes': found type = 1.
libbpf: map 'processes': found key [6], sz = 4.
libbpf: map 'processes': found value [10], sz = 88.
libbpf: map 'processes': found max_entries = 32768.
libbpf: map 'pid_time': at sec_idx 8, offset 32.
libbpf: map 'pid_time': found type = 1.
libbpf: map 'pid_time': found key [6], sz = 4.
libbpf: map 'pid_time': found value [12], sz = 8.
libbpf: map 'pid_time': found max_entries = 32768.
libbpf: map 'cpu_cycles_hc_reader': at sec_idx 8, offset 64.
libbpf: map 'cpu_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_cycles': at sec_idx 8, offset 96.
libbpf: map 'cpu_cycles': found type = 2.
libbpf: map 'cpu_cycles': found key [6], sz = 4.
libbpf: map 'cpu_cycles': found value [12], sz = 8.
libbpf: map 'cpu_cycles': found max_entries = 128.
libbpf: map 'cpu_ref_cycles_hc_reader': at sec_idx 8, offset 128.
libbpf: map 'cpu_ref_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_ref_cycles': at sec_idx 8, offset 160.
libbpf: map 'cpu_ref_cycles': found type = 2.
libbpf: map 'cpu_ref_cycles': found key [6], sz = 4.
libbpf: map 'cpu_ref_cycles': found value [12], sz = 8.
libbpf: map 'cpu_ref_cycles': found max_entries = 128.
libbpf: map 'cpu_instructions_hc_reader': at sec_idx 8, offset 192.
libbpf: map 'cpu_instructions_hc_reader': found type = 4.
libbpf: map 'cpu_instructions_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_instructions_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_instructions_hc_reader': found max_entries = 128.
libbpf: map 'cpu_instructions': at sec_idx 8, offset 224.
libbpf: map 'cpu_instructions': found type = 2.
libbpf: map 'cpu_instructions': found key [6], sz = 4.
libbpf: map 'cpu_instructions': found value [12], sz = 8.
libbpf: map 'cpu_instructions': found max_entries = 128.
libbpf: map 'cache_miss_hc_reader': at sec_idx 8, offset 256.
libbpf: map 'cache_miss_hc_reader': found type = 4.
libbpf: map 'cache_miss_hc_reader': found key [2], sz = 4.
libbpf: map 'cache_miss_hc_reader': found value [6], sz = 4.
libbpf: map 'cache_miss_hc_reader': found max_entries = 128.
libbpf: map 'cache_miss': at sec_idx 8, offset 288.
libbpf: map 'cache_miss': found type = 2.
libbpf: map 'cache_miss': found key [6], sz = 4.
libbpf: map 'cache_miss': found value [12], sz = 8.
libbpf: map 'cache_miss': found max_entries = 128.
libbpf: map 'cpu_freq_array': at sec_idx 8, offset 320.
libbpf: map 'cpu_freq_array': found type = 2.
libbpf: map 'cpu_freq_array': found key [6], sz = 4.
libbpf: map 'cpu_freq_array': found value [6], sz = 4.
libbpf: map 'cpu_freq_array': found max_entries = 128.
libbpf: map 'amd64_ke.data' (global data): at sec_idx 7, offset 0, flags 400.
libbpf: map 11 is "amd64_ke.data"
libbpf: sec '.reltracepoint/sched/sched_switch': collecting relocation for section(3) 'tracepoint/sched/sched_switch'
libbpf: sec '.reltracepoint/sched/sched_switch': relo #0: insn #1 against 'counter'
libbpf: prog 'kepler_trace': found data map 11 (amd64_ke.data, sec 7, off 0) for insn 1
libbpf: sec '.reltracepoint/sched/sched_switch': relo #1: insn #12 against 'sample_rate'
libbpf: prog 'kepler_trace': found data map 11 (amd64_ke.data, sec 7, off 0) for insn 12
libbpf: sec '.reltracepoint/sched/sched_switch': relo #2: insn #32 against 'cpu_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 2 (cpu_cycles_hc_reader, sec 8, off 64) for insn #32
libbpf: sec '.reltracepoint/sched/sched_switch': relo #3: insn #51 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 8, off 96) for insn #51
libbpf: sec '.reltracepoint/sched/sched_switch': relo #4: insn #65 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 8, off 96) for insn #65
libbpf: sec '.reltracepoint/sched/sched_switch': relo #5: insn #70 against 'cpu_ref_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 4 (cpu_ref_cycles_hc_reader, sec 8, off 128) for insn #70
libbpf: sec '.reltracepoint/sched/sched_switch': relo #6: insn #83 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 8, off 160) for insn #83
libbpf: sec '.reltracepoint/sched/sched_switch': relo #7: insn #97 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 8, off 160) for insn #97
libbpf: sec '.reltracepoint/sched/sched_switch': relo #8: insn #102 against 'cpu_instructions_hc_reader'
libbpf: prog 'kepler_trace': found map 6 (cpu_instructions_hc_reader, sec 8, off 192) for insn #102
libbpf: sec '.reltracepoint/sched/sched_switch': relo #9: insn #119 against 'cpu_instructions'
libbpf: prog 'kepler_trace': found map 7 (cpu_instructions, sec 8, off 224) for insn #119
libbpf: sec '.reltracepoint/sched/sched_switch': relo #10: insn #132 against 'cpu_instructions'
libbpf: prog 'kepler_trace': found map 7 (cpu_instructions, sec 8, off 224) for insn #132
libbpf: sec '.reltracepoint/sched/sched_switch': relo #11: insn #137 against 'cache_miss_hc_reader'
libbpf: prog 'kepler_trace': found map 8 (cache_miss_hc_reader, sec 8, off 256) for insn #137
libbpf: sec '.reltracepoint/sched/sched_switch': relo #12: insn #149 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 8, off 288) for insn #149
libbpf: sec '.reltracepoint/sched/sched_switch': relo #13: insn #163 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 8, off 288) for insn #163
libbpf: sec '.reltracepoint/sched/sched_switch': relo #14: insn #171 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 8, off 320) for insn #171
libbpf: sec '.reltracepoint/sched/sched_switch': relo #15: insn #185 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 8, off 320) for insn #185
libbpf: sec '.reltracepoint/sched/sched_switch': relo #16: insn #197 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 8, off 320) for insn #197
libbpf: sec '.reltracepoint/sched/sched_switch': relo #17: insn #221 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 8, off 320) for insn #221
libbpf: sec '.reltracepoint/sched/sched_switch': relo #18: insn #230 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 8, off 32) for insn #230
libbpf: sec '.reltracepoint/sched/sched_switch': relo #19: insn #238 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 8, off 32) for insn #238
libbpf: sec '.reltracepoint/sched/sched_switch': relo #20: insn #250 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 8, off 32) for insn #250
libbpf: sec '.reltracepoint/sched/sched_switch': relo #21: insn #256 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 8, off 0) for insn #256
libbpf: sec '.reltracepoint/sched/sched_switch': relo #22: insn #276 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 8, off 0) for insn #276
libbpf: sec '.reltracepoint/sched/sched_switch': relo #23: insn #302 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 8, off 0) for insn #302
libbpf: sec '.reltracepoint/irq/softirq_entry': collecting relocation for section(5) 'tracepoint/irq/softirq_entry'
libbpf: sec '.reltracepoint/irq/softirq_entry': relo #0: insn #5 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 8, off 0) for insn #5
libbpf: map 'processes': created successfully, fd=10
libbpf: map 'pid_time': created successfully, fd=11
libbpf: map 'cpu_cycles_hc_reader': created successfully, fd=12
libbpf: map 'cpu_cycles': created successfully, fd=13
libbpf: map 'cpu_ref_cycles_hc_reader': created successfully, fd=14
libbpf: map 'cpu_ref_cycles': created successfully, fd=15
libbpf: map 'cpu_instructions_hc_reader': created successfully, fd=16
libbpf: map 'cpu_instructions': created successfully, fd=17
libbpf: map 'cache_miss_hc_reader': created successfully, fd=18
libbpf: map 'cache_miss': created successfully, fd=19
libbpf: map 'cpu_freq_array': created successfully, fd=20
libbpf: map 'amd64_ke.data': skipped auto-creating...
libbpf: prog 'kepler_trace': relo #0: poisoning insn #1 that loads map #11 'amd64_ke.data'
libbpf: prog 'kepler_trace': relo #1: poisoning insn #12 that loads map #11 'amd64_ke.data'
libbpf: prog 'kepler_trace': BPF program load failed: Invalid argument
libbpf: prog 'kepler_trace': -- BEGIN PROG LOAD LOG --
0: (bf) r8 = r1
1: <invalid BPF map reference>
BPF map 'amd64_ke.data' is referenced but wasn't created
-- END PROG LOAD LOG --
libbpf: prog 'kepler_trace': failed to load: -22
libbpf: failed to load object '/var/lib/kepler/bpfassets/amd64_kepler.bpf.o'
libbpf: prog 'kepler_trace': can't attach BPF program w/o FD (did you load it?)
libbpf: prog 'kepler_trace': failed to attach to tracepoint 'sched/sched_switch': Invalid argument
I0918 13:00:00.139399 1 bpf_perf.go:132] failed to attach bpf with libbpf: failed to attach sched/sched_switch: failed to attach tracepoint sched_switch to program kepler_trace: invalid argument
I0918 13:00:00.139416 1 exporter.go:241] failed to start : failed to attach bpf assets: no bcc build tag
I0918 13:00:00.139484 1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0918 13:00:00.139495 1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0918 13:00:00.139502 1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0918 13:00:00.139513 1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0918 13:00:00.139688 1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0918 13:00:00.139804 1 exporter.go:276] Started Kepler in 6.166857795s
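The `map 'amd64_ke.data': skipped auto-creating` followed by `invalid BPF map reference` is consistent with the node kernel (4.19) being too old for libbpf global-variable (`.data`) maps, which generally need roughly kernel 5.2 or newer (exact requirements depend on distro backports). A hedged quick check to run on the node:

```shell
# Rough feature probe for the libbpf-based Kepler image.
uname -r   # global .data/.bss BPF maps generally require a kernel newer than 4.19
if [ -e /sys/kernel/btf/vmlinux ]; then
  echo "BTF available"              # needed for CO-RE relocations
else
  echo "no BTF exposed by this kernel"
fi
```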
And then, looking at the output of Kepler:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total
# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{command="",container_id="2696397f01e6ad716f59037da407d9c53e4a4504981b4bf299b2e5973b81f872",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-4-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="2696397f01e6ad716f59037da407d9c53e4a4504981b4bf299b2e5973b81f872",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-4-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="3a71a1a3b2c59d457b2dd73eae974023024a92ea68bcf49163d9bad3030de682",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-5-guest-matmul"} 0
kepler_container_core_joules_total{command="",container_id="3a71a1a3b2c59d457b2dd73eae974023024a92ea68bcf49163d9bad3030de682",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-5-guest-matmul"} 0
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What happened?
Hello, I'm trying to use Kepler on a machine with access to the hardware counters, but it does not seem to be working. On my VMs, I can see it working with the estimations, but now that I'm deploying it on these new machines, I only see 0s as the measurements.
I tried to install Kepler via the Helm chart and by building it and applying the deployment file afterwards, but I had no success.
When I install it with Helm, I can see the following logs:
Then, when I query:
When I try to build it myself using
make build-manifest OPTS="PROMETHEUS_DEPLOY"
I can see in the logs: What is weird is that it complains about /lib/modules, which is installed on both of the machines that I'm using:
And finally, the result of the query is the same as above.
Can you help me, please?
PS: In fact, my goal is not to get the measurements from the real counters; I want to validate Kepler's estimations by cross-checking them against the power meters that are installed in these machines. So, if possible, I would like to keep using the estimations, but I don't know how to specify that either.
Can you help me solve both issues (the main one and the PS), please?
Thank you very much!!
What did you expect to happen?
To get the estimations from Kepler.
How can we reproduce it (as minimally and precisely as possible)?
By following the commands I showed above.
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version
Install tools
Kepler deployment config
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)