Closed andersonandrei closed 10 months ago
@rootfs @sunya-ch do we still need to install the kernel headers?
@andersonandrei you need to install the kernel headers in your node.
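For reference, on Debian-family nodes (the 4.19.0-24-amd64 kernel seen in this thread is Debian's), the headers package matching the running kernel is usually derived from uname -r. This is a sketch, not something stated in the thread:

```shell
# Construct the headers package name from the running kernel release.
pkg="linux-headers-$(uname -r)"
echo "$pkg"
# Then, on the node (requires root, Debian/Ubuntu):
#   apt-get install -y "$pkg"
```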
@marceloamaral , as I showed above, they are installed on both nodes:
> kubectl exec -ti debug-9trwq bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
root@paravance-49:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules
4.19.0-24-amd64
# ls /usr/lib/modules
4.19.0-24-amd64
> kubectl exec -ti debug-x6wlt bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
root@paravance-40:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules
4.19.0-24-amd64
# ls /usr/lib/modules
4.19.0-24-amd64
@andersonandrei kernel headers are installed in /lib/modules/4.19.0-24-amd64/build, can you double check?
I0713 14:19:44.756672 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
is addressed in this PR. Please use the latest image.
@andersonandrei can you run this on your Linux node? The reason eBPF cannot find the tracepoint might be related to the kernel compilation. You can see how Kepler deals with different tracepoint signatures here
grep finish_task_switch /proc/kallsyms
Here is my output:
# grep finish_task_switch /proc/kallsyms
ffffffffaaf1f730 t finish_task_switch
@rootfs , please find here both verifications, for the lib modules and the tracepoint:
First node verification:
> kubectl exec -ti debug-w4hjz bash
root@paravance-59:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules/4.19.0-24-amd64
build modules.alias modules.builtin modules.dep modules.devname modules.softdep modules.symbols.bin updates
kernel modules.alias.bin modules.builtin.bin modules.dep.bin modules.order modules.symbols source
# ls /lib/modules/4.19.0-24-amd64/build
Makefile Module.symvers arch include scripts tools
# grep finish_task_switch /proc/kallsyms
ffffffffaaca2f50 t finish_task_switch
Second node:
> kubectl exec -ti debug-n6fsg bash
root@paravance-62:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules/4.19.0-24-amd64
build modules.alias modules.builtin modules.dep modules.devname modules.softdep modules.symbols.bin updates
kernel modules.alias.bin modules.builtin.bin modules.dep.bin modules.order modules.symbols source
# ls /lib/modules/4.19.0-24-amd64/build
Makefile Module.symvers arch include scripts tools
# grep finish_task_switch /proc/kallsyms
ffffffff976a2f50 t finish_task_switch
@andersonandrei can you go to the kepler pods to check the kernel source?
kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/`uname -r`/build; cd /lib/modules/`uname -r`/; ls"
@rootfs ,
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/`uname -r`/build; cd /lib/modules/`uname -r`/; ls -l /usr/lib/modules/; ls"
ls: cannot access '/lib/modules/5.10.0-23-amd64/build': No such file or directory
bash: line 1: cd: /lib/modules/5.10.0-23-amd64/: No such file or directory
total 4
drwxr-xr-x 4 root root 4096 Jul 13 18:35 4.19.0-24-amd64
NGC-DL-CONTAINER-LICENSE afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
So does that mean Kepler is looking for a different kernel version? 4.19 is there too, but it searches for 5.10?
@rootfs @sunya-ch do we still need to install the kernel headers?
with libbpf no, but the default is using bcc which is still using the kernel header in the image file.
that's odd, I am not sure why uname -r turns into 5.10.0-23-amd64.
Can you run
kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/4.19.0-24-amd64; cd /lib/modules/4.19.0-24-amd64/build; ls"
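One likely explanation for the 5.10 oddity (my assumption; the thread doesn't confirm it): inside double quotes, the backtick `uname -r` is expanded by the local shell before kubectl exec ever runs, so the command asks about the kernel of the machine running kubectl, not the node. A minimal sketch:

```shell
# Backticks inside double quotes expand in the *current* shell; inside single
# quotes they are passed through literally and would expand in the pod instead.
local_expanded="ls /lib/modules/`uname -r`"   # expanded here, locally
deferred='ls /lib/modules/$(uname -r)'        # still literal; expands in the pod
echo "$local_expanded"
echo "$deferred"
```

With kubectl that would look like bash -c 'ls /lib/modules/$(uname -r)' (single quotes), assuming the shell inside the pod supports $( ).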
@rootfs ,
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/4.19.0-24-amd64; cd /lib/modules/4.19.0-24-amd64/build; ls"
total 4484
lrwxrwxrwx 1 root root 38 Apr 29 20:07 build -> /usr/src/linux-headers-4.19.0-24-amd64
drwxr-xr-x 12 root root 4096 May 16 12:22 kernel
-rw-r--r-- 1 root root 1143894 Jul 14 10:19 modules.alias
-rw-r--r-- 1 root root 1092710 Jul 14 10:19 modules.alias.bin
-rw-r--r-- 1 root root 4683 Apr 29 20:07 modules.builtin
-rw-r--r-- 1 root root 5999 Jul 14 10:19 modules.builtin.bin
-rw-r--r-- 1 root root 436046 Jul 14 10:19 modules.dep
-rw-r--r-- 1 root root 594001 Jul 14 10:19 modules.dep.bin
-rw-r--r-- 1 root root 456 Jul 14 10:19 modules.devname
-rw-r--r-- 1 root root 140020 Apr 29 20:07 modules.order
-rw-r--r-- 1 root root 876 Jul 14 10:19 modules.softdep
-rw-r--r-- 1 root root 507648 Jul 14 10:19 modules.symbols
-rw-r--r-- 1 root root 626742 Jul 14 10:19 modules.symbols.bin
lrwxrwxrwx 1 root root 39 Apr 29 20:07 source -> /usr/src/linux-headers-4.19.0-24-common
drwxr-xr-x 3 root root 4096 Jul 14 10:19 updates
bash: line 1: cd: /lib/modules/4.19.0-24-amd64/build: No such file or directory
NGC-DL-CONTAINER-LICENSE afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
So it seems that inside the pod there is no /build, even though there is one on the node?
I don't know if it would help, but I needed to modify the DaemonSet definition, adding DirectoryOrCreate as the volume type of /lib/modules:
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kepler-exporter
      sustainable-computing.io/app: kepler
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kepler-exporter
        sustainable-computing.io/app: kepler
    spec:
      containers:
      - args:
        - /usr/bin/kepler -v=1
        command:
        - /bin/sh
        - -c
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: quay.io/sustainable_computing_io/kepler:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 9102
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 10
        name: kepler-exporter
        ports:
        - containerPort: 9102
          name: http
        resources:
          requests:
            cpu: 100m
            memory: 400Mi
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /lib/modules
          name: lib-modules
        - mountPath: /sys
          name: tracing
        - mountPath: /proc
          name: proc
        - mountPath: /etc/config
          name: cfm
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      serviceAccountName: kepler-sa
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - hostPath:
          path: /lib/modules
          type: DirectoryOrCreate
        name: lib-modules
      - hostPath:
          path: /sys
          type: Directory
        name: tracing
      - hostPath:
          path: /proc
          type: Directory
        name: proc
      - configMap:
          name: kepler-cfm
        name: cfm
---
Otherwise, the Kepler pod remains stuck in ContainerCreating:
> kubectl describe pod kepler-exporter-8vvtl -n monitoring
Name: kepler-exporter-8vvtl
Namespace: monitoring
Priority: 0
Service Account: kepler-sa
Node: parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time: Fri, 14 Jul 2023 15:13:57 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=5b4d46847b
pod-template-generation=1
sustainable-computing.io/app=kepler
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID:
Image: quay.io/sustainable_computing_io/kepler:latest
Image ID:
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/config from cfm (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hpt5q (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: Directory
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
kube-api-access-hpt5q:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m5s default-scheduler Successfully assigned monitoring/kepler-exporter-8vvtl to parasilo-14.rennes.grid5000.fr
Warning FailedMount 53s (x11 over 7m5s) kubelet MountVolume.SetUp failed for volume "lib-modules" : hostPath type check failed: /lib/modules is not a directory
Warning FailedMount 31s (x3 over 5m3s) kubelet Unable to attach or mount volumes: unmounted volumes=[lib-modules], unattached volumes=[tracing proc cfm kube-api-access-hpt5q lib-modules]: timed out waiting for the condition
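One possible reason for the "is not a directory" check failure (my guess; not verified in the thread) is that /lib/modules on the host is a symlink rather than a plain directory, which a strict hostPath Directory check can reject even though the link resolves to a directory. The distinction is easy to see, demonstrated here on a temp path:

```shell
# A symlink to a directory passes -d (stat follows the link) but is still a
# symlink at the path level, which stricter type checks can refuse.
d=$(mktemp -d)
mkdir "$d/real"
ln -s "$d/real" "$d/link"
[ -d "$d/link" ] && echo "resolves to a directory"
[ -L "$d/link" ] && echo "but the path itself is a symlink"
# On the node: stat -c %F /lib/modules   # "directory" vs "symbolic link"
```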
@andersonandrei that sounds like the problem :D
In this case, /lib/modules/4.19.0-24-amd64/build is a symlink on your host. The Kepler pod cannot see the /usr/src directory on the host, so it cannot find the kernel source. Please use this config as an example to bind mount /usr/src into your Kepler pod. Note that the example mounts /usr/src/kernels, while in your setup the host path is /usr/src.
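The dangling-link behaviour described here can be reproduced in miniature. The paths below are stand-ins created under a temp dir, not the real host paths:

```shell
# Sketch: why a symlinked build dir can "exist" on the host yet be missing in
# a container that bind-mounts only /lib/modules.
root=$(mktemp -d)
mkdir -p "$root/usr/src/linux-headers-4.19.0-24-amd64"
mkdir -p "$root/lib/modules/4.19.0-24-amd64"
ln -s "$root/usr/src/linux-headers-4.19.0-24-amd64" \
      "$root/lib/modules/4.19.0-24-amd64/build"

# On the "host" the link resolves:
readlink -e "$root/lib/modules/4.19.0-24-amd64/build" >/dev/null && echo "host: ok"

# Simulate a container without the /usr/src mount: the link target is gone,
# so the link dangles and the kernel source cannot be found.
rm -r "$root/usr/src"
readlink -e "$root/lib/modules/4.19.0-24-amd64/build" >/dev/null || echo "container: dangling"
```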
@rootfs,
I just added /usr/src/kernels and /sys/kernel/debug, but again I needed to use DirectoryOrCreate, otherwise the pod did not start:
> kubectl describe pod kepler-exporter-f2lbq -n monitoring
Name: kepler-exporter-f2lbq
Namespace: monitoring
Priority: 0
Service Account: kepler-sa
Node: parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time: Fri, 14 Jul 2023 16:00:04 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=765cd98545
pod-template-generation=3
sustainable-computing.io/app=kepler
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID:
Image: quay.io/sustainable_computing_io/kepler:latest
Image ID:
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/config from cfm (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/sys/kernel/debug from kernel-debug (rw)
/usr/src/kernels from kernel-src (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hmtdw (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kernel-debug:
Type: HostPath (bare host directory volume)
Path: /sys/kernel/debug
HostPathType: Directory
kernel-src:
Type: HostPath (bare host directory volume)
Path: /usr/src/kernels
HostPathType: Directory
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: DirectoryOrCreate
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
kube-api-access-hmtdw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 60s default-scheduler Successfully assigned monitoring/kepler-exporter-f2lbq to parasilo-14.rennes.grid5000.fr
Warning FailedMount 29s (x7 over 60s) kubelet MountVolume.SetUp failed for volume "kernel-src" : hostPath type check failed: /usr/src/kernels is not a directory
However, even when the pod is running, the logs do not look good:
> kubectl logs kepler-exporter-jdh4c -n monitoring
I0714 14:02:12.859441 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0714 14:02:12.865879 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0714 14:02:12.873527 1 exporter.go:151] Kepler running on version: eba46bc
I0714 14:02:12.873554 1 config.go:212] using gCgroup ID in the BPF program: true
I0714 14:02:12.873575 1 config.go:214] kernel version: 4.19
I0714 14:02:12.873594 1 exporter.go:171] EnabledBPFBatchDelete: true
I0714 14:02:12.873680 1 power.go:53] use sysfs to obtain power
I0714 14:02:12.982707 1 watcher.go:67] Using in cluster k8s config
W0714 14:02:12.989342 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:12.989402 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:14.275674 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:14.275742 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:17.382623 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:17.382663 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:22.712307 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:22.712355 1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
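The "pods is forbidden" messages look like missing RBAC for the kepler-sa ServiceAccount rather than an eBPF problem. Below is a sketch of the kind of ClusterRole/ClusterRoleBinding that would grant the list/watch access the watcher needs; the object names are made up here, and the upstream manifests ship their own RBAC objects:

```shell
# Sketch, not the upstream manifest: grant kepler-sa cluster-scope read access
# to pods (and nodes), which the watcher log says it is missing.
f=$(mktemp)
cat > "$f" <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kepler-clusterrole
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kepler-clusterrole-binding
subjects:
- kind: ServiceAccount
  name: kepler-sa
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: kepler-clusterrole
  apiGroup: rbac.authorization.k8s.io
EOF
echo "wrote $f"
# To apply: kubectl apply -f "$f"
```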
The queries do not work:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total
command terminated with exit code 7
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics"
curl: (7) Failed to connect to localhost port 9102: Connection refused
command terminated with exit code 7
Then I checked again the commands you asked me to run earlier:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/4.19.0-24-amd64; cd /lib/modules/4.19.0-24-amd64/build; ls"
total 4484
lrwxrwxrwx 1 root root 38 Apr 29 20:07 build -> /usr/src/linux-headers-4.19.0-24-amd64
drwxr-xr-x 12 root root 4096 May 16 12:22 kernel
-rw-r--r-- 1 root root 1143894 Jul 14 10:19 modules.alias
-rw-r--r-- 1 root root 1092710 Jul 14 10:19 modules.alias.bin
-rw-r--r-- 1 root root 4683 Apr 29 20:07 modules.builtin
-rw-r--r-- 1 root root 5999 Jul 14 10:19 modules.builtin.bin
-rw-r--r-- 1 root root 436046 Jul 14 10:19 modules.dep
-rw-r--r-- 1 root root 594001 Jul 14 10:19 modules.dep.bin
-rw-r--r-- 1 root root 456 Jul 14 10:19 modules.devname
-rw-r--r-- 1 root root 140020 Apr 29 20:07 modules.order
-rw-r--r-- 1 root root 876 Jul 14 10:19 modules.softdep
-rw-r--r-- 1 root root 507648 Jul 14 10:19 modules.symbols
-rw-r--r-- 1 root root 626742 Jul 14 10:19 modules.symbols.bin
lrwxrwxrwx 1 root root 39 Apr 29 20:07 source -> /usr/src/linux-headers-4.19.0-24-common
drwxr-xr-x 3 root root 4096 Jul 14 10:19 updates
bash: line 1: cd: /lib/modules/4.19.0-24-amd64/build: No such file or directory
NGC-DL-CONTAINER-LICENSE afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/`uname -r`/build; cd /lib/modules/`uname -r`/; ls"
ls: cannot access '/lib/modules/5.10.0-23-amd64/build': No such file or directory
bash: line 1: cd: /lib/modules/5.10.0-23-amd64/: No such file or directory
NGC-DL-CONTAINER-LICENSE afs bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
And I can see that the pods are now in CrashLoopBackOff:
monitoring kepler-exporter-jdh4c 0/1 CrashLoopBackOff 4 (83s ago) 8m1s
monitoring kepler-exporter-rs6lg 0/1 CrashLoopBackOff 4 (79s ago) 8m1s
> kubectl describe pod kepler-exporter-rs6lg -n monitoring
Name: kepler-exporter-rs6lg
Namespace: monitoring
Priority: 0
Service Account: kepler-sa
Node: parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time: Fri, 14 Jul 2023 16:02:09 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=54cf48cf9
pod-template-generation=4
sustainable-computing.io/app=kepler
Annotations: cni.projectcalico.org/containerID: d3ffb8f7025940f7f3a06d49b68e81276c4be9cda0afa9ce67aaf6090c7eb49e
cni.projectcalico.org/podIP: 10.42.2.20/32
cni.projectcalico.org/podIPs: 10.42.2.20/32
Status: Running
IP: 10.42.2.20
IPs:
IP: 10.42.2.20
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID: docker://87c70c90e955fa59301770ef9e139aca051e7f0c6358bf87b6f98ac62fdec52f
Image: quay.io/sustainable_computing_io/kepler:latest
Image ID: docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:7a3c21442015f0ce471aefc8425268d384eae9651f9d2543a8ec4b60be59b3d6
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Fri, 14 Jul 2023 16:07:51 +0200
Finished: Fri, 14 Jul 2023 16:08:51 +0200
Ready: False
Restart Count: 4
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/config from cfm (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/sys/kernel/debug from kernel-debug (rw)
/usr/src/kernels from kernel-src (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2qns (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kernel-debug:
Type: HostPath (bare host directory volume)
Path: /sys/kernel/debug
HostPathType: DirectoryOrCreate
kernel-src:
Type: HostPath (bare host directory volume)
Path: /usr/src/kernels
HostPathType: DirectoryOrCreate
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: DirectoryOrCreate
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
kube-api-access-p2qns:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m34s default-scheduler Successfully assigned monitoring/kepler-exporter-rs6lg to parasilo-14.rennes.grid5000.fr
Normal Pulled 7m31s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.659839237s
Normal Pulled 6m28s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.648322644s
Normal Pulled 5m14s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.708725285s
Warning Unhealthy 4m34s (x3 over 6m34s) kubelet Liveness probe failed: Get "http://10.42.2.20:9102/healthz": dial tcp 10.42.2.20:9102: connect: connection refused
Normal Pulling 3m44s (x4 over 7m33s) kubelet Pulling image "quay.io/sustainable_computing_io/kepler:latest"
Normal Pulled 3m43s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.658658027s
Normal Created 3m42s (x4 over 7m31s) kubelet Created container kepler-exporter
Normal Started 3m42s (x4 over 7m31s) kubelet Started container kepler-exporter
Warning BackOff 2m9s (x7 over 5m26s) kubelet Back-off restarting failed container
Are you using the latest manifests?
I built this manifest with make build-manifest OPTS="PROMETHEUS_DEPLOY" using the latest version of the repository and the latest image. For the last logs, I just modified it to add the /usr/src/kernels and /sys/kernel/debug entries as suggested.
I'm getting a similar error. See the log below:
I0905 12:47:40.674692 1 bcc_attacher.go:253] could not delete bpf table elements, err: Table.Delete: key 0x0: no such file or directory
Can anyone help me?
And I can see that the pods now have a lot of CrashLoopBackOff:
monitoring kepler-exporter-jdh4c 0/1 CrashLoopBackOff 4 (83s ago) 8m1s monitoring kepler-exporter-rs6lg 0/1 CrashLoopBackOff 4 (79s ago) 8m1s ~/kepler main ?1 kube local adasilva@frennes 16:09:28 > kubectl describe pod kepler-exporter-rs6lg -n monitoring Name: kepler-exporter-rs6lg Namespace: monitoring Priority: 0 Service Account: kepler-sa Node: parasilo-14.rennes.grid5000.fr/172.16.97.14 Start Time: Fri, 14 Jul 2023 16:02:09 +0200 Labels: app.kubernetes.io/component=exporter app.kubernetes.io/name=kepler-exporter controller-revision-hash=54cf48cf9 pod-template-generation=4 sustainable-computing.io/app=kepler Annotations: cni.projectcalico.org/containerID: d3ffb8f7025940f7f3a06d49b68e81276c4be9cda0afa9ce67aaf6090c7eb49e cni.projectcalico.org/podIP: 10.42.2.20/32 cni.projectcalico.org/podIPs: 10.42.2.20/32 Status: Running IP: 10.42.2.20 IPs: IP: 10.42.2.20 Controlled By: DaemonSet/kepler-exporter Containers: kepler-exporter: Container ID: docker://87c70c90e955fa59301770ef9e139aca051e7f0c6358bf87b6f98ac62fdec52f Image: quay.io/sustainable_computing_io/kepler:latest Image ID: docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:7a3c21442015f0ce471aefc8425268d384eae9651f9d2543a8ec4b60be59b3d6 Port: 9102/TCP Host Port: 0/TCP Command: /bin/sh -c Args: /usr/bin/kepler -v=1 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 255 Started: Fri, 14 Jul 2023 16:07:51 +0200 Finished: Fri, 14 Jul 2023 16:08:51 +0200 Ready: False Restart Count: 4 Requests: cpu: 100m memory: 400Mi Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5 Environment: NODE_NAME: (v1:spec.nodeName) Mounts: /etc/config from cfm (ro) /lib/modules from lib-modules (rw) /proc from proc (rw) /sys from tracing (rw) /sys/kernel/debug from kernel-debug (rw) /usr/src/kernels from kernel-src (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2qns (ro) Conditions: Type Status Initialized True Ready 
False ContainersReady False PodScheduled True Volumes: kernel-debug: Type: HostPath (bare host directory volume) Path: /sys/kernel/debug HostPathType: DirectoryOrCreate kernel-src: Type: HostPath (bare host directory volume) Path: /usr/src/kernels HostPathType: DirectoryOrCreate lib-modules: Type: HostPath (bare host directory volume) Path: /lib/modules HostPathType: DirectoryOrCreate tracing: Type: HostPath (bare host directory volume) Path: /sys HostPathType: Directory proc: Type: HostPath (bare host directory volume) Path: /proc HostPathType: Directory cfm: Type: ConfigMap (a volume populated by a ConfigMap) Name: kepler-cfm Optional: false kube-api-access-p2qns: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true QoS Class: Burstable Node-Selectors: <none> Tolerations: node-role.kubernetes.io/master:NoSchedule node.kubernetes.io/disk-pressure:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists node.kubernetes.io/pid-pressure:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists node.kubernetes.io/unschedulable:NoSchedule op=Exists Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 7m34s default-scheduler Successfully assigned monitoring/kepler-exporter-rs6lg to parasilo-14.rennes.grid5000.fr Normal Pulled 7m31s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.659839237s Normal Pulled 6m28s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.648322644s Normal Pulled 5m14s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.708725285s Warning Unhealthy 4m34s (x3 over 6m34s) kubelet Liveness probe failed: Get "http://10.42.2.20:9102/healthz": dial tcp 10.42.2.20:9102: connect: connection refused Normal 
Pulling 3m44s (x4 over 7m33s) kubelet Pulling image "quay.io/sustainable_computing_io/kepler:latest" Normal Pulled 3m43s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.658658027s Normal Created 3m42s (x4 over 7m31s) kubelet Created container kepler-exporter Normal Started 3m42s (x4 over 7m31s) kubelet Started container kepler-exporter Warning BackOff 2m9s (x7 over 5m26s) kubelet Back-off restarting failed container
@andersonandrei The problem is RBAC or scc issue. Are you using openshift-based cluster? If you build manifest from the command it should have permission check clusterrole. If your cluster is based on openshift, you need to add option
OPENSHIFT_DEPLOY
to bind user to scc.
reference: https://sustainable-computing.io/installation/kepler/
@rootfs, I just added the /usr/src/kernels and the /sys/kernel/debug mounts, but again I needed to use DirectoryOrCreate, otherwise the pod did not initiate:
> kubectl describe pod kepler-exporter-f2lbq -n monitoring
Name:             kepler-exporter-f2lbq
Namespace:        monitoring
Priority:         0
Service Account:  kepler-sa
Node:             parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time:       Fri, 14 Jul 2023 16:00:04 +0200
Labels:           app.kubernetes.io/component=exporter
                  app.kubernetes.io/name=kepler-exporter
                  controller-revision-hash=765cd98545
                  pod-template-generation=3
                  sustainable-computing.io/app=kepler
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    DaemonSet/kepler-exporter
Containers:
  kepler-exporter:
    Container ID:
    Image:         quay.io/sustainable_computing_io/kepler:latest
    Image ID:
    Port:          9102/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
    Args:
      /usr/bin/kepler -v=1
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/config from cfm (ro)
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /sys/kernel/debug from kernel-debug (rw)
      /usr/src/kernels from kernel-src (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hmtdw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kernel-debug:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:  Directory
  kernel-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src/kernels
    HostPathType:  Directory
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
  kube-api-access-hmtdw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    60s                default-scheduler  Successfully assigned monitoring/kepler-exporter-f2lbq to parasilo-14.rennes.grid5000.fr
  Warning  FailedMount  29s (x7 over 60s)  kubelet            MountVolume.SetUp failed for volume "kernel-src" : hostPath type check failed: /usr/src/kernels is not a directory
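As a side note on why the hostPath type matters here: with type Directory the kubelet refuses to mount when the path is missing on the host (the FailedMount event above and the pod stays Pending), while DirectoryOrCreate silently creates an empty directory, which lets the pod start but can hide the fact that the kernel sources are actually absent. A minimal sketch of the two variants (the kernel-src volume name follows the manifest discussed here):

```yaml
volumes:
  # Fails the mount (pod stays Pending) if the path is absent on the host:
  - name: kernel-src
    hostPath:
      path: /usr/src/kernels
      type: Directory
  # Always mounts, creating an empty directory on the host if needed:
  - name: kernel-src
    hostPath:
      path: /usr/src/kernels
      type: DirectoryOrCreate
```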
However, even when the pod is running, the logs do not look good:
> kubectl logs kepler-exporter-jdh4c -n monitoring
I0714 14:02:12.859441       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0714 14:02:12.865879       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0714 14:02:12.873527       1 exporter.go:151] Kepler running on version: eba46bc
I0714 14:02:12.873554       1 config.go:212] using gCgroup ID in the BPF program: true
I0714 14:02:12.873575       1 config.go:214] kernel version: 4.19
I0714 14:02:12.873594       1 exporter.go:171] EnabledBPFBatchDelete: true
I0714 14:02:12.873680       1 power.go:53] use sysfs to obtain power
I0714 14:02:12.982707       1 watcher.go:67] Using in cluster k8s config
W0714 14:02:12.989342       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:12.989402       1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:14.275674       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:14.275742       1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:17.382623       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:17.382663       1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
W0714 14:02:22.712307       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
E0714 14:02:22.712355       1 reflector.go:140] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: Failed to watch <unspecified>: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
The queries do not work:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total
command terminated with exit code 7
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics"
curl: (7) Failed to connect to localhost port 9102: Connection refused
command terminated with exit code 7
Then I re-ran the commands you asked about earlier:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/4.19.0-24-amd64; cd /lib/modules/4.19.0-24-amd64/build; ls"
total 4484
lrwxrwxrwx  1 root root      38 Apr 29 20:07 build -> /usr/src/linux-headers-4.19.0-24-amd64
drwxr-xr-x 12 root root    4096 May 16 12:22 kernel
-rw-r--r--  1 root root 1143894 Jul 14 10:19 modules.alias
-rw-r--r--  1 root root 1092710 Jul 14 10:19 modules.alias.bin
-rw-r--r--  1 root root    4683 Apr 29 20:07 modules.builtin
-rw-r--r--  1 root root    5999 Jul 14 10:19 modules.builtin.bin
-rw-r--r--  1 root root  436046 Jul 14 10:19 modules.dep
-rw-r--r--  1 root root  594001 Jul 14 10:19 modules.dep.bin
-rw-r--r--  1 root root     456 Jul 14 10:19 modules.devname
-rw-r--r--  1 root root  140020 Apr 29 20:07 modules.order
-rw-r--r--  1 root root     876 Jul 14 10:19 modules.softdep
-rw-r--r--  1 root root  507648 Jul 14 10:19 modules.symbols
-rw-r--r--  1 root root  626742 Jul 14 10:19 modules.symbols.bin
lrwxrwxrwx  1 root root      39 Apr 29 20:07 source -> /usr/src/linux-headers-4.19.0-24-common
drwxr-xr-x  3 root root    4096 Jul 14 10:19 updates
bash: line 1: cd: /lib/modules/4.19.0-24-amd64/build: No such file or directory
NGC-DL-CONTAINER-LICENSE  afs  bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "ls -l /lib/modules/`uname -r`/build; cd /lib/modules/`uname -r`/; ls"
ls: cannot access '/lib/modules/5.10.0-23-amd64/build': No such file or directory
bash: line 1: cd: /lib/modules/5.10.0-23-amd64/: No such file or directory
NGC-DL-CONTAINER-LICENSE  afs  bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
And I can see that the pods are now in CrashLoopBackOff:
monitoring   kepler-exporter-jdh4c   0/1   CrashLoopBackOff   4 (83s ago)   8m1s
monitoring   kepler-exporter-rs6lg   0/1   CrashLoopBackOff   4 (79s ago)   8m1s
> kubectl describe pod kepler-exporter-rs6lg -n monitoring
Name:             kepler-exporter-rs6lg
Namespace:        monitoring
Priority:         0
Service Account:  kepler-sa
Node:             parasilo-14.rennes.grid5000.fr/172.16.97.14
Start Time:       Fri, 14 Jul 2023 16:02:09 +0200
Labels:           app.kubernetes.io/component=exporter
                  app.kubernetes.io/name=kepler-exporter
                  controller-revision-hash=54cf48cf9
                  pod-template-generation=4
                  sustainable-computing.io/app=kepler
Annotations:      cni.projectcalico.org/containerID: d3ffb8f7025940f7f3a06d49b68e81276c4be9cda0afa9ce67aaf6090c7eb49e
                  cni.projectcalico.org/podIP: 10.42.2.20/32
                  cni.projectcalico.org/podIPs: 10.42.2.20/32
Status:           Running
IP:               10.42.2.20
IPs:
  IP:           10.42.2.20
Controlled By:  DaemonSet/kepler-exporter
Containers:
  kepler-exporter:
    Container ID:  docker://87c70c90e955fa59301770ef9e139aca051e7f0c6358bf87b6f98ac62fdec52f
    Image:         quay.io/sustainable_computing_io/kepler:latest
    Image ID:      docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:7a3c21442015f0ce471aefc8425268d384eae9651f9d2543a8ec4b60be59b3d6
    Port:          9102/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
    Args:
      /usr/bin/kepler -v=1
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 14 Jul 2023 16:07:51 +0200
      Finished:     Fri, 14 Jul 2023 16:08:51 +0200
    Ready:          False
    Restart Count:  4
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/config from cfm (ro)
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /sys/kernel/debug from kernel-debug (rw)
      /usr/src/kernels from kernel-src (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2qns (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kernel-debug:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:  DirectoryOrCreate
  kernel-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src/kernels
    HostPathType:  DirectoryOrCreate
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
  kube-api-access-p2qns:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m34s                  default-scheduler  Successfully assigned monitoring/kepler-exporter-rs6lg to parasilo-14.rennes.grid5000.fr
  Normal   Pulled     7m31s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.659839237s
  Normal   Pulled     6m28s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.648322644s
  Normal   Pulled     5m14s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.708725285s
  Warning  Unhealthy  4m34s (x3 over 6m34s)  kubelet            Liveness probe failed: Get "http://10.42.2.20:9102/healthz": dial tcp 10.42.2.20:9102: connect: connection refused
  Normal   Pulling    3m44s (x4 over 7m33s)  kubelet            Pulling image "quay.io/sustainable_computing_io/kepler:latest"
  Normal   Pulled     3m43s                  kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.658658027s
  Normal   Created    3m42s (x4 over 7m31s)  kubelet            Created container kepler-exporter
  Normal   Started    3m42s (x4 over 7m31s)  kubelet            Started container kepler-exporter
  Warning  BackOff    2m9s (x7 over 5m26s)   kubelet            Back-off restarting failed container
@andersonandrei The problem is an RBAC or SCC issue. Are you using an OpenShift-based cluster? If you build the manifest from the command, it should include the permission-check clusterrole. If your cluster is based on OpenShift, you need to add the option
OPENSHIFT_DEPLOY
to bind the user to the SCC.
Reference: https://sustainable-computing.io/installation/kepler/
@sunya-ch No, I'm not using an OpenShift-based cluster. Do you have any thoughts on how I can fix this RBAC or SCC problem in this case?
Thanks!
@andersonandrei Could you share the result of
kubectl get clusterrole kepler-clusterrole -o yaml
The pods resource should have been added to the resources list by https://github.com/sustainable-computing-io/kepler/commit/bc981ede83b3fdcaf01fa745ec68e7fe6dea405c for the apiserver update.
If pods is not there, you can just manually add it to the list and restart the pod.
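In case the rule does need to be added by hand, the edited section of the clusterrole would look roughly like this (a sketch matching the rules Kepler's manifest generates; only the pods entry would be the addition):

```yaml
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/proxy
  - nodes/stats
  - pods   # without this, the watcher gets "pods is forbidden" at cluster scope
  verbs:
  - get
  - watch
  - list
```

The edit can be made in place with kubectl edit clusterrole kepler-clusterrole, then restart the exporter pods as suggested above.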
@sunya-ch, here is the output of the command:
kubectl get clusterrole kepler-clusterrole -oyaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"sustainable-computing.io/app":"kepler"},"name":"kepler-clusterrole"},"rules":[{"apiGroups":[""],"resources":["nodes/metrics","nodes/proxy","nodes/stats","pods"],"verbs":["get","watch","list"]}]}
creationTimestamp: "2023-09-14T09:37:35Z"
labels:
sustainable-computing.io/app: kepler
name: kepler-clusterrole
resourceVersion: "2314"
uid: 47353cf2-9466-457d-b4eb-71449333fe83
rules:
- apiGroups:
- ""
resources:
- nodes/metrics
- nodes/proxy
- nodes/stats
- pods
verbs:
- get
- watch
- list
I just tried again, using both the latest and latest-libbpf images, and the problem persists :(
@andersonandrei can you share your yaml? or can you try this (generated from main branch)
kubectl apply -f https://gist.githubusercontent.com/rootfs/24f30eec07d955df9da1b10f8d403d8d/raw/98ab2d2ab93e85a1ed0d3f921f40a2ef013babcd/kepler-eks-bm.yaml
@rootfs, here is the file I'm using:
apiVersion: v1
kind: Namespace
metadata:
labels:
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/warn: privileged
security.openshift.io/scc.podSecurityLabelSync: "false"
sustainable-computing.io/app: kepler
name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
sustainable-computing.io/app: kepler
name: kepler-sa
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
labels:
app.kubernetes.io/component: prometheus
app.kubernetes.io/instance: k8s
app.kubernetes.io/name: prometheus
sustainable-computing.io/app: kepler
name: prometheus-k8s
namespace: monitoring
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- ingresses
verbs:
- get
- list
- watch
- apiGroups:
- networking.k8s.io
resources:
- ingresses
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
sustainable-computing.io/app: kepler
name: kepler-clusterrole-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kepler-clusterrole
subjects:
- kind: ServiceAccount
name: kepler-sa
namespace: monitoring
---
apiVersion: v1
data:
BIND_ADDRESS: 0.0.0.0:9102
CGROUP_METRICS: '*'
CPU_ARCH_OVERRIDE: ""
ENABLE_EBPF_CGROUPID: "true"
ENABLE_GPU: "true"
ENABLE_PROCESS_METRICS: "false"
EXPOSE_CGROUP_METRICS: "true"
EXPOSE_HW_COUNTER_METRICS: "true"
EXPOSE_IRQ_COUNTER_METRICS: "true"
EXPOSE_KUBELET_METRICS: "true"
KEPLER_LOG_LEVEL: "1"
KEPLER_NAMESPACE: monitoring
METRIC_PATH: /metrics
MODEL_CONFIG: |
CONTAINER_COMPONENTS_ESTIMATOR=false
# by default we use buildin weight file
# CONTAINER_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json
kind: ConfigMap
metadata:
labels:
sustainable-computing.io/app: kepler
name: kepler-cfm
namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
name: kepler-exporter
namespace: monitoring
spec:
clusterIP: None
ports:
- name: http
port: 9102
targetPort: http
selector:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
labels:
sustainable-computing.io/app: kepler
name: kepler-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
template:
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
spec:
containers:
- args:
- /usr/bin/kepler -v=1 -kernel-source-dir=/usr/share/kepler/kernel_sources
command:
- /bin/sh
- -c
env:
- name: NODE_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
image: quay.io/sustainable_computing_io/kepler:latest
imagePullPolicy: Always
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: 9102
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 60
successThreshold: 1
timeoutSeconds: 10
name: kepler-exporter
ports:
- containerPort: 9102
name: http
resources:
requests:
cpu: 100m
memory: 400Mi
securityContext:
privileged: true
volumeMounts:
- mountPath: /lib/modules
name: lib-modules
- mountPath: /sys
name: tracing
- mountPath: /proc
name: proc
- mountPath: /etc/kepler/kepler.config
name: cfm
readOnly: true
dnsPolicy: ClusterFirstWithHostNet
serviceAccountName: kepler-sa
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
volumes:
- hostPath:
path: /lib/modules
type: DirectoryOrCreate
name: lib-modules
- hostPath:
path: /sys
type: Directory
name: tracing
- hostPath:
path: /proc
type: Directory
name: proc
- configMap:
name: kepler-cfm
name: cfm
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
sustainable-computing.io/app: kepler
name: kepler-exporter
namespace: monitoring
spec:
endpoints:
- interval: 3s
port: http
relabelings:
- action: replace
regex: (.*)
replacement: $1
sourceLabels:
- __meta_kubernetes_pod_node_name
targetLabel: instance
scheme: http
jobLabel: app.kubernetes.io/name
namespaceSelector:
matchNames:
- monitoring
selector:
matchLabels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
@andersonandrei can you share your yaml? or can you try this (generated from main branch)
kubectl apply -f https://gist.githubusercontent.com/rootfs/24f30eec07d955df9da1b10f8d403d8d/raw/98ab2d2ab93e85a1ed0d3f921f40a2ef013babcd/kepler-eks-bm.yaml
It's still the same for me :(
Describing the pod:
> kubectl describe pod kepler-exporter-fx9w8 -n kepler
Name: kepler-exporter-fx9w8
Namespace: kepler
Priority: 0
Service Account: kepler-sa
Node: troll-3.grenoble.grid5000.fr/172.16.22.3
Start Time: Thu, 14 Sep 2023 16:08:00 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=694f8b95f9
pod-template-generation=1
sustainable-computing.io/app=kepler
Annotations: cni.projectcalico.org/containerID: ee33c9db3edec2668ba14f8ec26f528465c69b0ad89b83d85bc653b14da4222f
cni.projectcalico.org/podIP: 10.42.1.34/32
cni.projectcalico.org/podIPs: 10.42.1.34/32
Status: Running
IP: 10.42.1.34
IPs:
IP: 10.42.1.34
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID: docker://5a470911f3e43b8ce3a46b565807295a2f6a9224ab97c60d301937fadde285da
Image: quay.io/sustainable_computing_io/kepler:latest
Image ID: docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:ba56b57466790a2dfb785e4017715bfc3d7c46f059afc029d3e7b03511d69eef
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1 -kernel-source-dir=/usr/share/kepler/kernel_sources -redfish-cred-file-path=/etc/redfish/redfish.csv
State: Running
Started: Thu, 14 Sep 2023 16:08:02 +0200
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_IP: (v1:status.hostIP)
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/kepler/kepler.config from cfm (ro)
/etc/redfish from redfish (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qglfh (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: Directory
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
redfish:
Type: Secret (a volume populated by a Secret)
SecretName: redfish-4kh9d7bc7m
Optional: false
kube-api-access-qglfh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 87s default-scheduler Successfully assigned kepler/kepler-exporter-fx9w8 to troll-3.grenoble.grid5000.fr
Normal Pulling 87s kubelet Pulling image "quay.io/sustainable_computing_io/kepler:latest"
Normal Pulled 85s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 1.768524217s
Normal Created 85s kubelet Created container kepler-exporter
Normal Started 85s kubelet Started container kepler-exporter
Logs:
> kubectl logs kepler-exporter-fx9w8 -n kepler
I0914 14:08:02.838663 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0914 14:08:02.845285 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0914 14:08:02.853165 1 exporter.go:158] Kepler running on version: 5f33240
I0914 14:08:02.853179 1 config.go:272] using gCgroup ID in the BPF program: true
I0914 14:08:02.853198 1 config.go:274] kernel version: 4.19
I0914 14:08:02.853310 1 config.go:299] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0914 14:08:02.853316 1 exporter.go:170] LibbpfBuilt: false, BccBuilt: true
I0914 14:08:02.853348 1 config.go:205] kernel source dir is set to /usr/share/kepler/kernel_sources
I0914 14:08:02.853406 1 exporter.go:189] EnabledBPFBatchDelete: true
I0914 14:08:02.853430 1 power.go:54] use sysfs to obtain power
I0914 14:08:02.853452 1 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I0914 14:08:02.856971 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0914 14:08:02.883220 1 exporter.go:204] Initializing the GPU collector
I0914 14:08:08.888921 1 watcher.go:66] Using in cluster k8s config
modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.19.0-25-amd64
chdir(/lib/modules/4.19.0-25-amd64/build): No such file or directory
I0914 14:08:09.019217 1 bcc_attacher.go:80] failed to attach the bpf program: <nil>
I0914 14:08:09.019250 1 bcc_attacher.go:159] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to attach the bpf program: <nil>, from default kernel source.
I0914 14:08:09.019289 1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64
bpf: Failed to load program: Invalid argument
I0914 14:08:09.647169 1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64"
I0914 14:08:09.647198 1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64
bpf: Failed to load program: Invalid argument
I0914 14:08:10.213073 1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64"
I0914 14:08:10.213115 1 bcc_attacher.go:174] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 14:08:10.213184 1 exporter.go:241] failed to start : failed to attach bpf assets: failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 14:08:10.213274 1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0914 14:08:10.213289 1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0914 14:08:10.213297 1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0914 14:08:10.213321 1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0914 14:08:10.213497 1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
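The `modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.19.0-25-amd64` line suggests the headers present on the node do not match the *running* kernel (a kernel upgrade can leave headers behind for an older version). A hedged sketch for checking this on the node itself (the package name is assumed for Debian; adjust for your distribution):

```shell
# Check that the headers/build dir exists for the kernel that is actually running.
KVER="$(uname -r)"
if [ -d "/lib/modules/${KVER}/build" ]; then
  echo "headers present for ${KVER}"
else
  # Debian-style package name assumed; adjust for your distro.
  echo "headers missing for ${KVER} -- try: apt-get install linux-headers-${KVER}"
fi
```

Note this must be run on the host (e.g. via `nsenter --mount=/proc/1/ns/mnt` as shown earlier in this thread), not inside the Kepler container.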
Retrieving info:
> kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total
# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{command="",container_id="1e6d5a6a71c93151b7a2e5a04a25debd10ad87bee0cc80f683ab094afdb15818",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-10-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="1e6d5a6a71c93151b7a2e5a04a25debd10ad87bee0cc80f683ab094afdb15818",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-10-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="d241eb6b91d008a94c63ef826edf23b47f60648a634e8d608eac18692fddb567",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-9-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="d241eb6b91d008a94c63ef826edf23b47f60648a634e8d608eac18692fddb567",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-9-guest-linpack"} 0
@andersonandrei can you share your yaml? or can you try this (generated from main branch)
kubectl apply -f https://gist.githubusercontent.com/rootfs/24f30eec07d955df9da1b10f8d403d8d/raw/98ab2d2ab93e85a1ed0d3f921f40a2ef013babcd/kepler-eks-bm.yaml
Agreed, I will try this yaml.
RBAC looks good to me... Before changing the yaml, could you also share the results of the following commands, to confirm the RBAC setup?
kubectl get clusterrolebinding kepler-clusterrole-binding -oyaml
kubectl auth can-i list pod --as system:serviceaccount:monitoring:kepler-sa
Is the log below still showing?
W0714 14:02:12.989342 1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope
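For reference, the `pods is forbidden ... at the cluster scope` warning means the `kepler-sa` service account lacks cluster-scoped list permission. A minimal RBAC sketch that would grant it (names assumed to match this thread's manifest; the real Kepler manifest grants additional resources such as nodes and metrics):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kepler-clusterrole
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kepler-clusterrole-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kepler-clusterrole
subjects:
- kind: ServiceAccount
  name: kepler-sa
  namespace: monitoring
```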
@sunya-ch , I just changed back to my original yaml to do the checks you suggested. Here are the tests:
clusterrolebinding:
> kubectl get clusterrolebinding kepler-clusterrole-binding -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"labels":{"sustainable-computing.io/app":"kepler"},"name":"kepler-clusterrole-binding"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"kepler-clusterrole"},"subjects":[{"kind":"ServiceAccount","name":"kepler-sa","namespace":"monitoring"}]}
creationTimestamp: "2023-09-14T15:40:08Z"
labels:
sustainable-computing.io/app: kepler
name: kepler-clusterrole-binding
resourceVersion: "33876"
uid: d76b8c91-277a-471a-8848-5aa69fb92cce
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kepler-clusterrole
subjects:
- kind: ServiceAccount
name: kepler-sa
namespace: monitoring
Authorization:
> kubectl auth can-i list pod --as system:serviceaccount:monitoring:kepler-sa
yes
The Kepler logs no longer show the message 'cannot list resource "pods"'. Here are the full logs:
> kubectl logs kepler-exporter-rtcb2 -n monitoring
I0914 15:40:11.639224 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0914 15:40:11.646136 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0914 15:40:11.652823 1 exporter.go:158] Kepler running on version: 5f33240
I0914 15:40:11.652836 1 config.go:272] using gCgroup ID in the BPF program: true
I0914 15:40:11.652856 1 config.go:274] kernel version: 4.19
I0914 15:40:11.652907 1 config.go:299] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0914 15:40:11.652912 1 exporter.go:170] LibbpfBuilt: false, BccBuilt: true
I0914 15:40:11.652932 1 config.go:205] kernel source dir is set to /usr/share/kepler/kernel_sources
I0914 15:40:11.652980 1 exporter.go:189] EnabledBPFBatchDelete: true
I0914 15:40:11.653014 1 power.go:54] use sysfs to obtain power
I0914 15:40:11.653022 1 redfish.go:169] failed to get redfish credential file path
I0914 15:40:11.656419 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0914 15:40:11.680453 1 exporter.go:204] Initializing the GPU collector
I0914 15:40:17.686068 1 watcher.go:66] Using in cluster k8s config
modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.19.0-25-amd64
chdir(/lib/modules/4.19.0-25-amd64/build): No such file or directory
I0914 15:40:17.814031 1 bcc_attacher.go:80] failed to attach the bpf program: <nil>
I0914 15:40:17.814043 1 bcc_attacher.go:159] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to attach the bpf program: <nil>, from default kernel source.
I0914 15:40:17.814053 1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64
bpf: Failed to load program: Invalid argument
I0914 15:40:18.419320 1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64"
I0914 15:40:18.419346 1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64
bpf: Failed to load program: Invalid argument
I0914 15:40:18.980164 1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64"
I0914 15:40:18.980215 1 bcc_attacher.go:174] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 15:40:18.980246 1 exporter.go:241] failed to start : failed to attach bpf assets: failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 15:40:18.980393 1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0914 15:40:18.980403 1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0914 15:40:18.980430 1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0914 15:40:18.980445 1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0914 15:40:18.980608 1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0914 15:40:18.980798 1 exporter.go:276] Started Kepler in 7.327992991s
@andersonandrei can you try the kepler:latest-libbpf image?
@rootfs Yes.
Describing the pod:
> kubectl describe pod kepler-exporter-wljzc -n monitoring
Name: kepler-exporter-wljzc
Namespace: monitoring
Priority: 0
Service Account: kepler-sa
Node: troll-3.grenoble.grid5000.fr/172.16.22.3
Start Time: Mon, 18 Sep 2023 14:59:24 +0200
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler-exporter
controller-revision-hash=7f7468c7b
pod-template-generation=1
sustainable-computing.io/app=kepler
Annotations: cni.projectcalico.org/containerID: 8f3c015bd37497a3bdf2543da39f4dfca054dc6bbca6dc88dcdde4a5108334d9
cni.projectcalico.org/podIP: 10.42.1.28/32
cni.projectcalico.org/podIPs: 10.42.1.28/32
Status: Running
IP: 10.42.1.28
IPs:
IP: 10.42.1.28
Controlled By: DaemonSet/kepler-exporter
Containers:
kepler-exporter:
Container ID: docker://9313c27196247028e6e1006f21f073e965606b2f0714630cd292d676a5a4a093
Image: quay.io/sustainable_computing_io/kepler:latest-libbpf
Image ID: docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:b71cfd5f5c291dfc59566b84dda2b7160e06d6b79455842853732ef5ac0a2a2f
Port: 9102/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
/usr/bin/kepler -v=1 -kernel-source-dir=/usr/share/kepler/kernel_sources
State: Running
Started: Mon, 18 Sep 2023 14:59:53 +0200
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 400Mi
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_IP: (v1:status.hostIP)
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/kepler/kepler.config from cfm (ro)
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z8dl8 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: DirectoryOrCreate
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
cfm:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kepler-cfm
Optional: false
kube-api-access-z8dl8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m21s default-scheduler Successfully assigned monitoring/kepler-exporter-wljzc to troll-3.grenoble.grid5000.fr
Normal Pulling 4m21s kubelet Pulling image "quay.io/sustainable_computing_io/kepler:latest-libbpf"
Normal Pulled 3m55s kubelet Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest-libbpf" in 25.653990033s
Normal Created 3m53s kubelet Created container kepler-exporter
Normal Started 3m53s kubelet Started container kepler-exporter
Getting the logs:
> kubectl logs kepler-exporter-wljzc -n monitoring
I0918 12:59:53.958956 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0918 12:59:53.964905 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0918 12:59:53.972969 1 exporter.go:158] Kepler running on version: dfe0145
I0918 12:59:53.972988 1 config.go:272] using gCgroup ID in the BPF program: true
I0918 12:59:53.973005 1 config.go:274] kernel version: 4.19
I0918 12:59:53.973088 1 config.go:299] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0918 12:59:53.973095 1 exporter.go:170] LibbpfBuilt: true, BccBuilt: false
I0918 12:59:53.973102 1 exporter.go:189] EnabledBPFBatchDelete: true
I0918 12:59:53.973141 1 power.go:54] use sysfs to obtain power
I0918 12:59:53.973158 1 redfish.go:169] failed to get redfish credential file path
I0918 12:59:53.977942 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0918 12:59:54.002536 1 exporter.go:204] Initializing the GPU collector
I0918 13:00:00.007965 1 watcher.go:66] Using in cluster k8s config
libbpf: loading /var/lib/kepler/bpfassets/amd64_kepler.bpf.o
libbpf: elf: section(3) tracepoint/sched/sched_switch, size 2456, link 0, flags 6, type=1
libbpf: sec 'tracepoint/sched/sched_switch': found program 'kepler_trace' at insn offset 0 (0 bytes), code size 307 insns (2456 bytes)
libbpf: elf: section(4) .reltracepoint/sched/sched_switch, size 384, link 27, flags 40, type=9
libbpf: elf: section(5) tracepoint/irq/softirq_entry, size 144, link 0, flags 6, type=1
libbpf: sec 'tracepoint/irq/softirq_entry': found program 'kepler_irq_trace' at insn offset 0 (0 bytes), code size 18 insns (144 bytes)
libbpf: elf: section(6) .reltracepoint/irq/softirq_entry, size 16, link 27, flags 40, type=9
libbpf: elf: section(7) .data, size 8, link 0, flags 3, type=1
libbpf: elf: section(8) .maps, size 352, link 0, flags 3, type=1
libbpf: elf: section(9) license, size 4, link 0, flags 3, type=1
libbpf: license of /var/lib/kepler/bpfassets/amd64_kepler.bpf.o is GPL
libbpf: elf: section(18) .BTF, size 5979, link 0, flags 0, type=1
libbpf: elf: section(20) .BTF.ext, size 2120, link 0, flags 0, type=1
libbpf: elf: section(27) .symtab, size 1056, link 1, flags 0, type=2
libbpf: looking for externs among 44 symbols...
libbpf: collected 0 externs total
libbpf: map 'processes': at sec_idx 8, offset 0.
libbpf: map 'processes': found type = 1.
libbpf: map 'processes': found key [6], sz = 4.
libbpf: map 'processes': found value [10], sz = 88.
libbpf: map 'processes': found max_entries = 32768.
libbpf: map 'pid_time': at sec_idx 8, offset 32.
libbpf: map 'pid_time': found type = 1.
libbpf: map 'pid_time': found key [6], sz = 4.
libbpf: map 'pid_time': found value [12], sz = 8.
libbpf: map 'pid_time': found max_entries = 32768.
libbpf: map 'cpu_cycles_hc_reader': at sec_idx 8, offset 64.
libbpf: map 'cpu_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_cycles': at sec_idx 8, offset 96.
libbpf: map 'cpu_cycles': found type = 2.
libbpf: map 'cpu_cycles': found key [6], sz = 4.
libbpf: map 'cpu_cycles': found value [12], sz = 8.
libbpf: map 'cpu_cycles': found max_entries = 128.
libbpf: map 'cpu_ref_cycles_hc_reader': at sec_idx 8, offset 128.
libbpf: map 'cpu_ref_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_ref_cycles': at sec_idx 8, offset 160.
libbpf: map 'cpu_ref_cycles': found type = 2.
libbpf: map 'cpu_ref_cycles': found key [6], sz = 4.
libbpf: map 'cpu_ref_cycles': found value [12], sz = 8.
libbpf: map 'cpu_ref_cycles': found max_entries = 128.
libbpf: map 'cpu_instructions_hc_reader': at sec_idx 8, offset 192.
libbpf: map 'cpu_instructions_hc_reader': found type = 4.
libbpf: map 'cpu_instructions_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_instructions_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_instructions_hc_reader': found max_entries = 128.
libbpf: map 'cpu_instructions': at sec_idx 8, offset 224.
libbpf: map 'cpu_instructions': found type = 2.
libbpf: map 'cpu_instructions': found key [6], sz = 4.
libbpf: map 'cpu_instructions': found value [12], sz = 8.
libbpf: map 'cpu_instructions': found max_entries = 128.
libbpf: map 'cache_miss_hc_reader': at sec_idx 8, offset 256.
libbpf: map 'cache_miss_hc_reader': found type = 4.
libbpf: map 'cache_miss_hc_reader': found key [2], sz = 4.
libbpf: map 'cache_miss_hc_reader': found value [6], sz = 4.
libbpf: map 'cache_miss_hc_reader': found max_entries = 128.
libbpf: map 'cache_miss': at sec_idx 8, offset 288.
libbpf: map 'cache_miss': found type = 2.
libbpf: map 'cache_miss': found key [6], sz = 4.
libbpf: map 'cache_miss': found value [12], sz = 8.
libbpf: map 'cache_miss': found max_entries = 128.
libbpf: map 'cpu_freq_array': at sec_idx 8, offset 320.
libbpf: map 'cpu_freq_array': found type = 2.
libbpf: map 'cpu_freq_array': found key [6], sz = 4.
libbpf: map 'cpu_freq_array': found value [6], sz = 4.
libbpf: map 'cpu_freq_array': found max_entries = 128.
libbpf: map 'amd64_ke.data' (global data): at sec_idx 7, offset 0, flags 400.
libbpf: map 11 is "amd64_ke.data"
libbpf: sec '.reltracepoint/sched/sched_switch': collecting relocation for section(3) 'tracepoint/sched/sched_switch'
libbpf: sec '.reltracepoint/sched/sched_switch': relo #0: insn #1 against 'counter'
libbpf: prog 'kepler_trace': found data map 11 (amd64_ke.data, sec 7, off 0) for insn 1
libbpf: sec '.reltracepoint/sched/sched_switch': relo #1: insn #12 against 'sample_rate'
libbpf: prog 'kepler_trace': found data map 11 (amd64_ke.data, sec 7, off 0) for insn 12
libbpf: sec '.reltracepoint/sched/sched_switch': relo #2: insn #32 against 'cpu_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 2 (cpu_cycles_hc_reader, sec 8, off 64) for insn #32
libbpf: sec '.reltracepoint/sched/sched_switch': relo #3: insn #51 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 8, off 96) for insn #51
libbpf: sec '.reltracepoint/sched/sched_switch': relo #4: insn #65 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 8, off 96) for insn #65
libbpf: sec '.reltracepoint/sched/sched_switch': relo #5: insn #70 against 'cpu_ref_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 4 (cpu_ref_cycles_hc_reader, sec 8, off 128) for insn #70
libbpf: sec '.reltracepoint/sched/sched_switch': relo #6: insn #83 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 8, off 160) for insn #83
libbpf: sec '.reltracepoint/sched/sched_switch': relo #7: insn #97 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 8, off 160) for insn #97
libbpf: sec '.reltracepoint/sched/sched_switch': relo #8: insn #102 against 'cpu_instructions_hc_reader'
libbpf: prog 'kepler_trace': found map 6 (cpu_instructions_hc_reader, sec 8, off 192) for insn #102
libbpf: sec '.reltracepoint/sched/sched_switch': relo #9: insn #119 against 'cpu_instructions'
libbpf: prog 'kepler_trace': found map 7 (cpu_instructions, sec 8, off 224) for insn #119
libbpf: sec '.reltracepoint/sched/sched_switch': relo #10: insn #132 against 'cpu_instructions'
libbpf: prog 'kepler_trace': found map 7 (cpu_instructions, sec 8, off 224) for insn #132
libbpf: sec '.reltracepoint/sched/sched_switch': relo #11: insn #137 against 'cache_miss_hc_reader'
libbpf: prog 'kepler_trace': found map 8 (cache_miss_hc_reader, sec 8, off 256) for insn #137
libbpf: sec '.reltracepoint/sched/sched_switch': relo #12: insn #149 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 8, off 288) for insn #149
libbpf: sec '.reltracepoint/sched/sched_switch': relo #13: insn #163 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 8, off 288) for insn #163
libbpf: sec '.reltracepoint/sched/sched_switch': relo #14: insn #171 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 8, off 320) for insn #171
libbpf: sec '.reltracepoint/sched/sched_switch': relo #15: insn #185 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 8, off 320) for insn #185
libbpf: sec '.reltracepoint/sched/sched_switch': relo #16: insn #197 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 8, off 320) for insn #197
libbpf: sec '.reltracepoint/sched/sched_switch': relo #17: insn #221 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 8, off 320) for insn #221
libbpf: sec '.reltracepoint/sched/sched_switch': relo #18: insn #230 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 8, off 32) for insn #230
libbpf: sec '.reltracepoint/sched/sched_switch': relo #19: insn #238 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 8, off 32) for insn #238
libbpf: sec '.reltracepoint/sched/sched_switch': relo #20: insn #250 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 8, off 32) for insn #250
libbpf: sec '.reltracepoint/sched/sched_switch': relo #21: insn #256 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 8, off 0) for insn #256
libbpf: sec '.reltracepoint/sched/sched_switch': relo #22: insn #276 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 8, off 0) for insn #276
libbpf: sec '.reltracepoint/sched/sched_switch': relo #23: insn #302 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 8, off 0) for insn #302
libbpf: sec '.reltracepoint/irq/softirq_entry': collecting relocation for section(5) 'tracepoint/irq/softirq_entry'
libbpf: sec '.reltracepoint/irq/softirq_entry': relo #0: insn #5 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 8, off 0) for insn #5
libbpf: map 'processes': created successfully, fd=10
libbpf: map 'pid_time': created successfully, fd=11
libbpf: map 'cpu_cycles_hc_reader': created successfully, fd=12
libbpf: map 'cpu_cycles': created successfully, fd=13
libbpf: map 'cpu_ref_cycles_hc_reader': created successfully, fd=14
libbpf: map 'cpu_ref_cycles': created successfully, fd=15
libbpf: map 'cpu_instructions_hc_reader': created successfully, fd=16
libbpf: map 'cpu_instructions': created successfully, fd=17
libbpf: map 'cache_miss_hc_reader': created successfully, fd=18
libbpf: map 'cache_miss': created successfully, fd=19
libbpf: map 'cpu_freq_array': created successfully, fd=20
libbpf: map 'amd64_ke.data': skipped auto-creating...
libbpf: prog 'kepler_trace': relo #0: poisoning insn #1 that loads map #11 'amd64_ke.data'
libbpf: prog 'kepler_trace': relo #1: poisoning insn #12 that loads map #11 'amd64_ke.data'
libbpf: prog 'kepler_trace': BPF program load failed: Invalid argument
libbpf: prog 'kepler_trace': -- BEGIN PROG LOAD LOG --
0: (bf) r8 = r1
1: <invalid BPF map reference>
BPF map 'amd64_ke.data' is referenced but wasn't created
-- END PROG LOAD LOG --
libbpf: prog 'kepler_trace': failed to load: -22
libbpf: failed to load object '/var/lib/kepler/bpfassets/amd64_kepler.bpf.o'
libbpf: prog 'kepler_trace': can't attach BPF program w/o FD (did you load it?)
libbpf: prog 'kepler_trace': failed to attach to tracepoint 'sched/sched_switch': Invalid argument
I0918 13:00:00.139399 1 bpf_perf.go:132] failed to attach bpf with libbpf: failed to attach sched/sched_switch: failed to attach tracepoint sched_switch to program kepler_trace: invalid argument
I0918 13:00:00.139416 1 exporter.go:241] failed to start : failed to attach bpf assets: no bcc build tag
I0918 13:00:00.139484 1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0918 13:00:00.139495 1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0918 13:00:00.139502 1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0918 13:00:00.139513 1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0918 13:00:00.139688 1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0918 13:00:00.139804 1 exporter.go:276] Started Kepler in 6.166857795s
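The `map 'amd64_ke.data': skipped auto-creating` followed by `invalid BPF map reference` is consistent with the node kernel (4.19) being too old for libbpf global-variable (`.data`) maps, which generally need roughly kernel 5.2 or newer (exact requirements depend on distro backports). A hedged quick check to run on the node:

```shell
# Rough feature probe for the libbpf-based Kepler image.
uname -r   # global .data/.bss BPF maps generally require a kernel newer than 4.19
if [ -e /sys/kernel/btf/vmlinux ]; then
  echo "BTF available"              # needed for CO-RE relocations
else
  echo "no BTF exposed by this kernel"
fi
```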
And then, looking at the output of Kepler:
> kubectl exec -ti -n monitoring daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total
# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{command="",container_id="2696397f01e6ad716f59037da407d9c53e4a4504981b4bf299b2e5973b81f872",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-4-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="2696397f01e6ad716f59037da407d9c53e4a4504981b4bf299b2e5973b81f872",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-4-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="3a71a1a3b2c59d457b2dd73eae974023024a92ea68bcf49163d9bad3030de682",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-5-guest-matmul"} 0
kepler_container_core_joules_total{command="",container_id="3a71a1a3b2c59d457b2dd73eae974023024a92ea68bcf49163d9bad3030de682",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-5-guest-matmul"} 0
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What happened?
Hello, I'm trying to use Kepler on a machine with access to the hardware counters, but it does not seem to be working. On my VMs, I can see it working with the estimations, but now that I'm deploying it on these new machines, I only see 0s as the measurements.
I tried to install Kepler via the Helm chart and by building it and applying the deployment file afterwards, but I had no success.
When I install it with Helm, I can see the following logs:
Then, when I query:
When I try to build it myself using
make build-manifest OPTS="PROMETHEUS_DEPLOY"
I can see in the logs: What is weird is that it complains about /lib/modules, which is installed on both of the machines that I'm using:
And finally, the result of the query is the same as above.
Can you help me, please?
PS: In fact, my goal is not to get the measurements from the real counters; I want to validate Kepler's estimations by cross-checking them against the power meters that are installed in these machines. So, if possible, I would like to keep using the estimations, but I don't know how to specify that either.
Can you help me solve both issues (the main one and the PS), please?
Thank you very much!!
What did you expect to happen?
To get the estimations from Kepler.
How can we reproduce it (as minimally and precisely as possible)?
By following the commands I showed above.
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version
Install tools
Kepler deployment config
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)