sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics
https://sustainable-computing.io
Apache License 2.0

Kepler cgroup metrics are zeroes #860

Closed tobby-yuan closed 8 months ago

tobby-yuan commented 1 year ago

What happened?

Hi everyone, this is my VM environment.

However, some of my metrics (e.g. kepler_container_cgroupfs_cpu_usage_us_total, kepler_container_core_joules_total, and so on) are always 0. The following is my Kepler log:

I0810 13:36:11.444138       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0810 13:36:11.456218       1 exporter.go:155] Kepler running on version: a7a6cb1
I0810 13:36:11.456299       1 config.go:258] using gCgroup ID in the BPF program: true
I0810 13:36:11.456394       1 config.go:260] kernel version: 5.15
I0810 13:36:11.456441       1 exporter.go:179] EnabledBPFBatchDelete: true
I0810 13:36:11.456485       1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I0810 13:36:11.456556       1 power.go:64] Not able to obtain power, use estimate method
I0810 13:36:11.456585       1 redfish.go:169] failed to get redfish credential file path
I0810 13:36:11.456601       1 power.go:55] use acpi to obtain power
I0810 13:36:11.456790       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0810 13:36:11.467127       1 exporter.go:198] Initializing the GPU collector
I0810 13:36:17.469943       1 watcher.go:66] Using in cluster k8s config
I0810 13:36:17.571852       1 bpf_perf.go:123] LibbpfBuilt: false, BccBuilt: true
cannot attach kprobe, probe entry may not exist
I0810 13:36:18.414147       1 bcc_attacher.go:186] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=4 -DSET_GROUP_ID]
I0810 13:36:18.440766       1 exporter.go:251] Started Kepler in 6.984594733s

This would affect the estimated energy consumption (kepler_container_joules_total), right?
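To pin down exactly which metrics are stuck at zero, one option is to scan the text that the exporter serves to Prometheus. The following is a minimal sketch, not part of Kepler: the function name `zero_only_metrics`, the sample text, and the `container_name` label values are illustrative assumptions, and the parser only handles the simple `name{labels} value` form of the Prometheus text format.

```python
from collections import defaultdict

# Illustrative sample only; the label values are made up for this sketch.
SAMPLE = """\
# TYPE kepler_container_joules_total counter
kepler_container_joules_total{container_name="a"} 12.5
kepler_container_core_joules_total{container_name="a"} 0
kepler_container_core_joules_total{container_name="b"} 0
kepler_container_cgroupfs_cpu_usage_us_total{container_name="a"} 0
"""


def zero_only_metrics(exposition_text):
    """Return sorted names of metrics whose samples are all exactly 0."""
    samples = defaultdict(list)
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line.split("{", 1)[0]
            # Everything after the last "}" is "value [timestamp]".
            tail = line.rsplit("}", 1)[1].split()
        else:
            name, *tail = line.split()
        if not tail:
            continue  # malformed sample, ignore it
        samples[name].append(float(tail[0]))
    return sorted(n for n, vals in samples.items() if all(v == 0 for v in vals))


if __name__ == "__main__":
    for metric in zero_only_metrics(SAMPLE):
        print(metric)
```

Feeding it the text scraped from the exporter's metrics endpoint gives a short list of suspect metric names to compare between a working and a non-working deployment.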

What did you expect to happen?

.

How can we reproduce it (as minimally and precisely as possible)?

.

Anything else we need to know?

No response

Kepler image tag

Kubernetes version

```console
$ kubectl version
# paste output here
```

Cloud provider or bare metal

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Kepler deployment config

For on kubernetes:

```console
$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
```

For standalone:

# put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

jokuniew commented 1 year ago

I have the same issue with bare-metal setup:

- ubuntu 22.04.1 LTS, Linux 5.15.0-43-generic
- kind v0.20.0 go1.20.4 linux/amd64
- kubectl v1.27.4
- cgroup: v2
- CRI: RuntimeName: containerd, RuntimeVersion: 1.6.22
- all paths (/proc, /usr/src) mounted as in guide https://sustainable-computing.io/installation/local-cluster/

I0811 15:22:08.623322       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0811 15:22:08.665991       1 exporter.go:155] Kepler running on version: a7a6cb1
I0811 15:22:08.666020       1 config.go:258] using gCgroup ID in the BPF program: true
I0811 15:22:08.666055       1 config.go:260] kernel version: 5.15
I0811 15:22:08.666083       1 exporter.go:179] EnabledBPFBatchDelete: true
I0811 15:22:08.666175       1 power.go:53] use sysfs to obtain power
I0811 15:22:08.666195       1 redfish.go:169] failed to get redfish credential file path
I0811 15:22:08.666210       1 power.go:55] use acpi to obtain power
I0811 15:22:08.676428       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0811 15:22:08.793791       1 exporter.go:198] Initializing the GPU collector
I0811 15:22:14.799490       1 watcher.go:66] Using in cluster k8s config
I0811 15:22:14.901204       1 bpf_perf.go:123] LibbpfBuilt: false, BccBuilt: true
cannot attach kprobe, probe entry may not exist
I0811 15:22:15.761644       1 bcc_attacher.go:186] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSET_GROUP_ID]
I0811 15:22:15.777384       1 exporter.go:251] Started Kepler in 7.111410022s

The log line that concerns me the most is "cannot attach kprobe, probe entry may not exist". I didn't have this error on Ubuntu 20.04 with kernel 5.4.

I've seen in other issues that @rootfs recommends checking grep finish_task_switch /proc/kallsyms, and with kernel 5.14 I get these values:

ffffffffa96f38b0 t finish_task_switch.isra.0
ffffffffaa2ab721 t finish_task_switch.isra.0.cold
ffffffffc0d750d0 t bpf_prog_a4e3c0d94fadbd36_kprobe__finish_task_switch [bpf]
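The output above shows no plain finish_task_switch symbol, only a compiler-renamed variant, which is why an attach to the exact name can fail. As a rough sketch of how one could pick the candidate names from such output, assuming the usual three-column kallsyms layout (address, type, name): the helper name `kprobe_candidates` is hypothetical, and skipping `.cold` fragments is a choice made here because they are split-off cold paths rather than function entry points.

```python
# The three lines below are copied from the kallsyms output in this comment.
KALLSYMS = """\
ffffffffa96f38b0 t finish_task_switch.isra.0
ffffffffaa2ab721 t finish_task_switch.isra.0.cold
ffffffffc0d750d0 t bpf_prog_a4e3c0d94fadbd36_kprobe__finish_task_switch [bpf]
"""


def kprobe_candidates(kallsyms_text, base="finish_task_switch"):
    """List text symbols named `base` or `base.<suffix>`, exact name first."""
    names = []
    for line in kallsyms_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not an "address type name" line
        sym_type, name = fields[1], fields[2]
        if sym_type not in ("t", "T"):
            continue  # only text (code) symbols can be probed
        if name.endswith(".cold"):
            continue  # cold fragment, not a function entry (assumption above)
        if (name == base or name.startswith(base + ".")) and name not in names:
            names.append(name)
    # Prefer the exact name, then the shortest variant.
    names.sort(key=lambda n: (n != base, len(n), n))
    return names


if __name__ == "__main__":
    print(kprobe_candidates(KALLSYMS))
```

On a live node the same function could be given the contents of /proc/kallsyms; an empty result would mean there is nothing with that name to attach to.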

Kepler didn't log the error (as I believe it should): https://github.com/sustainable-computing-io/kepler/blob/main/pkg/bpfassets/attacher/bcc_attacher.go#L94

And finally, did it attach to finish_task_switch or not?

tobby-yuan commented 1 year ago

Hi @jokuniew, this is my output from grep finish_task_switch /proc/kallsyms:

ffffffff8aafbc40 t finish_task_switch.isra.0
ffffffff8b70bdb5 t finish_task_switch.isra.0.cold
ffffffffc0e987f0 t bpf_prog_a4e3c0d94fadbd36_kprobe__finish_task_switch [bpf]

This output is the same as yours. So how can I solve this problem, and why did it happen?

rootfs commented 1 year ago

Can you try the libbpf image? Please change the Kepler image from latest to latest-libbpf. You can do it with:

kubectl edit -n kepler daemonset kepler-exporter

jichenjc commented 1 year ago

> Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSET_GROUP_ID]

As far as we have seen, this usually means the attach works; if the attach had failed, we would not reach this point.

> But my some metrics(e.g. kepler_container_cgroupfs_cpu_usage_us_total, kepler_container_core_joules_total and so on) is alway 0

If you run Kepler with --v 5, it might print some information that helps explain why 0 is reported, and whether it is working as designed or something is really wrong.

tobby-yuan commented 1 year ago

@jichenjc Do you mean that I should change the log level of Kepler to 5?

jichenjc commented 1 year ago

Yes. With the level changed to 5, Kepler prints more logs, including details of the metrics it reports every 3 seconds (cgroup and eBPF), which might give you additional information on why the values are 0.

jokuniew commented 1 year ago

> can you try the libbpf image? Please change the kepler image from latest to latest-libbpf. You can do it by
>
> kubectl edit -n kepler daemonset kepler-exporter

@rootfs I've changed the image, and it ends up with a panic:

$ kubectl logs -n kepler kepler-nhl4b
I0816 06:19:11.896894       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0816 06:19:11.930157       1 exporter.go:156] Kepler running on version: c2f4277
I0816 06:19:11.930175       1 config.go:263] using gCgroup ID in the BPF program: true
I0816 06:19:11.930205       1 config.go:265] kernel version: 5.15
I0816 06:19:11.930263       1 config.go:290] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0816 06:19:11.930269       1 exporter.go:181] EnabledBPFBatchDelete: true
I0816 06:19:11.930309       1 power.go:54] use sysfs to obtain power
I0816 06:19:11.930316       1 redfish.go:169] failed to get redfish credential file path
I0816 06:19:11.930321       1 power.go:56] use acpi to obtain power
I0816 06:19:11.937727       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0816 06:19:12.034630       1 container_energy.go:109] Using the Ratio/AbsPower Power Model to estimate Container Platform Power
I0816 06:19:12.034662       1 container_energy.go:118] Using the Ratio/AbsPower Power Model to estimate Container Component Power
I0816 06:19:12.034680       1 process_power.go:108] Using the Ratio/AbsPower Power Model to estimate Process Platform Power
I0816 06:19:12.034700       1 process_power.go:117] Using the Ratio/AbsPower Power Model to estimate Process Component Power
I0816 06:19:12.034999       1 node_platform_energy.go:52] Using the LinearRegressor/AbsModelWeight Power Model to estimate Node Platform Power
I0816 06:19:12.035093       1 exporter.go:204] Initializing the GPU collector
I0816 06:19:18.040743       1 watcher.go:66] Using in cluster k8s config
I0816 06:19:18.141471       1 bpf_perf.go:123] LibbpfBuilt: true, BccBuilt: false
libbpf: loading /var/lib/kepler/bpfassets/amd64_kepler.bpf.o
libbpf: elf: section(3) tracepoint/sched/sched_switch, size 2376, link 0, flags 6, type=1
libbpf: sec 'tracepoint/sched/sched_switch': found program 'kepler_trace' at insn offset 0 (0 bytes), code size 297 insns (2376 bytes)
libbpf: elf: section(4) .reltracepoint/sched/sched_switch, size 352, link 26, flags 40, type=9
libbpf: elf: section(5) tracepoint/irq/softirq_entry, size 144, link 0, flags 6, type=1
libbpf: sec 'tracepoint/irq/softirq_entry': found program 'kepler_irq_trace' at insn offset 0 (0 bytes), code size 18 insns (144 bytes)
libbpf: elf: section(6) .reltracepoint/irq/softirq_entry, size 16, link 26, flags 40, type=9
libbpf: elf: section(7) .maps, size 352, link 0, flags 3, type=1
libbpf: elf: section(8) license, size 4, link 0, flags 3, type=1
libbpf: license of /var/lib/kepler/bpfassets/amd64_kepler.bpf.o is GPL
libbpf: elf: section(17) .BTF, size 5838, link 0, flags 0, type=1
libbpf: elf: section(19) .BTF.ext, size 2072, link 0, flags 0, type=1
libbpf: elf: section(26) .symtab, size 984, link 1, flags 0, type=2
libbpf: looking for externs among 41 symbols...
libbpf: collected 0 externs total
libbpf: map 'processes': at sec_idx 7, offset 0.
libbpf: map 'processes': found type = 1.
libbpf: map 'processes': found key [6], sz = 4.
libbpf: map 'processes': found value [10], sz = 88.
libbpf: map 'processes': found max_entries = 10240.
libbpf: map 'pid_time': at sec_idx 7, offset 32.
libbpf: map 'pid_time': found type = 1.
libbpf: map 'pid_time': found key [6], sz = 4.
libbpf: map 'pid_time': found value [12], sz = 8.
libbpf: map 'pid_time': found max_entries = 10240.
libbpf: map 'cpu_cycles_hc_reader': at sec_idx 7, offset 64.
libbpf: map 'cpu_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_cycles': at sec_idx 7, offset 96.
libbpf: map 'cpu_cycles': found type = 2.
libbpf: map 'cpu_cycles': found key [6], sz = 4.
libbpf: map 'cpu_cycles': found value [12], sz = 8.
libbpf: map 'cpu_cycles': found max_entries = 128.
libbpf: map 'cpu_ref_cycles_hc_reader': at sec_idx 7, offset 128.
libbpf: map 'cpu_ref_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_ref_cycles': at sec_idx 7, offset 160.
libbpf: map 'cpu_ref_cycles': found type = 2.
libbpf: map 'cpu_ref_cycles': found key [6], sz = 4.
libbpf: map 'cpu_ref_cycles': found value [12], sz = 8.
libbpf: map 'cpu_ref_cycles': found max_entries = 128.
libbpf: map 'cpu_instr_hc_reader': at sec_idx 7, offset 192.
libbpf: map 'cpu_instr_hc_reader': found type = 4.
libbpf: map 'cpu_instr_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_instr_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_instr_hc_reader': found max_entries = 128.
libbpf: map 'cpu_instr': at sec_idx 7, offset 224.
libbpf: map 'cpu_instr': found type = 2.
libbpf: map 'cpu_instr': found key [6], sz = 4.
libbpf: map 'cpu_instr': found value [12], sz = 8.
libbpf: map 'cpu_instr': found max_entries = 128.
libbpf: map 'cache_miss_hc_reader': at sec_idx 7, offset 256.
libbpf: map 'cache_miss_hc_reader': found type = 4.
libbpf: map 'cache_miss_hc_reader': found key [2], sz = 4.
libbpf: map 'cache_miss_hc_reader': found value [6], sz = 4.
libbpf: map 'cache_miss_hc_reader': found max_entries = 128.
libbpf: map 'cache_miss': at sec_idx 7, offset 288.
libbpf: map 'cache_miss': found type = 2.
libbpf: map 'cache_miss': found key [6], sz = 4.
libbpf: map 'cache_miss': found value [12], sz = 8.
libbpf: map 'cache_miss': found max_entries = 128.
libbpf: map 'cpu_freq_array': at sec_idx 7, offset 320.
libbpf: map 'cpu_freq_array': found type = 2.
libbpf: map 'cpu_freq_array': found key [6], sz = 4.
libbpf: map 'cpu_freq_array': found value [6], sz = 4.
libbpf: map 'cpu_freq_array': found max_entries = 128.
libbpf: sec '.reltracepoint/sched/sched_switch': collecting relocation for section(3) 'tracepoint/sched/sched_switch'
libbpf: sec '.reltracepoint/sched/sched_switch': relo #0: insn #18 against 'cpu_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 2 (cpu_cycles_hc_reader, sec 7, off 64) for insn #18
libbpf: sec '.reltracepoint/sched/sched_switch': relo #1: insn #37 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 7, off 96) for insn #37
libbpf: sec '.reltracepoint/sched/sched_switch': relo #2: insn #51 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 7, off 96) for insn #51
libbpf: sec '.reltracepoint/sched/sched_switch': relo #3: insn #56 against 'cpu_ref_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 4 (cpu_ref_cycles_hc_reader, sec 7, off 128) for insn #56
libbpf: sec '.reltracepoint/sched/sched_switch': relo #4: insn #69 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 7, off 160) for insn #69
libbpf: sec '.reltracepoint/sched/sched_switch': relo #5: insn #83 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 7, off 160) for insn #83
libbpf: sec '.reltracepoint/sched/sched_switch': relo #6: insn #88 against 'cpu_instr_hc_reader'
libbpf: prog 'kepler_trace': found map 6 (cpu_instr_hc_reader, sec 7, off 192) for insn #88
libbpf: sec '.reltracepoint/sched/sched_switch': relo #7: insn #105 against 'cpu_instr'
libbpf: prog 'kepler_trace': found map 7 (cpu_instr, sec 7, off 224) for insn #105
libbpf: sec '.reltracepoint/sched/sched_switch': relo #8: insn #118 against 'cpu_instr'
libbpf: prog 'kepler_trace': found map 7 (cpu_instr, sec 7, off 224) for insn #118
libbpf: sec '.reltracepoint/sched/sched_switch': relo #9: insn #123 against 'cache_miss_hc_reader'
libbpf: prog 'kepler_trace': found map 8 (cache_miss_hc_reader, sec 7, off 256) for insn #123
libbpf: sec '.reltracepoint/sched/sched_switch': relo #10: insn #135 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 7, off 288) for insn #135
libbpf: sec '.reltracepoint/sched/sched_switch': relo #11: insn #149 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 7, off 288) for insn #149
libbpf: sec '.reltracepoint/sched/sched_switch': relo #12: insn #157 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #157
libbpf: sec '.reltracepoint/sched/sched_switch': relo #13: insn #171 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #171
libbpf: sec '.reltracepoint/sched/sched_switch': relo #14: insn #183 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #183
libbpf: sec '.reltracepoint/sched/sched_switch': relo #15: insn #207 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #207
libbpf: sec '.reltracepoint/sched/sched_switch': relo #16: insn #216 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #216
libbpf: sec '.reltracepoint/sched/sched_switch': relo #17: insn #224 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #224
libbpf: sec '.reltracepoint/sched/sched_switch': relo #18: insn #236 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #236
libbpf: sec '.reltracepoint/sched/sched_switch': relo #19: insn #242 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #242
libbpf: sec '.reltracepoint/sched/sched_switch': relo #20: insn #264 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #264
libbpf: sec '.reltracepoint/sched/sched_switch': relo #21: insn #291 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #291
libbpf: sec '.reltracepoint/irq/softirq_entry': collecting relocation for section(5) 'tracepoint/irq/softirq_entry'
libbpf: sec '.reltracepoint/irq/softirq_entry': relo #0: insn #5 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 7, off 0) for insn #5
libbpf: map 'processes': created successfully, fd=10
libbpf: map 'pid_time': created successfully, fd=11
libbpf: map 'cpu_cycles_hc_reader': created successfully, fd=12
libbpf: map 'cpu_cycles': created successfully, fd=13
libbpf: map 'cpu_ref_cycles_hc_reader': created successfully, fd=14
libbpf: map 'cpu_ref_cycles': created successfully, fd=15
libbpf: map 'cpu_instr_hc_reader': created successfully, fd=16
libbpf: map 'cpu_instr': created successfully, fd=17
libbpf: map 'cache_miss_hc_reader': created successfully, fd=18
libbpf: map 'cache_miss': created successfully, fd=19
libbpf: map 'cpu_freq_array': created successfully, fd=20
I0816 06:19:18.242250       1 libbpf_attacher.go:153] Successfully load eBPF module from libbpf object
I0816 06:19:18.265889       1 exporter.go:257] Started Kepler in 6.335731883s
panic: runtime error: index out of range [0] with length 0

goroutine 162 [running]:
github.com/sustainable-computing-io/kepler/pkg/model.addProcessEstimatedEnergy({0xc000a80000, 0x1c4, 0xc00062a340?}, 0x0?, 0x1)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/model/process_power.go:270 +0x1030
github.com/sustainable-computing-io/kepler/pkg/model.UpdateProcessEnergy(0xc0003e8100?, 0xc00044660e?)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/model/process_power.go:138 +0x145
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateProcessEnergy(...)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/collector/process_energy_collector.go:25
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xc0003e8100)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/collector/metric_collector.go:120 +0x10b
github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/manager/manager.go:72 +0x7b
created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/manager/manager.go:64 +0x6a

jokuniew commented 1 year ago

I need to add that I previously deployed Kepler via the Helm chart. All the errors reported above are from the Helm deployment.

When I deployed Kepler with the manifests (make & apply), I got the right log message and the metrics are valid:

I0816 09:24:28.557873       1 watcher.go:66] Using in cluster k8s config
I0816 09:24:28.658788       1 bpf_perf.go:123] LibbpfBuilt: false, BccBuilt: true
cannot attach kprobe, probe entry may not exist
I0816 09:24:29.503997       1 bcc_attacher.go:94] attaching kprobe to finish_task_switch failed, trying finish_task_switch.isra.0 instead
I0816 09:24:29.522823       1 bcc_attacher.go:183] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSET_GROUP_ID]
I0816 09:24:29.551285       1 exporter.go:257] Started Kepler in 7.069365222s
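The difference between the two BCC logs in this thread is the presence of the bcc_attacher.go fallback line after "cannot attach kprobe". A small heuristic sketch can make that comparison mechanical. It only matches strings that appear in the logs above; the function name `kprobe_attach_status` and its return labels are hypothetical, and a "suspect" result is a hint to investigate, not proof that the probe is detached.

```python
# Excerpt of the Helm-deployed log from this thread (zero metrics).
HELM_LOG = """\
cannot attach kprobe, probe entry may not exist
I0811 15:22:15.761644       1 bcc_attacher.go:186] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSET_GROUP_ID]
"""

# Excerpt of the manifest-deployed log from this thread (valid metrics).
MANIFEST_LOG = """\
cannot attach kprobe, probe entry may not exist
I0816 09:24:29.503997       1 bcc_attacher.go:94] attaching kprobe to finish_task_switch failed, trying finish_task_switch.isra.0 instead
I0816 09:24:29.522823       1 bcc_attacher.go:183] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSET_GROUP_ID]
"""


def kprobe_attach_status(log_text):
    """Classify a Kepler BCC startup log by its kprobe attach messages."""
    loaded = "Successfully load eBPF module" in log_text
    first_try_failed = "cannot attach kprobe" in log_text
    fallback_logged = "attaching kprobe to finish_task_switch failed" in log_text
    if not loaded:
        return "not-loaded"
    if first_try_failed and fallback_logged:
        return "fallback"  # retried with the renamed symbol
    if first_try_failed:
        return "suspect"  # failure reported, but no retry was logged
    return "ok"


if __name__ == "__main__":
    print("helm:", kprobe_attach_status(HELM_LOG))
    print("manifest:", kprobe_attach_status(MANIFEST_LOG))
```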

tobby-yuan commented 1 year ago

> yes, change to 5 it will print some logs and should have more info about the every 3 seconds report of the metrics including cgroup and eBPF , it might give you additional info on why it's 0

@jichenjc @rootfs


$ kubectl logs -n kepler kepler-exporter-rjmb2 kepler-exporter
I0818 07:18:32.850191       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0818 07:18:32.855273       1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0818 07:18:32.865892       1 exporter.go:158] Kepler running on version: b57ffa3
I0818 07:18:32.865922       1 config.go:267] using gCgroup ID in the BPF program: true
I0818 07:18:32.865980       1 config.go:269] kernel version: 5.15
I0818 07:18:32.867090       1 exporter.go:184] EnabledBPFBatchDelete: true
I0818 07:18:32.867110       1 config.go:140] ENABLE_EBPF_CGROUPID: true
I0818 07:18:32.867113       1 config.go:141] ENABLE_GPU: true
I0818 07:18:32.867159       1 config.go:142] ENABLE_QAT: false
I0818 07:18:32.867162       1 config.go:143] ENABLE_PROCESS_METRICS: false
I0818 07:18:32.867165       1 config.go:144] EXPOSE_HW_COUNTER_METRICS: false
I0818 07:18:32.867180       1 config.go:145] EXPOSE_CGROUP_METRICS: true
I0818 07:18:32.867183       1 config.go:146] EXPOSE_KUBELET_METRICS: true
I0818 07:18:32.867185       1 config.go:147] EXPOSE_IRQ_COUNTER_METRICS: true
I0818 07:18:32.867188       1 config.go:148] EXPOSE_ESTIMATED_IDLE_POWER_METRICS: false. This only impacts when the power is estimated using pre-prained models. Estimated idle power is meaningful only when Kepler is running on bare-metal or with a single virtual machine (VM) on the node.
I0818 07:18:32.867321       1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I0818 07:18:32.867436       1 power.go:71] Unable to obtain power, use estimate method
I0818 07:18:32.867557       1 redfish.go:169] failed to get redfish credential file path
I0818 07:18:32.867761       1 power.go:56] use acpi to obtain power
I0818 07:18:32.869199       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0818 07:18:32.869256       1 bpf_perf.go:81] hardeware counter metrics config false
I0818 07:18:32.869294       1 bpf_perf.go:83] hardeware counter metrics not enabled
I0818 07:18:32.869339       1 bpf_perf.go:99] irq counter metrics config true
I0818 07:18:32.892403       1 utils.go:56] Available ebpf metrics: [bpf_cpu_time_us bpf_net_tx_irq bpf_net_rx_irq bpf_block_irq]
I0818 07:18:32.892440       1 utils.go:57] Available counter metrics: []
I0818 07:18:32.892443       1 utils.go:58] Available cgroup metrics from cgroup: [cgroupfs_memory_usage_bytes cgroupfs_kernel_memory_usage_bytes cgroupfs_tcp_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_ioread_bytes cgroupfs_iowrite_bytes block_devices_used]
I0818 07:18:32.892466       1 utils.go:59] Available cgroup metrics from kubelet: [kubelet_cpu_usage kubelet_memory_bytes]
I0818 07:18:32.892550       1 model.go:174] Model Config CONTAINER_TOTAL: {ModelType:Ratio ModelOutputType:AbsModelWeight TrainerName: EnergySource:acpi SelectFilter: InitModelURL: IsNodePowerModel:false ContainerFeatureNames:[] NodeFeatureNames:[] SystemMetaDataFeatureNames:[] SystemMetaDataFeatureValues:[]}
I0818 07:18:32.892581       1 model.go:94] Using Power Model Ratio
I0818 07:18:32.892591       1 container_energy.go:109] Using the Ratio/AbsModelWeight Power Model to estimate Container Platform Power
I0818 07:18:32.892597       1 model.go:174] Model Config CONTAINER_COMPONENTS: {ModelType:EstimatorSidecar ModelOutputType:AbsModelWeight TrainerName: EnergySource:rapl SelectFilter: InitModelURL:https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed.zip IsNodePowerModel:false ContainerFeatureNames:[] NodeFeatureNames:[] SystemMetaDataFeatureNames:[] SystemMetaDataFeatureValues:[]}
I0818 07:18:32.894319       1 estimate.go:140] estimator unmarshal error: json: cannot unmarshal array into Go struct field ComponentPowerResponse.powers of type map[string][]float64 ({"powers": [], "msg": "fail to handle request: __init__() got an unexpected keyword argument 'source'"})
I0818 07:18:32.894378       1 container_energy.go:120] Failed to create EstimatorSidecar/AbsModelWeight Power Model to estimate Container Component Power: json: cannot unmarshal array into Go struct field ComponentPowerResponse.powers of type map[string][]float64
I0818 07:18:32.894401       1 model.go:174] Model Config PROCESS_TOTAL: {ModelType:Ratio ModelOutputType:AbsModelWeight TrainerName: EnergySource:acpi SelectFilter: InitModelURL: IsNodePowerModel:false ContainerFeatureNames:[] NodeFeatureNames:[] SystemMetaDataFeatureNames:[] SystemMetaDataFeatureValues:[]}
I0818 07:18:32.894420       1 model.go:94] Using Power Model Ratio
I0818 07:18:32.894429       1 process_power.go:108] Using the Ratio/AbsModelWeight Power Model to estimate Process Platform Power
I0818 07:18:32.894436       1 model.go:174] Model Config PROCESS_COMPONENTS: {ModelType:Ratio ModelOutputType:AbsModelWeight TrainerName: EnergySource:rapl SelectFilter: InitModelURL: IsNodePowerModel:false ContainerFeatureNames:[] NodeFeatureNames:[] SystemMetaDataFeatureNames:[] SystemMetaDataFeatureValues:[]}
I0818 07:18:32.894447       1 model.go:94] Using Power Model Ratio
I0818 07:18:32.894452       1 process_power.go:117] Using the Ratio/AbsModelWeight Power Model to estimate Process Component Power
I0818 07:18:32.894462       1 model.go:174] Model Config NODE_TOTAL: {ModelType:LinearRegressor ModelOutputType:AbsPower TrainerName: EnergySource:acpi SelectFilter: InitModelURL: IsNodePowerModel:false ContainerFeatureNames:[] NodeFeatureNames:[] SystemMetaDataFeatureNames:[] SystemMetaDataFeatureValues:[]}
I0818 07:18:32.894685       1 lr.go:187] LR Model (AbsPower): loadWeightFromURLorLocal(/var/lib/kepler/data/AbsPowerModel.json): &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}] (error: <nil>)
I0818 07:18:32.894773       1 model.go:119] Using Power Model AbsPower
I0818 07:18:32.894778       1 node_platform_energy.go:52] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0818 07:18:32.894786       1 model.go:174] Model Config NODE_COMPONENTS: {ModelType:LinearRegressor ModelOutputType:AbsPower TrainerName: EnergySource:rapl SelectFilter: InitModelURL: IsNodePowerModel:false ContainerFeatureNames:[] NodeFeatureNames:[] SystemMetaDataFeatureNames:[] SystemMetaDataFeatureValues:[]}
I0818 07:18:32.894867       1 lr.go:187] LR Model (AbsPower): loadWeightFromURLorLocal(/var/lib/kepler/data/AbsPowerModel.json): &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}] (error: <nil>)
I0818 07:18:32.894891       1 model.go:119] Using Power Model AbsPower
I0818 07:18:32.894896       1 node_component_energy.go:56] Using the LinearRegressor/AbsPower Power Model to estimate Node Component Power
I0818 07:18:32.894908       1 exporter.go:207] Initializing the GPU collector
I0818 07:18:38.899012       1 watcher.go:66] Using in cluster k8s config
I0818 07:18:38.900148       1 reflector.go:221] Starting reflector <unspecified> (0s) from github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:122
I0818 07:18:38.900216       1 reflector.go:257] Listing and watching <unspecified> from github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:122
I0818 07:18:38.999871       1 shared_informer.go:285] caches populated
I0818 07:18:39.000158       1 bpf_perf.go:123] LibbpfBuilt: true, BccBuilt: false
libbpf: loading /var/lib/kepler/bpfassets/amd64_kepler.bpf.o
libbpf: elf: section(3) tracepoint/sched/sched_switch, size 2344, link 0, flags 6, type=1
libbpf: sec 'tracepoint/sched/sched_switch': found program 'kepler_trace' at insn offset 0 (0 bytes), code size 293 insns (2344 bytes)
libbpf: elf: section(4) .reltracepoint/sched/sched_switch, size 352, link 29, flags 40, type=9
libbpf: elf: section(5) tracepoint/irq/softirq_entry, size 144, link 0, flags 6, type=1
libbpf: sec 'tracepoint/irq/softirq_entry': found program 'kepler_irq_trace' at insn offset 0 (0 bytes), code size 18 insns (144 bytes)
libbpf: elf: section(6) .reltracepoint/irq/softirq_entry, size 16, link 29, flags 40, type=9
libbpf: elf: section(7) .maps, size 352, link 0, flags 3, type=1
libbpf: elf: section(8) license, size 4, link 0, flags 3, type=1
libbpf: license of /var/lib/kepler/bpfassets/amd64_kepler.bpf.o is GPL
libbpf: elf: section(19) .BTF, size 5759, link 0, flags 0, type=1
libbpf: elf: section(21) .BTF.ext, size 2120, link 0, flags 0, type=1
libbpf: elf: section(29) .symtab, size 1056, link 1, flags 0, type=2
libbpf: looking for externs among 44 symbols...
libbpf: collected 0 externs total
libbpf: map 'processes': at sec_idx 7, offset 0.
libbpf: map 'processes': found type = 1.
libbpf: map 'processes': found key [6], sz = 4.
libbpf: map 'processes': found value [10], sz = 88.
libbpf: map 'processes': found max_entries = 32768.
libbpf: map 'pid_time': at sec_idx 7, offset 32.
libbpf: map 'pid_time': found type = 1.
libbpf: map 'pid_time': found key [6], sz = 4.
libbpf: map 'pid_time': found value [12], sz = 8.
libbpf: map 'pid_time': found max_entries = 32768.
libbpf: map 'cpu_cycles_hc_reader': at sec_idx 7, offset 64.
libbpf: map 'cpu_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_cycles': at sec_idx 7, offset 96.
libbpf: map 'cpu_cycles': found type = 2.
libbpf: map 'cpu_cycles': found key [6], sz = 4.
libbpf: map 'cpu_cycles': found value [12], sz = 8.
libbpf: map 'cpu_cycles': found max_entries = 128.
libbpf: map 'cpu_ref_cycles_hc_reader': at sec_idx 7, offset 128.
libbpf: map 'cpu_ref_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_ref_cycles': at sec_idx 7, offset 160.
libbpf: map 'cpu_ref_cycles': found type = 2.
libbpf: map 'cpu_ref_cycles': found key [6], sz = 4.
libbpf: map 'cpu_ref_cycles': found value [12], sz = 8.
libbpf: map 'cpu_ref_cycles': found max_entries = 128.
libbpf: map 'cpu_instr_hc_reader': at sec_idx 7, offset 192.
libbpf: map 'cpu_instr_hc_reader': found type = 4.
libbpf: map 'cpu_instr_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_instr_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_instr_hc_reader': found max_entries = 128.
libbpf: map 'cpu_instr': at sec_idx 7, offset 224.
libbpf: map 'cpu_instr': found type = 2.
libbpf: map 'cpu_instr': found key [6], sz = 4.
libbpf: map 'cpu_instr': found value [12], sz = 8.
libbpf: map 'cpu_instr': found max_entries = 128.
libbpf: map 'cache_miss_hc_reader': at sec_idx 7, offset 256.
libbpf: map 'cache_miss_hc_reader': found type = 4.
libbpf: map 'cache_miss_hc_reader': found key [2], sz = 4.
libbpf: map 'cache_miss_hc_reader': found value [6], sz = 4.
libbpf: map 'cache_miss_hc_reader': found max_entries = 128.
libbpf: map 'cache_miss': at sec_idx 7, offset 288.
libbpf: map 'cache_miss': found type = 2.
libbpf: map 'cache_miss': found key [6], sz = 4.
libbpf: map 'cache_miss': found value [12], sz = 8.
libbpf: map 'cache_miss': found max_entries = 128.
libbpf: map 'cpu_freq_array': at sec_idx 7, offset 320.
libbpf: map 'cpu_freq_array': found type = 2.
libbpf: map 'cpu_freq_array': found key [6], sz = 4.
libbpf: map 'cpu_freq_array': found value [6], sz = 4.
libbpf: map 'cpu_freq_array': found max_entries = 128.
libbpf: sec '.reltracepoint/sched/sched_switch': collecting relocation for section(3) 'tracepoint/sched/sched_switch'
libbpf: sec '.reltracepoint/sched/sched_switch': relo #0: insn #17 against 'cpu_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 2 (cpu_cycles_hc_reader, sec 7, off 64) for insn #17
libbpf: sec '.reltracepoint/sched/sched_switch': relo #1: insn #36 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 7, off 96) for insn #36
libbpf: sec '.reltracepoint/sched/sched_switch': relo #2: insn #50 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 7, off 96) for insn #50
libbpf: sec '.reltracepoint/sched/sched_switch': relo #3: insn #55 against 'cpu_ref_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 4 (cpu_ref_cycles_hc_reader, sec 7, off 128) for insn #55
libbpf: sec '.reltracepoint/sched/sched_switch': relo #4: insn #68 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 7, off 160) for insn #68
libbpf: sec '.reltracepoint/sched/sched_switch': relo #5: insn #82 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 7, off 160) for insn #82
libbpf: sec '.reltracepoint/sched/sched_switch': relo #6: insn #87 against 'cpu_instr_hc_reader'
libbpf: prog 'kepler_trace': found map 6 (cpu_instr_hc_reader, sec 7, off 192) for insn #87
libbpf: sec '.reltracepoint/sched/sched_switch': relo #7: insn #104 against 'cpu_instr'
libbpf: prog 'kepler_trace': found map 7 (cpu_instr, sec 7, off 224) for insn #104
libbpf: sec '.reltracepoint/sched/sched_switch': relo #8: insn #117 against 'cpu_instr'
libbpf: prog 'kepler_trace': found map 7 (cpu_instr, sec 7, off 224) for insn #117
libbpf: sec '.reltracepoint/sched/sched_switch': relo #9: insn #122 against 'cache_miss_hc_reader'
libbpf: prog 'kepler_trace': found map 8 (cache_miss_hc_reader, sec 7, off 256) for insn #122
libbpf: sec '.reltracepoint/sched/sched_switch': relo #10: insn #134 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 7, off 288) for insn #134
libbpf: sec '.reltracepoint/sched/sched_switch': relo #11: insn #148 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 7, off 288) for insn #148
libbpf: sec '.reltracepoint/sched/sched_switch': relo #12: insn #156 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #156
libbpf: sec '.reltracepoint/sched/sched_switch': relo #13: insn #170 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #170
libbpf: sec '.reltracepoint/sched/sched_switch': relo #14: insn #182 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #182
libbpf: sec '.reltracepoint/sched/sched_switch': relo #15: insn #206 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #206
libbpf: sec '.reltracepoint/sched/sched_switch': relo #16: insn #215 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #215
libbpf: sec '.reltracepoint/sched/sched_switch': relo #17: insn #223 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #223
libbpf: sec '.reltracepoint/sched/sched_switch': relo #18: insn #235 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #235
libbpf: sec '.reltracepoint/sched/sched_switch': relo #19: insn #241 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #241
libbpf: sec '.reltracepoint/sched/sched_switch': relo #20: insn #261 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #261
libbpf: sec '.reltracepoint/sched/sched_switch': relo #21: insn #287 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #287
libbpf: sec '.reltracepoint/irq/softirq_entry': collecting relocation for section(5) 'tracepoint/irq/softirq_entry'
libbpf: sec '.reltracepoint/irq/softirq_entry': relo #0: insn #5 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 7, off 0) for insn #5
libbpf: map 'processes': created successfully, fd=10
libbpf: map 'pid_time': created successfully, fd=11
libbpf: map 'cpu_cycles_hc_reader': created successfully, fd=12
libbpf: map 'cpu_cycles': created successfully, fd=13
libbpf: map 'cpu_ref_cycles_hc_reader': created successfully, fd=14
libbpf: map 'cpu_ref_cycles': created successfully, fd=15
libbpf: map 'cpu_instr_hc_reader': created successfully, fd=16
libbpf: map 'cpu_instr': created successfully, fd=17
libbpf: map 'cache_miss_hc_reader': created successfully, fd=18
libbpf: map 'cache_miss': created successfully, fd=19
libbpf: map 'cpu_freq_array': created successfully, fd=20
I0818 07:18:39.009575       1 libbpf_attacher.go:143] failed to get perf event cpu_instructions_hc_reader: failed to find BPF map cpu_instructions_hc_reader: no such file or directory
I0818 07:18:39.009957       1 libbpf_attacher.go:157] Successfully load eBPF module from libbpf object
I0818 07:18:39.033896       1 node_metric.go:274] Unknown node feature: bpf_cpu_time_us, adding 0 value
I0818 07:18:39.033923       1 node_metric.go:274] Unknown node feature: bpf_net_tx_irq, adding 0 value
I0818 07:18:39.033929       1 node_metric.go:274] Unknown node feature: bpf_net_rx_irq, adding 0 value
I0818 07:18:39.033934       1 node_metric.go:274] Unknown node feature: bpf_block_irq, adding 0 value
I0818 07:18:39.033937       1 node_metric.go:274] Unknown node feature: cgroupfs_memory_usage_bytes, adding 0 value
I0818 07:18:39.033989       1 node_metric.go:274] Unknown node feature: cgroupfs_kernel_memory_usage_bytes, adding 0 value
I0818 07:18:39.033998       1 node_metric.go:274] Unknown node feature: cgroupfs_tcp_memory_usage_bytes, adding 0 value
I0818 07:18:39.034002       1 node_metric.go:274] Unknown node feature: cgroupfs_cpu_usage_us, adding 0 value
I0818 07:18:39.034005       1 node_metric.go:274] Unknown node feature: cgroupfs_system_cpu_usage_us, adding 0 value
I0818 07:18:39.034010       1 node_metric.go:274] Unknown node feature: cgroupfs_user_cpu_usage_us, adding 0 value
I0818 07:18:39.034013       1 node_metric.go:274] Unknown node feature: cgroupfs_ioread_bytes, adding 0 value
I0818 07:18:39.034017       1 node_metric.go:274] Unknown node feature: cgroupfs_iowrite_bytes, adding 0 value
I0818 07:18:39.034023       1 node_metric.go:274] Unknown node feature: block_devices_used, adding 0 value
I0818 07:18:39.034027       1 node_metric.go:274] Unknown node feature: kubelet_cpu_usage, adding 0 value
I0818 07:18:39.034030       1 node_metric.go:274] Unknown node feature: kubelet_memory_bytes, adding 0 value
I0818 07:18:39.034109       1 node_metric.go:274] Unknown node feature: block_devices_used, adding 0 value
I0818 07:18:39.034242       1 node_metric.go:274] Unknown node feature: bpf_cpu_time_us, adding 0 value
I0818 07:18:39.034247       1 node_metric.go:274] Unknown node feature: bpf_net_tx_irq, adding 0 value
I0818 07:18:39.034251       1 node_metric.go:274] Unknown node feature: bpf_net_rx_irq, adding 0 value
I0818 07:18:39.034254       1 node_metric.go:274] Unknown node feature: bpf_block_irq, adding 0 value
I0818 07:18:39.034257       1 node_metric.go:274] Unknown node feature: cgroupfs_memory_usage_bytes, adding 0 value
I0818 07:18:39.034261       1 node_metric.go:274] Unknown node feature: cgroupfs_kernel_memory_usage_bytes, adding 0 value
I0818 07:18:39.034264       1 node_metric.go:274] Unknown node feature: cgroupfs_tcp_memory_usage_bytes, adding 0 value
I0818 07:18:39.034267       1 node_metric.go:274] Unknown node feature: cgroupfs_cpu_usage_us, adding 0 value
I0818 07:18:39.034270       1 node_metric.go:274] Unknown node feature: cgroupfs_system_cpu_usage_us, adding 0 value
I0818 07:18:39.034275       1 node_metric.go:274] Unknown node feature: cgroupfs_user_cpu_usage_us, adding 0 value
I0818 07:18:39.034314       1 node_metric.go:274] Unknown node feature: cgroupfs_ioread_bytes, adding 0 value
I0818 07:18:39.034318       1 node_metric.go:274] Unknown node feature: cgroupfs_iowrite_bytes, adding 0 value
I0818 07:18:39.034322       1 node_metric.go:274] Unknown node feature: block_devices_used, adding 0 value
I0818 07:18:39.034326       1 node_metric.go:274] Unknown node feature: kubelet_cpu_usage, adding 0 value
I0818 07:18:39.034331       1 node_metric.go:274] Unknown node feature: kubelet_memory_bytes, adding 0 value
I0818 07:18:39.034337       1 node_metric.go:274] Unknown node feature: block_devices_used, adding 0 value
I0818 07:18:39.034412       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0818 07:18:39.034426       1 node_metric.go:274] Unknown node feature: bpf_cpu_time_us, adding 0 value
I0818 07:18:39.034430       1 node_metric.go:274] Unknown node feature: bpf_net_tx_irq, adding 0 value
I0818 07:18:39.034434       1 node_metric.go:274] Unknown node feature: bpf_net_rx_irq, adding 0 value
I0818 07:18:39.034437       1 node_metric.go:274] Unknown node feature: bpf_block_irq, adding 0 value
I0818 07:18:39.034440       1 node_metric.go:274] Unknown node feature: cgroupfs_memory_usage_bytes, adding 0 value
I0818 07:18:39.034444       1 node_metric.go:274] Unknown node feature: cgroupfs_kernel_memory_usage_bytes, adding 0 value
I0818 07:18:39.034450       1 node_metric.go:274] Unknown node feature: cgroupfs_tcp_memory_usage_bytes, adding 0 value
I0818 07:18:39.034453       1 node_metric.go:274] Unknown node feature: cgroupfs_cpu_usage_us, adding 0 value
I0818 07:18:39.034456       1 node_metric.go:274] Unknown node feature: cgroupfs_system_cpu_usage_us, adding 0 value
I0818 07:18:39.034460       1 node_metric.go:274] Unknown node feature: cgroupfs_user_cpu_usage_us, adding 0 value
I0818 07:18:39.034466       1 node_metric.go:274] Unknown node feature: cgroupfs_ioread_bytes, adding 0 value
I0818 07:18:39.034473       1 node_metric.go:274] Unknown node feature: cgroupfs_iowrite_bytes, adding 0 value
I0818 07:18:39.034476       1 node_metric.go:274] Unknown node feature: block_devices_used, adding 0 value
I0818 07:18:39.034479       1 node_metric.go:274] Unknown node feature: kubelet_cpu_usage, adding 0 value
I0818 07:18:39.034482       1 node_metric.go:274] Unknown node feature: kubelet_memory_bytes, adding 0 value
I0818 07:18:39.034486       1 node_metric.go:274] Unknown node feature: block_devices_used, adding 0 value
I0818 07:18:39.034509       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0818 07:18:39.034535       1 node_metric.go:274] Unknown node feature: bpf_cpu_time_us, adding 0 value
I0818 07:18:39.034539       1 node_metric.go:274] Unknown node feature: bpf_net_tx_irq, adding 0 value
I0818 07:18:39.034543       1 node_metric.go:274] Unknown node feature: bpf_net_rx_irq, adding 0 value
I0818 07:18:39.034547       1 node_metric.go:274] Unknown node feature: bpf_block_irq, adding 0 value
I0818 07:18:39.034550       1 node_metric.go:274] Unknown node feature: cgroupfs_memory_usage_bytes, adding 0 value
I0818 07:18:39.034554       1 node_metric.go:274] Unknown node feature: cgroupfs_kernel_memory_usage_bytes, adding 0 value
I0818 07:18:39.034558       1 node_metric.go:274] Unknown node feature: cgroupfs_tcp_memory_usage_bytes, adding 0 value
I0818 07:18:39.034565       1 node_metric.go:274] Unknown node feature: cgroupfs_cpu_usage_us, adding 0 value
I0818 07:18:39.034573       1 node_metric.go:274] Unknown node feature: cgroupfs_system_cpu_usage_us, adding 0 value
I0818 07:18:39.034576       1 node_metric.go:274] Unknown node feature: cgroupfs_user_cpu_usage_us, adding 0 value
I0818 07:18:39.034579       1 node_metric.go:274] Unknown node feature: cgroupfs_ioread_bytes, adding 0 value
I0818 07:18:39.034582       1 node_metric.go:274] Unknown node feature: cgroupfs_iowrite_bytes, adding 0 value
I0818 07:18:39.034586       1 node_metric.go:274] Unknown node feature: block_devices_used, adding 0 value
I0818 07:18:39.034590       1 node_metric.go:274] Unknown node feature: kubelet_cpu_usage, adding 0 value
I0818 07:18:39.034594       1 node_metric.go:274] Unknown node feature: kubelet_memory_bytes, adding 0 value
I0818 07:18:39.034598       1 node_metric.go:274] Unknown node feature: block_devices_used, adding 0 value
I0818 07:18:39.034617       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0818 07:18:39.035036       1 exporter.go:270] Started Kepler in 6.169258769s
I0818 07:18:42.035822       1 libbpf_attacher.go:344] successfully get data with batch get and delete with 337 pids in 669.77µs
I0818 07:18:42.049556       1 container_hc_collector.go:104] failed to resolve container for cGroup ID 15959 (command=systemd): process is not in a kubernetes pod, set containerID=system_processes
I0818 07:18:42.049900       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.050503       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.050623       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.050839       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.051015       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.051177       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.051330       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.051484       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.051626       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.051765       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.051920       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.052083       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.052223       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.052360       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.052511       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.052652       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.052792       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.052932       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.053071       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.053079       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.053217       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.053348       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.053486       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.053635       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.053802       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.053941       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.054078       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.054214       1 container_cgroup_collector.go:34] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0818 07:18:42.069870       1 container_cgroup_collector.go:64] Kubelet Read: map[kepler/kepler-exporter-rjmb2/estimator:2.411985 kepler/kepler-exporter-rjmb2/kepler-exporter:0.023722 kube-system/coredns-558bd4d5db-c94m2/coredns:1455.313128 kube-system/coredns-558bd4d5db-jmnkg/coredns:1418.22282 kube-system/etcd-understudent/etcd:7976.54621 kube-system/kube-apiserver-understudent/kube-apiserver:28524.042649 kube-system/kube-controller-manager-understudent/kube-controller-manager:8798.010774 kube-system/kube-flannel-ds-8r9tk/kube-flannel:1163.763703 kube-system/kube-proxy-qv7xd/kube-proxy:3099.603793 kube-system/kube-scheduler-understudent/kube-scheduler:1959.886663 ricinfra/deployment-tiller-ricxapp-6ff54cb9c-lj52v/tiller:34.630531 ricinfra/nfs-release-1-nfs-server-provisioner-0/nfs-server-provisioner:441.318558 ricinfra/r4-chartmuseum-chartmuseum-84477884b6-j5xfr/chartmuseum:46.565242 ricplt/deployment-ricplt-a1mediator-657fcf7d86-qshwb/container-ricplt-a1mediator:127.657006 ricplt/deployment-ricplt-alarmmanager-876fdcb49-62pkl/container-ricplt-alarmmanager:91.867715 ricplt/deployment-ricplt-appmgr-56c6d6b7f7-9ngnz/container-ricplt-appmgr:10.084942 ricplt/deployment-ricplt-e2mgr-7b4cf44d-2c22p/container-ricplt-e2mgr:108.563311 ricplt/deployment-ricplt-e2term-alpha-659f47f757-wzqmg/container-ricplt-e2term:353.457684 ricplt/deployment-ricplt-jaegeradapter-5bf9b64956-x9rrk/container-ricplt-jaegeradapter:98.520751 ricplt/deployment-ricplt-o1mediator-6b76c787f9-4wtkm/container-ricplt-o1mediator:359.166425 ricplt/deployment-ricplt-rtmgr-76dbf8ccff-9kvxl/container-ricplt-rtmgr:87.989749 ricplt/deployment-ricplt-submgr-6c57cd586-qptdk/container-ricplt-submgr:114.553443 ricplt/deployment-ricplt-vespamgr-556b9988b5-5ggq4/container-ricplt-vespamgr:28154.097207 ricplt/r4-influxdb-influxdb2-0/influxdb2:88.704533 ricplt/r4-infrastructure-kong-5b7cdc9dbc-d2clk/ingress-controller:278.699412 ricplt/r4-infrastructure-kong-5b7cdc9dbc-d2clk/proxy:38.79724 
ricplt/r4-infrastructure-prometheus-alertmanager-7cc48c5988-6jd7q/prometheus-alertmanager:217.425073 ricplt/r4-infrastructure-prometheus-alertmanager-7cc48c5988-6jd7q/prometheus-alertmanager-configmap-reload:3.404702 ricplt/r4-infrastructure-prometheus-server-7f74bdfc6d-jtbb6/prometheus-server:4711.465415 ricplt/statefulset-ricplt-dbaas-server-0/container-ricplt-dbaas-redis:1017.609796 system/system_processes:105563.09219], map[kepler/kepler-exporter-rjmb2/estimator:1.74624768e+08 kepler/kepler-exporter-rjmb2/kepler-exporter:3.088384e+06 kube-system/coredns-558bd4d5db-c94m2/coredns:1.5798272e+07 kube-system/coredns-558bd4d5db-jmnkg/coredns:1.5896576e+07 kube-system/etcd-understudent/etcd:5.8130432e+07 kube-system/kube-apiserver-understudent/kube-apiserver:3.80891136e+08 kube-system/kube-controller-manager-understudent/kube-controller-manager:5.8273792e+07 kube-system/kube-flannel-ds-8r9tk/kube-flannel:1.2333056e+07 kube-system/kube-proxy-qv7xd/kube-proxy:2.08896e+07 kube-system/kube-scheduler-understudent/kube-scheduler:2.24256e+07 ricinfra/deployment-tiller-ricxapp-6ff54cb9c-lj52v/tiller:7.319552e+06 ricinfra/nfs-release-1-nfs-server-provisioner-0/nfs-server-provisioner:1.175552e+08 ricinfra/r4-chartmuseum-chartmuseum-84477884b6-j5xfr/chartmuseum:8.884224e+06 ricplt/deployment-ricplt-a1mediator-657fcf7d86-qshwb/container-ricplt-a1mediator:3.9800832e+07 ricplt/deployment-ricplt-alarmmanager-876fdcb49-62pkl/container-ricplt-alarmmanager:1.0735616e+07 ricplt/deployment-ricplt-appmgr-56c6d6b7f7-9ngnz/container-ricplt-appmgr:9.162752e+06 ricplt/deployment-ricplt-e2mgr-7b4cf44d-2c22p/container-ricplt-e2mgr:7.430144e+06 ricplt/deployment-ricplt-e2term-alpha-659f47f757-wzqmg/container-ricplt-e2term:2.0590592e+07 ricplt/deployment-ricplt-jaegeradapter-5bf9b64956-x9rrk/container-ricplt-jaegeradapter:7.254016e+06 ricplt/deployment-ricplt-o1mediator-6b76c787f9-4wtkm/container-ricplt-o1mediator:3.4975744e+07 
ricplt/deployment-ricplt-rtmgr-76dbf8ccff-9kvxl/container-ricplt-rtmgr:1.2763136e+07 ricplt/deployment-ricplt-submgr-6c57cd586-qptdk/container-ricplt-submgr:1.089536e+07 ricplt/deployment-ricplt-vespamgr-556b9988b5-5ggq4/container-ricplt-vespamgr:1.16887552e+08 ricplt/r4-influxdb-influxdb2-0/influxdb2:5.193728e+07 ricplt/r4-infrastructure-kong-5b7cdc9dbc-d2clk/ingress-controller:1.2939264e+07 ricplt/r4-infrastructure-kong-5b7cdc9dbc-d2clk/proxy:5.47414016e+08 ricplt/r4-infrastructure-prometheus-alertmanager-7cc48c5988-6jd7q/prometheus-alertmanager:9.572352e+06 ricplt/r4-infrastructure-prometheus-alertmanager-7cc48c5988-6jd7q/prometheus-alertmanager-configmap-reload:1.814528e+06 ricplt/r4-infrastructure-prometheus-server-7f74bdfc6d-jtbb6/prometheus-server:1.90717952e+08 ricplt/statefulset-ricplt-dbaas-server-0/container-ricplt-dbaas-redis:2.297856e+06 system/system_processes:-1.983299584e+09]
I0818 07:18:42.070664       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0818 07:18:42.070806       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0818 07:18:42.070878       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
E0818 07:18:42.071058       1 container_energy.go:130] Container Component Power Model was not created
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x68 pc=0xd173d1]

goroutine 9 [running]:
github.com/sustainable-computing-io/kepler/pkg/model.UpdateContainerEnergy(0xc000304900?, 0xe?)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/model/container_energy.go:134 +0x111
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateContainerEnergy(...)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/collector/container_energy_collector.go:25
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xc000304900)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/collector/metric_collector.go:121 +0x128
github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/manager/manager.go:72 +0x7b
created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/manager/manager.go:64 +0x6a
Energy stat: map[0:166626 (166626)] (0)Energy stat: map[0:499878 (499878)] (0)Energy stat: map[0:0 (0)] (0)Energy stat: map[0:5004 (5004)] (0)Energy stat: map[0:55542 (222168)] (0)Energy stat: map[0:55542 (555420)] (0)Energy stat: map[0:0 (0)] (0)Energy stat: map[0:1668 (6672)] (0)Energy stat: map[0:166626 (388794)] (0)Energy stat: map[0:499878 (1055298)] (0)Energy stat: map[0:0 (0)] (0)Energy stat: map[0:5004 (11676)] (0)Energy stat: map[0:55542 (444336)] (0)Energy stat: map[0:55542 (1110840)] (0)Energy stat: map[0:0 (0)] (0)Energy stat: map[0:1668 (13344)] (0)
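
The trace above logs "Container Component Power Model was not created" at container_energy.go:130 and then panics a few lines later at container_energy.go:134, which is consistent with a nil model being used after the error was logged. The following is only a minimal sketch of a guard for that situation, not Kepler's actual code: `powerModel`, `errModelMissing`, and this simplified `updateContainerEnergy` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// powerModel is a hypothetical stand-in for a container power estimator.
type powerModel struct{ weight float64 }

// estimate reads a field of m, so calling it on a nil pointer would panic.
func (m *powerModel) estimate(usage float64) float64 { return m.weight * usage }

// errModelMissing is an illustrative error value (assumption, not from Kepler).
var errModelMissing = errors.New("container component power model was not created")

// updateContainerEnergy returns an error instead of dereferencing a nil model,
// so the caller can skip this update cycle rather than crash the exporter.
func updateContainerEnergy(m *powerModel, usage float64) (float64, error) {
	if m == nil {
		return 0, errModelMissing
	}
	return m.estimate(usage), nil
}

func main() {
	// Model missing: the error is reported and no dereference happens.
	if _, err := updateContainerEnergy(nil, 1.5); err != nil {
		fmt.Println("skipping update:", err)
	}
	// Model present: the estimate is computed normally.
	energy, _ := updateContainerEnergy(&powerModel{weight: 2}, 1.5)
	fmt.Println("estimated energy:", energy)
}
```

Returning the error to the caller keeps the collector loop alive when the model cannot be loaded; whether skipping the update is the right behavior for Kepler is a design decision for the maintainers.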
rootfs commented 1 year ago

@tobby-yuan the models have been relocated; please do a fresh install with the latest manifests and Kepler image.

tobby-yuan commented 1 year ago

Hi @rootfs, I looked at the Kepler documentation. The deployment scenario section has been removed, so is the sidecar scenario no longer available?

Moreover, I did a fresh install from the latest branch of Kepler, cloned with git clone https://github.com/sustainable-computing-io/kepler.git. Then I built the manifests using make build-manifest OPTS="MODEL_SERVER_DEPLOY". The Kepler model server pod is stuck in ImagePullBackOff. The following is the description of the Kepler model server pod:

kubectl describe pods -n kepler kepler-model-server-6df95dbd98-c7bf2
Name:         kepler-model-server-6df95dbd98-c7bf2
Namespace:    kepler
Priority:     0
Node:         understudent/10.0.10.202
Start Time:   Sat, 19 Aug 2023 05:03:44 +0000
Labels:       app.kubernetes.io/component=model-server
              app.kubernetes.io/name=kepler-model-server
              pod-template-hash=6df95dbd98
              sustainable-computing.io/app=kepler
Annotations:  <none>
Status:       Pending
IP:           10.244.0.60
IPs:
  IP:           10.244.0.60
Controlled By:  ReplicaSet/kepler-model-server-6df95dbd98
Containers:
  server-api:
    Container ID:
    Image:         kepler_model_server
    Image ID:
    Port:          8100/TCP
    Host Port:     0/TCP
    Command:
      python3.8
    Args:
      -u
      src/server/model_server.py
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from model-data (rw)
      /etc/kepler/kepler.config from cfm (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g8n4g (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-model-server-cfm
    Optional:  false
  model-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-g8n4g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  57s                default-scheduler  Successfully assigned kepler/kepler-model-server-6df95dbd98-c7bf2 to understudent
  Normal   BackOff    27s (x2 over 54s)  kubelet            Back-off pulling image "kepler_model_server"
  Warning  Failed     27s (x2 over 54s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    13s (x3 over 57s)  kubelet            Pulling image "kepler_model_server"
  Warning  Failed     11s (x3 over 54s)  kubelet            Failed to pull image "kepler_model_server": rpc error: code = Unknown desc = Error response from daemon: pull access denied for kepler_model_server, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning  Failed     11s (x3 over 54s)  kubelet            Error: ErrImagePull

The following are the logs of the Kepler exporter:

kubectl logs -n kepler kepler-exporter-h6wcw
I0819 05:03:50.748416       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0819 05:03:50.753229       1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0819 05:03:50.762202       1 exporter.go:158] Kepler running on version: b57ffa3
I0819 05:03:50.762233       1 config.go:267] using gCgroup ID in the BPF program: true
I0819 05:03:50.762294       1 config.go:269] kernel version: 5.15
I0819 05:03:50.762356       1 config.go:200] kernel source dir is set to /usr/share/kepler/kernel_sources
I0819 05:03:50.762424       1 exporter.go:184] EnabledBPFBatchDelete: true
I0819 05:03:50.762465       1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I0819 05:03:50.762547       1 power.go:71] Unable to obtain power, use estimate method
I0819 05:03:50.762589       1 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I0819 05:03:50.762595       1 power.go:56] use acpi to obtain power
I0819 05:03:50.762782       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0819 05:03:50.787792       1 container_energy.go:109] Using the Ratio/AbsModelWeight Power Model to estimate Container Platform Power
I0819 05:03:50.787819       1 container_energy.go:118] Using the Ratio/AbsModelWeight Power Model to estimate Container Component Power
I0819 05:03:50.787840       1 process_power.go:108] Using the Ratio/AbsModelWeight Power Model to estimate Process Platform Power
I0819 05:03:50.787850       1 process_power.go:117] Using the Ratio/AbsModelWeight Power Model to estimate Process Component Power
I0819 05:03:50.803348       1 node_platform_energy.go:52] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0819 05:03:50.809533       1 node_component_energy.go:56] Using the LinearRegressor/AbsPower Power Model to estimate Node Component Power
I0819 05:03:50.809552       1 exporter.go:207] Initializing the GPU collector
I0819 05:03:56.810069       1 watcher.go:66] Using in cluster k8s config
I0819 05:03:56.911214       1 bpf_perf.go:123] LibbpfBuilt: false, BccBuilt: true
modprobe: FATAL: Module kheaders not found in directory /lib/modules/5.15.0-78-generic
chdir(/lib/modules/5.15.0-78-generic/build): No such file or directory
I0819 05:03:56.916961       1 bcc_attacher.go:80] failed to attach the bpf program: <nil>
I0819 05:03:56.916991       1 bcc_attacher.go:155] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=4 -DSET_GROUP_ID]: failed to attach the bpf program: <nil>, from default kernel source.
I0819 05:03:56.917017       1 bcc_attacher.go:158] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64
cannot attach kprobe, probe entry may not exist
I0819 05:03:57.747601       1 bcc_attacher.go:94] attaching kprobe to finish_task_switch failed, trying finish_task_switch.isra.0 instead
I0819 05:03:57.764705       1 bcc_attacher.go:164] Successfully loaded eBPF module with options: [-DMAP_SIZE=10240 -DNUM_CPUS=4 -DSET_GROUP_ID] from kernel source "/usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64"
I0819 05:03:57.764746       1 bcc_attacher.go:183] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=4 -DSET_GROUP_ID]
I0819 05:03:57.781869       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0819 05:03:57.781959       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0819 05:03:57.782036       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0819 05:03:57.782783       1 exporter.go:270] Started Kepler in 7.020205608s
I0819 05:04:00.828815       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0819 05:04:00.828844       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0819 05:04:00.828870       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0819 05:04:03.793478       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.8809244744518 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 124.69008698216625} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 -2.582378633216328}]}} uncore:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}}]
I0819 05:04:03.793515       1 node_platform_energy.go:81] Failed to get node platform power model Weight for model type AbsPower is not valid: &map[core:{{0 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 0} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0}]}} dram:{{2.6353736295072387 map[] map[kubelet_cpu_usage:{7.027426160337554 103.57941658210045 1.33012561695696} kubelet_memory_bytes:{417264.87763713085 2.0114948671646638e+12 0.16372489537216126}]}} package:{{140.
Lai-Kenny commented 1 year ago

Hi @rootfs @tobby-yuan. I found make build-manifest OPTS="ESTIMATOR_SIDECAR_DEPLOY" and make build-manifest OPTS="MODEL_SERVER_DEPLOY". It looks like the deployment options have switched to the model server, right?

I am also hitting the same kepler_model_server ImagePullBackOff error. Can anyone suggest a solution? Many thanks.

sunya-ch commented 1 year ago

@Lai-Kenny I think it might be related to the issue fixed by https://github.com/sustainable-computing-io/kepler/pull/891. Could you refetch the upstream repo and try again? If the problem persists, could you share the output of:

# for model server case
kubectl get deploy -n kepler -oyaml
# for sidecar case
kubectl get ds -n kepler -oyaml
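
A note on the ImagePullBackOff itself: the events above show the kubelet pulling the bare name "kepler_model_server" through the Docker daemon. Docker treats a reference without a registry host as a Docker Hub image, which would explain the "repository does not exist" message. The sketch below illustrates that default expansion rule. `resolve_image` is a hypothetical helper written only for this illustration (it is not part of Kepler or kubectl), and other runtimes such as containerd or CRI-O may resolve short names differently.

```shell
#!/bin/sh
# Expand an image reference the way Docker does by default.
# The first path component counts as a registry host only if it
# contains '.' or ':' or is exactly "localhost".
resolve_image() {
    ref=$1
    first=${ref%%/*}
    case "$ref" in
        */*)
            case "$first" in
                *.*|*:*|localhost) echo "$ref" ;;           # explicit registry host
                *)                 echo "docker.io/$ref" ;; # Docker Hub user/org repository
            esac ;;
        *)  echo "docker.io/library/$ref" ;;                # bare name, official-images namespace
    esac
}

resolve_image kepler_model_server        # prints docker.io/library/kepler_model_server
resolve_image quay.io/example/image:tag  # prints quay.io/example/image:tag
```

To see which image the Deployment actually references, something like `kubectl get deploy -n kepler kepler-model-server -o jsonpath='{.spec.template.spec.containers[*].image}'` should work. The Deployment name here is inferred from the ReplicaSet name in the describe output, so adjust it if yours differs.
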
kgamanji commented 1 year ago

This might be helpful to someone who is testing Kepler locally on their kind cluster:

I set up Kepler on a kind cluster on my machine. Initially, all the metrics were zero. I followed the troubleshooting guide and confirmed that I was using cgroup v2, so that part was fine. What fixed it was switching the Kepler image to the latest-libbpf tag.
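
For anyone who wants to repeat the cgroup check, here is a minimal sketch. It assumes a Linux host with GNU or BusyBox `stat`, and `cgroup_version` is a hypothetical helper name. On a cgroup v2 (unified) host the filesystem type of /sys/fs/cgroup is reported as `cgroup2fs`; on cgroup v1 it is typically `tmpfs`.

```shell
#!/bin/sh
# Report which cgroup version is mounted at the given path
# (default: /sys/fs/cgroup), based on the filesystem type.
cgroup_version() {
    fstype=$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null || true)
    case "$fstype" in
        cgroup2fs) echo "v2" ;;      # unified hierarchy
        tmpfs)     echo "v1" ;;      # legacy per-controller mounts under a tmpfs
        *)         echo "unknown" ;; # path missing or unexpected filesystem
    esac
}

cgroup_version
```

On a kind cluster the check has to run inside the node container rather than on the host, for example by running the same `stat -fc %T /sys/fs/cgroup` command through `docker exec` on the node container.
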

rootfs commented 1 year ago

thank you @kgamanji for the confirmation!

We'll make the default Kepler latest image use libbpf in the upcoming release to support all kernels.
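
The exporter log earlier in this thread shows the BCC path failing to find headers for the running kernel (/lib/modules/5.15.0-78-generic/build) and then loading the eBPF module from bundled 4.18 kernel sources. Below is a minimal sketch for checking whether a node has headers for its running kernel. `has_kernel_headers` is a hypothetical helper, and whether missing headers are the actual cause of the zero metrics is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Return 0 if a kernel build/header directory exists for the given release.
# $1: kernel release (default: the running kernel)
# $2: modules root   (default: /lib/modules)
has_kernel_headers() {
    release=${1:-$(uname -r)}
    root=${2:-/lib/modules}
    [ -d "$root/$release/build" ]
}

if has_kernel_headers; then
    echo "kernel headers found for $(uname -r)"
else
    echo "no kernel headers for $(uname -r)"
fi
```
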

stale[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sunya-ch commented 10 months ago

Shall we close the issue?

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.