sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

EKS-A - problem to make kepler work properly - very high / non realistic values #1049

Open barby1138 opened 12 months ago

barby1138 commented 12 months ago

Hi, I'm having trouble getting Kepler to work properly.

The platform is EKS-A and the OS is Ubuntu 20.04.

In Grafana I see very high, unrealistic values.

root@server3:~/tsis/kepler/kepler# kubectl logs kepler-exporter-p46rb -n kepler
I1113 09:19:19.651274 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I1113 09:19:19.658826 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I1113 09:19:19.670964 1 exporter.go:157] Kepler running on version: 9043798
I1113 09:19:19.670987 1 config.go:274] using gCgroup ID in the BPF program: true
I1113 09:19:19.671021 1 config.go:276] kernel version: 5.1
I1113 09:19:19.671135 1 config.go:301] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I1113 09:19:19.671152 1 exporter.go:169] LibbpfBuilt: false, BccBuilt: true
I1113 09:19:19.671212 1 config.go:207] kernel source dir is set to /usr/share/kepler/kernel_sources
I1113 09:19:19.671274 1 exporter.go:188] EnabledBPFBatchDelete: true
I1113 09:19:19.671338 1 power.go:54] use sysfs to obtain power
I1113 09:19:19.671378 1 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I1113 09:19:19.671475 1 power.go:56] use acpi to obtain power
I1113 09:19:19.788636 1 exporter.go:203] Initializing the GPU collector
I1113 09:19:25.793034 1 watcher.go:66] Using in cluster k8s config
I1113 09:19:25.893463 1 watcher.go:134] k8s APIserver watcher was started
W1113 09:19:28.162204 1 bcc_attacher.go:113] failed to load kprobeset_page_dirty: Module: unable to find kprobeset_page_dirty
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1113 09:19:28.263231 1 bcc_attacher.go:119] failed to attach kprobe/set_page_dirty or mark_buffer_dirty: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
W1113 09:19:28.263267 1 bcc_attacher.go:125] failed to load kprobemark_page_accessed: Module: unable to find kprobemark_page_accessed
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1113 09:19:28.315246 1 bcc_attacher.go:129] failed to attach kprobe/mark_page_accessed: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
I1113 09:19:28.347348 1 bcc_attacher.go:150] Successfully load eBPF module from using bcc
I1113 09:19:28.347380 1 bcc_attacher.go:208] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=28 -DSAMPLE_RATE=0]
I1113 09:19:28.372048 1 container_energy.go:114] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I1113 09:19:28.372062 1 container_energy.go:115] Container feature names: [cpu_instructions]
I1113 09:19:28.372080 1 container_energy.go:124] Using the Ratio/DynPower Power Model to estimate Container Component Power
I1113 09:19:28.372090 1 container_energy.go:125] Container feature names: [cpu_instructions cpu_instructions cache_miss gpu_sm_util]
I1113 09:19:28.372151 1 process_power.go:113] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I1113 09:19:28.372163 1 process_power.go:114] Container feature names: [cpu_instructions]
I1113 09:19:28.372180 1 process_power.go:123] Using the Ratio/DynPower Power Model to estimate Process Component Power
I1113 09:19:28.372229 1 process_power.go:124] Container feature names: [cpu_instructions cpu_instructions cache_miss gpu_sm_util]
I1113 09:19:28.373514 1 exporter.go:267] Started Kepler in 8.702580335s

The problem seems to be here:

W1113 09:19:28.263231 1 bcc_attacher.go:119] failed to attach kprobe/set_page_dirty or mark_buffer_dirty: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
W1113 09:19:28.263267 1 bcc_attacher.go:125] failed to load kprobemark_page_accessed: Module: unable to find kprobemark_page_accessed
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor

Could you please assist?

Thank you!!!

marceloamaral commented 12 months ago

Can you use the libbpf images? quay.io/sustainable_computing_io/kepler:latest-libbpf

barby1138 commented 12 months ago

Hi, thanks for the response. The log looks good now, but the values are still unrealistic. Check the pictures. Note that most pods show the same (uninitialized?) numbers.

[Images: Screenshot (57), Screenshot (58)]

marceloamaral commented 12 months ago

@barby1138, just for confirmation, EKS operates on VMs, correct? In the logs, Kepler is gathering hardware counters (HC) like CPU instructions. However, VMs usually don't expose hardware counters, which could be a potential problem.

Could you please verify in Prometheus if Kepler metrics are displaying CPU instructions and eBPF time? And which values are you seeing?

barby1138 commented 12 months ago

Hi, this is EKS Anywhere. It's installed on a bare-metal server (24 cores).

marceloamaral commented 12 months ago

@barby1138 do you know the CPU model? Is it Intel x86?

The problem seems to be related to the OTHER power consumption. Maybe there is a problem measuring the platform power (i.e., total node power).

Can you plot the following Prometheus metrics?

irate(kepler_node_dram_joules_total{}[1m])
irate(kepler_node_other_joules_total{}[1m])
irate(kepler_node_package_joules_total{}[1m])
irate(kepler_node_platform_joules_total{}[1m])

kepler_container_cpu_instructions_total
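For readers following along: the `*_joules_total` metrics are cumulative energy counters, so `irate(...)` over them gives joules per second, which is watts. The sketch below approximates that calculation from two samples. The function name and the sample numbers are illustrative assumptions and do not come from Kepler or from this node.

```python
def instant_rate(prev, curr):
    """Approximate PromQL irate(): per-second increase between two samples.

    prev and curr are (unix_timestamp_seconds, counter_value) tuples.
    For a joules counter the result is in watts (J/s).
    """
    (t0, v0), (t1, v1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        # Counter reset: treat the new value as the increase since the reset.
        delta = v1
    return delta / (t1 - t0)


# Illustrative samples only: 300 J consumed over 3 s.
watts = instant_rate((1000.0, 5000.0), (1003.0, 5300.0))
print(watts)
```

A node drawing a few hundred watts should therefore show an `irate` in the hundreds, which gives a quick plausibility check for the plots.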

MahimaK20 commented 11 months ago

Hello @marceloamaral, in which branch can we find the code changes for this image: quay.io/sustainable_computing_io/kepler:latest-libbpf?

barby1138 commented 11 months ago

Hi

Please check the requested values:

root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_other_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kepler_node_other_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter"},"value":[1701687056.612,"0"]},{"metric":{"__name__":"kepler_node_other_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter"},"value":[1701687056.612,"257694471.464"]}]}}

root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_package_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kepler_node_package_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687091.060,"86.851"]},{"metric":{"__name__":"kepler_node_package_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687091.060,"4957.6"]}]}

root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_platform_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kepler_node_platform_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"acpi"},"value":[1701687121.796,"0"]},{"metric":{"__name__":"kepler_node_platform_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"acpi"},"value":[1701687121.796,"515396075.4"]}]}}

root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_dram_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kepler_node_dram_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687157.973,"13.832"]},{"metric":{"__name__":"kepler_node_dram_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687157.973,"793.656"]}]}}
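The raw responses are hard to compare by eye. As a rough aid, the sketch below pulls the value per `mode` label out of a Prometheus instant-query response. The helper name `values_by_mode` is hypothetical, and the sample string is a trimmed copy of the `kepler_node_platform_joules_total` response above with most labels removed.

```python
import json


def values_by_mode(response_text):
    """Return {mode: float_value} from a Prometheus instant-query response body."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query failed: %r" % body.get("status"))
    out = {}
    for series in body["data"]["result"]:
        mode = series["metric"].get("mode", "")
        _timestamp, value = series["value"]  # value arrives as a string
        out[mode] = float(value)
    return out


# Trimmed copy of the platform response from this thread (labels reduced).
sample = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"mode":"dynamic","source":"acpi"},"value":[1701687121.796,"0"]},'
    '{"metric":{"mode":"idle","source":"acpi"},"value":[1701687121.796,"515396075.4"]}'
    "]}}"
)
print(values_by_mode(sample))
```

Applied to the four responses, this makes it easy to see that the idle `platform` and `other` counters are several orders of magnitude larger than the RAPL-sourced `package` and `dram` counters.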

barby1138 commented 11 months ago

Hi - any updates? thanks

marceloamaral commented 11 months ago

@barby1138 as we can see the platform power is too high.

Can you run the sensors command on the host to get the node power?

We try to get the node power from the following paths:

hwmonPowerPath      = "/sys/class/hwmon/hwmon2/device/"
acpiPowerPath       = "/sys/devices/LNXSYSTM:00"

The file name is power*_average.

Can you please also check those files?
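A quick way to check those files is sketched below. It walks the two directories quoted above and prints every file whose name matches `power*_average`. How Kepler itself searches these paths is not shown in this thread, so the recursive walk and the function name `read_power_average` are assumptions made for illustration.

```python
import fnmatch
import os

# Directories quoted in the comment above.
HWMON_POWER_PATH = "/sys/class/hwmon/hwmon2/device/"
ACPI_POWER_PATH = "/sys/devices/LNXSYSTM:00"


def read_power_average(directories):
    """Return {file_path: raw_integer_value} for every power*_average file found."""
    readings = {}
    for directory in directories:
        # os.walk does not follow symlinked subdirectories by default,
        # which avoids looping through sysfs link cycles.
        for dirpath, _dirnames, filenames in os.walk(directory):
            for name in sorted(fnmatch.filter(filenames, "power*_average")):
                path = os.path.join(dirpath, name)
                try:
                    with open(path) as f:
                        readings[path] = int(f.read().strip())
                except (OSError, ValueError):
                    # Unreadable or non-numeric file: skip it.
                    continue
    return readings


if __name__ == "__main__":
    found = read_power_average([HWMON_POWER_PATH, ACPI_POWER_PATH])
    for path, raw in found.items():
        print(path, raw)
```

Directories that do not exist are silently skipped, so the script can be run unchanged on hosts where only one of the two paths is present.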

barby1138 commented 10 months ago

Hi

root@server3:~# cat /sys/class/hwmon/hwmon2/device/power*_average
4294967295000

root@server3:~# ls /sys/devices/LNXSYSTM:00/LNXPWRBN:00
driver hid input modalias power subsystem uevent wakeup

thanks

barby1138 commented 10 months ago

some additional info

root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_average_interval_max
3600000
root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_average_interval
1000
root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_accuracy
97.500%
root@server3:~# cat /sys/class/hwmon/hwmon2/device/status
15

barby1138 commented 10 months ago

Any input? Thanks.

marceloamaral commented 10 months ago

@barby1138 your system power is too high.

In my system, I have:

cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average
579000000

The value is in microwatts, so that is 579 W.

Your system has:

 cat /sys/class/hwmon/hwmon2/device/power*_average
4294967295000

So it reports 4294967 W, which is far too high for a gauge value.

I am guessing here: could this power value be a counter on your system? Can you check how the value changes over time?

Anyway, it seems to be an OS bug, so it might be better to ignore the platform and other power consumption.
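One observation that is not made in the thread: 4294967295 is 2^32 - 1, so the reported reading is exactly an all-ones 32-bit value multiplied by 1000. That pattern would be consistent with a placeholder or "unavailable" value rather than a real measurement, although nothing in this thread confirms that interpretation. The sketch below reproduces the unit conversion and the check; both helper names are hypothetical.

```python
UINT32_MAX = 0xFFFFFFFF  # 4294967295, i.e. 2**32 - 1


def microwatts_to_watts(raw_microwatts):
    """Convert a power*_average reading (microwatts) to watts."""
    return raw_microwatts / 1_000_000


def looks_like_all_ones(raw_microwatts):
    """True when the reading equals an all-ones 32-bit value scaled by 1000."""
    return raw_microwatts == UINT32_MAX * 1000


# The two readings quoted in this thread.
for raw in (579000000, 4294967295000):
    print(raw, microwatts_to_watts(raw), looks_like_all_ones(raw))
```

A check like this could be used to discard the reading before it is exported, but whether that matches the eventual fix is not stated here.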

marceloamaral commented 9 months ago

The fix in #1222 should also affect this issue.