EKS-A - problem to make kepler work properly - very high / non realistic values

barby1138 commented 12 months ago

Hi, I'm having trouble getting Kepler to work properly.

The platform is EKS-A and the OS is Ubuntu 20.04.

In Grafana I see very high, unrealistic values.

root@server3:~/tsis/kepler/kepler# kubectl logs kepler-exporter-p46rb -n kepler
I1113 09:19:19.651274 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I1113 09:19:19.658826 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I1113 09:19:19.670964 1 exporter.go:157] Kepler running on version: 9043798
I1113 09:19:19.670987 1 config.go:274] using gCgroup ID in the BPF program: true
I1113 09:19:19.671021 1 config.go:276] kernel version: 5.1
I1113 09:19:19.671135 1 config.go:301] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I1113 09:19:19.671152 1 exporter.go:169] LibbpfBuilt: false, BccBuilt: true
I1113 09:19:19.671212 1 config.go:207] kernel source dir is set to /usr/share/kepler/kernel_sources
I1113 09:19:19.671274 1 exporter.go:188] EnabledBPFBatchDelete: true
I1113 09:19:19.671338 1 power.go:54] use sysfs to obtain power
I1113 09:19:19.671378 1 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I1113 09:19:19.671475 1 power.go:56] use acpi to obtain power
I1113 09:19:19.788636 1 exporter.go:203] Initializing the GPU collector
I1113 09:19:25.793034 1 watcher.go:66] Using in cluster k8s config
I1113 09:19:25.893463 1 watcher.go:134] k8s APIserver watcher was started
W1113 09:19:28.162204 1 bcc_attacher.go:113] failed to load kprobeset_page_dirty: Module: unable to find kprobeset_page_dirty
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1113 09:19:28.263231 1 bcc_attacher.go:119] failed to attach kprobe/set_page_dirty or mark_buffer_dirty: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
W1113 09:19:28.263267 1 bcc_attacher.go:125] failed to load kprobemark_page_accessed: Module: unable to find kprobemark_page_accessed
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1113 09:19:28.315246 1 bcc_attacher.go:129] failed to attach kprobe/mark_page_accessed: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
I1113 09:19:28.347348 1 bcc_attacher.go:150] Successfully load eBPF module from using bcc
I1113 09:19:28.347380 1 bcc_attacher.go:208] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=28 -DSAMPLE_RATE=0]
I1113 09:19:28.372048 1 container_energy.go:114] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I1113 09:19:28.372062 1 container_energy.go:115] Container feature names: [cpu_instructions]
I1113 09:19:28.372080 1 container_energy.go:124] Using the Ratio/DynPower Power Model to estimate Container Component Power
I1113 09:19:28.372090 1 container_energy.go:125] Container feature names: [cpu_instructions cpu_instructions cache_miss gpu_sm_util]
I1113 09:19:28.372151 1 process_power.go:113] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I1113 09:19:28.372163 1 process_power.go:114] Container feature names: [cpu_instructions]
I1113 09:19:28.372180 1 process_power.go:123] Using the Ratio/DynPower Power Model to estimate Process Component Power
I1113 09:19:28.372229 1 process_power.go:124] Container feature names: [cpu_instructions cpu_instructions cache_miss gpu_sm_util]
I1113 09:19:28.373514 1 exporter.go:267] Started Kepler in 8.702580335s

The problem seems to be here:

W1113 09:19:28.263231 1 bcc_attacher.go:119] failed to attach kprobe/set_page_dirty or mark_buffer_dirty: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
W1113 09:19:28.263267 1 bcc_attacher.go:125] failed to load kprobemark_page_accessed: Module: unable to find kprobemark_page_accessed
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor

Can you please assist?

Thank you!!!

marceloamaral commented 12 months ago

Can you use the libbpf images? quay.io/sustainable_computing_io/kepler:latest-libbpf

barby1138 commented 12 months ago

Hi, thanks for the response. The log looks good now, but the values are still unrealistic; see the screenshots. Note that most pods show the same (uninitialized?) numbers.

Screenshot (57) Screenshot (58)

marceloamaral commented 12 months ago

@barby1138, just for confirmation, EKS operates on VMs, correct? In the logs, Kepler is gathering hardware counters (HC) like CPU instructions. However, VMs usually don't expose hardware counters, which could be a potential problem.

Could you please verify in Prometheus if Kepler metrics are displaying CPU instructions and eBPF time? And which values are you seeing?

barby1138 commented 12 months ago

Hi, this is EKS Anywhere; it's installed on a bare-metal server (24 cores).

marceloamaral commented 12 months ago

@barby1138, do you know the CPU model? Is it Intel x86?

The problem seems to be related to the OTHER power consumption. Maybe there is a problem measuring the platform power (i.e., total node power).

Can you plot the following Prometheus metrics?

irate(kepler_node_dram_joules_total{}[1m])
irate(kepler_node_other_joules_total{}[1m])
irate(kepler_node_package_joules_total{}[1m])
irate(kepler_node_platform_joules_total{}[1m])

kepler_container_cpu_instructions_total
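
Because the `*_joules_total` metrics are cumulative energy counters, their per-second rate is power in watts (1 W = 1 J/s). As a rough sketch of what `irate(...[1m])` computes, the snippet below takes the last two samples in the window and divides the energy delta by the time delta. The function name and the sample values are made up for illustration, and the counter-reset handling that Prometheus performs is left out.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, counter value in joules)
Sample = Tuple[float, float]


def instant_rate_watts(samples: Sequence[Sample]) -> float:
    """Approximate PromQL irate(): per-second increase between the last two samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        raise ValueError("timestamps must be strictly increasing")
    # Joules divided by seconds gives watts.
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical samples: the counter grows by 180 J every 3 s.
window = [(0.0, 1000.0), (3.0, 1180.0), (6.0, 1360.0)]
print(instant_rate_watts(window))
```

A node that really drew a few hundred watts would show a rate of a few hundred here, which gives a quick sanity check for the plots requested above.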

MahimaK20 commented 11 months ago

Hello @marceloamaral, in which branch can we find the code changes for this image: quay.io/sustainable_computing_io/kepler:latest-libbpf?

barby1138 commented 11 months ago

Hi

Please check the requested values.

root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_other_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"name":"kepler_node_other_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter"},"value":[1701687056.612,"0"]},{"metric":{"name":"kepler_node_other_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter"},"value":[1701687056.612,"257694471.464"]}]}}

root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_package_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"name":"kepler_node_package_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687091.060,"86.851"]},{"metric":{"name":"kepler_node_package_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687091.060,"4957.6"]}]}}

root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_platform_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"name":"kepler_node_platform_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"acpi"},"value":[1701687121.796,"0"]},{"metric":{"name":"kepler_node_platform_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"acpi"},"value":[1701687121.796,"515396075.4"]}]}}

root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_dram_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"name":"kepler_node_dram_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687157.973,"13.832"]},{"metric":{"name":"kepler_node_dram_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687157.973,"793.656"]}]}}

barby1138 commented 11 months ago

Hi - any updates? thanks

marceloamaral commented 11 months ago

@barby1138 as we can see the platform power is too high.

Can you run the sensors command on the host to get the node power?

We try to get the node power from the following paths:

hwmonPowerPath      = "/sys/class/hwmon/hwmon2/device/"
acpiPowerPath       = "/sys/devices/LNXSYSTM:00"

The file name is power*_average.

Can you please also check those files?
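
For anyone who wants to check those files in one go, here is a minimal sketch that walks the two directories quoted above and prints every file matching `power*_average`. The function name is hypothetical, this is not Kepler's own lookup code, and whether the directories exist depends on the host.

```python
import fnmatch
import os
from typing import Dict, Iterable

# Directories quoted in the comment above; they may not exist on every host.
POWER_DIRS = ("/sys/class/hwmon/hwmon2/device/", "/sys/devices/LNXSYSTM:00")


def read_power_average(dirs: Iterable[str] = POWER_DIRS) -> Dict[str, int]:
    """Return {file path: raw integer value} for every power*_average file found."""
    readings: Dict[str, int] = {}
    for base in dirs:
        # os.walk does not descend into symlinked subdirectories by default,
        # which avoids looping through sysfs links; missing directories yield nothing.
        for root, _subdirs, files in os.walk(base):
            for name in fnmatch.filter(files, "power*_average"):
                path = os.path.join(root, name)
                try:
                    with open(path) as fh:
                        readings[path] = int(fh.read().strip())
                except (OSError, ValueError):
                    continue  # unreadable or non-numeric file: skip it
    return readings


if __name__ == "__main__":
    for path, raw in sorted(read_power_average().items()):
        print(f"{path}: {raw}")
```

The pattern matches the whole file name, so neighbours such as `power1_average_interval` are ignored, just as with the shell glob.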

barby1138 commented 10 months ago

Hi

root@server3:~# cat /sys/class/hwmon/hwmon2/device/power*_average
4294967295000

root@server3:~# ls /sys/devices/LNXSYSTM:00/LNXPWRBN:00
driver hid input modalias power subsystem uevent wakeup

thanks

barby1138 commented 10 months ago

some additional info

root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_average_interval_max
3600000
root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_average_interval
1000
root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_accuracy
97.500%
root@server3:~# cat /sys/class/hwmon/hwmon2/device/status
15

barby1138 commented 10 months ago

Any update? Thanks.

marceloamaral commented 10 months ago

@barby1138 your system power is too high.

In my system, I have:

cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average
579000000

The value is in microwatts, so that is 579 W.

Your system has:

cat /sys/class/hwmon/hwmon2/device/power*_average
4294967295000

So it reports about 4294967 W, which is far too high for a gauge value.
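
The arithmetic can be sketched as below. One extra observation, offered as an interpretation rather than something confirmed in this thread: the reported raw value equals 4294967295 × 1000, and 4294967295 is 2^32 − 1 (0xFFFFFFFF), which looks more like an "all bits set" placeholder scaled from milliwatts than a real measurement. The helper names are hypothetical.

```python
UINT32_MAX = 0xFFFFFFFF  # 4294967295, i.e. 2**32 - 1


def microwatts_to_watts(raw_uw: int) -> float:
    """Convert a raw power*_average reading (microwatts) to watts."""
    return raw_uw / 1_000_000


def looks_like_sentinel(raw_uw: int) -> bool:
    """True if the reading equals 0xFFFFFFFF scaled by 1000.

    Assumption: the underlying value is a 32-bit milliwatt figure multiplied
    by 1000 to get microwatts, so an all-ones value would show up like this.
    """
    return raw_uw == UINT32_MAX * 1000


readings = {
    "reference system": 579_000_000,
    "reported system": 4_294_967_295_000,
}
for label, raw in readings.items():
    note = " (sentinel-like)" if looks_like_sentinel(raw) else ""
    print(f"{label}: {microwatts_to_watts(raw)} W{note}")
```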

I am guessing here: could it be that this power value is a counter on your system? Can you check how the value changes over time?

Anyway, it seems to be an OS bug; it might be better to ignore the platform and other power consumption.
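
One way to check how the value changes, as asked above, is to sample the file a few times and look at the deltas. The sketch below is only a rough heuristic with hypothetical helper names: a value that never decreases over a short window is labelled counter-like, even though a gauge could also happen to rise during that time.

```python
import time
from typing import Callable, List, Sequence


def classify_readings(values: Sequence[int]) -> str:
    """Label a series as 'constant', 'counter-like' or 'gauge-like' from its deltas."""
    if len(values) < 2:
        raise ValueError("need at least two readings")
    deltas = [b - a for a, b in zip(values, values[1:])]
    if all(d == 0 for d in deltas):
        return "constant"
    if all(d >= 0 for d in deltas):
        return "counter-like"  # never decreases within the window
    return "gauge-like"  # goes both up and down


def sample(read: Callable[[], int], count: int = 5, interval_s: float = 1.0) -> List[int]:
    """Call read() count times, sleeping interval_s between calls."""
    values: List[int] = []
    for i in range(count):
        values.append(read())
        if i < count - 1:
            time.sleep(interval_s)
    return values


def read_file_int(path: str) -> int:
    """Read a single integer from a sysfs-style file."""
    with open(path) as fh:
        return int(fh.read().strip())


# Example with made-up readings instead of a real sysfs file:
print(classify_readings([100, 250, 250, 400]))
print(classify_readings([579, 560, 590]))
```

On the affected host, `read` could be something like `lambda: read_file_int(path)` with `path` pointing at the `power*_average` file shown earlier. A result of "constant" at 4294967295000 would point towards a stuck or invalid reading rather than a counter.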

marceloamaral commented 9 months ago

The fix in #1222 should also affect this issue.

sustainable-computing-io / kepler

EKS-A - problem to make kepler work properly - very high / non realistic values #1049