Open barby1138 opened 12 months ago
Can you use the libbpf images?
quay.io/sustainable_computing_io/kepler:latest-libbpf
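In case it helps, one way to switch to that image (a sketch, assuming the default manifests, i.e. a DaemonSet and container both named kepler-exporter in the kepler namespace) would be roughly:

# point the DaemonSet at the libbpf image and wait for the rollout to finish
kubectl -n kepler set image daemonset/kepler-exporter kepler-exporter=quay.io/sustainable_computing_io/kepler:latest-libbpf
kubectl -n kepler rollout status daemonset/kepler-exporter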
Hi, thanks for the response. The log really looks good now, but the values are still unrealistic. Check the pics. Note: the pods mostly show the same (uninitialized?) numbers.
@barby1138, just to confirm: EKS runs on VMs, correct? In the logs, Kepler is gathering hardware counters (HC) such as CPU instructions; however, VMs usually don't expose hardware counters, which could be the problem.
Could you please verify in Prometheus whether the Kepler metrics show CPU instructions and eBPF time, and what values you are seeing?
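For example (a sketch, assuming Prometheus is reachable on port 9090 as in the curl commands later in this thread), the hardware-counter metric could be checked with:

# per-pod rate of CPU instructions; all-zero or empty results would mean the HC events are not being collected
curl -sG http://10.105.110.245:9090/api/v1/query \
  --data-urlencode 'query=sum by (pod) (rate(kepler_container_cpu_instructions_total[5m]))'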
Hi. This is EKS Anywhere; it's installed on a bare-metal server (24 cores).
@barby1138 do you know the CPU model? Is it Intel x86?
The problem seems to be related to the OTHER power consumption. Maybe there is a problem measuring the platform power (i.e., total node power).
Can you get a plot of the following Prometheus metrics? (A quick way to pull them is sketched right after the queries.)
irate(kepler_node_dram_joules_total{}[1m])
irate(kepler_node_other_joules_total{}[1m])
irate(kepler_node_package_joules_total{}[1m])
irate(kepler_node_platform_joules_total{}[1m])
kepler_container_cpu_instructions_total
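For reference, these can also be pulled directly from the Prometheus HTTP API; a minimal sketch (host and port taken from the curl commands below):

# instant value of each 1m rate; /api/v1/query_range with start/end/step would give the full plot
for q in kepler_node_dram_joules_total kepler_node_other_joules_total kepler_node_package_joules_total kepler_node_platform_joules_total; do
  curl -sG http://10.105.110.245:9090/api/v1/query \
    --data-urlencode "query=irate(${q}{}[1m])"
  echo
done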
Hello @marceloamaral, in which branch can we find the code changes for this image (quay.io/sustainable_computing_io/kepler:latest-libbpf)?
Hi, please check the requested values:
root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_other_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kepler_node_other_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter"},"value":[1701687056.612,"0"]},{"metric":{"__name__":"kepler_node_other_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter"},"value":[1701687056.612,"257694471.464"]}]}}
root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_package_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kepler_node_package_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687091.060,"86.851"]},{"metric":{"__name__":"kepler_node_package_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687091.060,"4957.6"]}]}}
root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_platform_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kepler_node_platform_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"acpi"},"value":[1701687121.796,"0"]},{"metric":{"__name__":"kepler_node_platform_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"acpi"},"value":[1701687121.796,"515396075.4"]}]}}
root@server3:~/tsis/kepler/kepler# curl http://10.105.110.245:9090/api/v1/query?query=kepler_node_dram_joules_total
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kepler_node_dram_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"dynamic","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687157.973,"13.832"]},{"metric":{"__name__":"kepler_node_dram_joules_total","container":"kepler-exporter","endpoint":"http","exported_instance":"server3","instance":"server3","job":"kepler-exporter","mode":"idle","namespace":"kepler","package":"0","pod":"kepler-exporter-mfk7m","service":"kepler-exporter","source":"rapl"},"value":[1701687157.973,"793.656"]}]}}
Hi, any updates? Thanks.
@barby1138, as we can see, the platform power is too high.
Can you run the sensors command on the host to get the node power?
We try to get the node power from the following paths:
hwmonPowerPath = "/sys/class/hwmon/hwmon2/device/"
acpiPowerPath = "/sys/devices/LNXSYSTM:00"
the file name is power*_average
Can you please also check those files?
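A minimal set of commands for that check (a sketch, assuming lm-sensors is installed; the hwmon index may be something other than hwmon2 on a given machine):

sensors
# dump any hwmon average-power readings (values are in microwatts)
cat /sys/class/hwmon/hwmon*/device/power*_average 2>/dev/null
# look for an ACPI power meter exposing power*_average under the ACPI system bus
find /sys/devices/LNXSYSTM:00 -name 'power*_average' 2>/dev/null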
Hi
root@server3:~# cat /sys/class/hwmon/hwmon2/device/power*_average
4294967295000
root@server3:~# ls /sys/devices/LNXSYSTM:00/LNXPWRBN:00
driver hid input modalias power subsystem uevent wakeup
thanks
Some additional info:
root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_average_interval_max
3600000
root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_average_interval
1000
root@server3:~# cat /sys/class/hwmon/hwmon2/device/power1_accuracy
97.500%
root@server3:~# cat /sys/class/hwmon/hwmon2/device/status
15
Any outcome? Thanks.
@barby1138 your system power is too high.
In my system, I have:
cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average
579000000
The value is in microwatts, so we have 579 W.
Your system has:
cat /sys/class/hwmon/hwmon2/device/power*_average
4294967295000
So it reports about 4294967 W, which is way too high for a gauge value. (Note that the raw reading is 4294967295 * 1000, and 4294967295 is 2^32 - 1, an all-ones 32-bit value, which already hints that the sensor is not reporting a valid number.)
I am guessing here: could it be that this power is exposed as a counter on your system? Can you check how the value changes over time? A quick sampling loop is sketched below.
Anyway, it seems to be an OS bug, so it might be better to ignore the platform and other power consumption.
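To see whether that reading behaves like a counter or a gauge, something like the following could be run (a sketch, assuming the file behind the power*_average glob above is power1_average). A steadily growing number would indicate a counter; a roughly stable, plausible wattage would indicate a gauge:

# sample once per second for 10 seconds, printing the raw microwatts and the converted watts
for i in $(seq 10); do
  raw=$(cat /sys/class/hwmon/hwmon2/device/power1_average)
  awk -v uw="$raw" 'BEGIN { printf "%s uW = %.1f W\n", uw, uw / 1e6 }'
  sleep 1
done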
The fix #1222 should also impact this issue.
Hi, I have a problem getting Kepler to work properly.
The system is EKS-A; the OS is Ubuntu 20.04.
In Grafana I see very high, unrealistic values.
root@server3:~/tsis/kepler/kepler# kubectl logs kepler-exporter-p46rb -n kepler
I1113 09:19:19.651274 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I1113 09:19:19.658826 1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I1113 09:19:19.670964 1 exporter.go:157] Kepler running on version: 9043798
I1113 09:19:19.670987 1 config.go:274] using gCgroup ID in the BPF program: true
I1113 09:19:19.671021 1 config.go:276] kernel version: 5.1
I1113 09:19:19.671135 1 config.go:301] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I1113 09:19:19.671152 1 exporter.go:169] LibbpfBuilt: false, BccBuilt: true
I1113 09:19:19.671212 1 config.go:207] kernel source dir is set to /usr/share/kepler/kernel_sources
I1113 09:19:19.671274 1 exporter.go:188] EnabledBPFBatchDelete: true
I1113 09:19:19.671338 1 power.go:54] use sysfs to obtain power
I1113 09:19:19.671378 1 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I1113 09:19:19.671475 1 power.go:56] use acpi to obtain power
I1113 09:19:19.788636 1 exporter.go:203] Initializing the GPU collector
I1113 09:19:25.793034 1 watcher.go:66] Using in cluster k8s config
I1113 09:19:25.893463 1 watcher.go:134] k8s APIserver watcher was started
W1113 09:19:28.162204 1 bcc_attacher.go:113] failed to load kprobe__set_page_dirty: Module: unable to find kprobe__set_page_dirty
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1113 09:19:28.263231 1 bcc_attacher.go:119] failed to attach kprobe/set_page_dirty or mark_buffer_dirty: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
W1113 09:19:28.263267 1 bcc_attacher.go:125] failed to load kprobe__mark_page_accessed: Module: unable to find kprobe__mark_page_accessed
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1113 09:19:28.315246 1 bcc_attacher.go:129] failed to attach kprobe/mark_page_accessed: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
I1113 09:19:28.347348 1 bcc_attacher.go:150] Successfully load eBPF module from using bcc
I1113 09:19:28.347380 1 bcc_attacher.go:208] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=28 -DSAMPLE_RATE=0]
I1113 09:19:28.372048 1 container_energy.go:114] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I1113 09:19:28.372062 1 container_energy.go:115] Container feature names: [cpu_instructions]
I1113 09:19:28.372080 1 container_energy.go:124] Using the Ratio/DynPower Power Model to estimate Container Component Power
I1113 09:19:28.372090 1 container_energy.go:125] Container feature names: [cpu_instructions cpu_instructions cache_miss gpu_sm_util]
I1113 09:19:28.372151 1 process_power.go:113] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I1113 09:19:28.372163 1 process_power.go:114] Container feature names: [cpu_instructions]
I1113 09:19:28.372180 1 process_power.go:123] Using the Ratio/DynPower Power Model to estimate Process Component Power
I1113 09:19:28.372229 1 process_power.go:124] Container feature names: [cpu_instructions cpu_instructions cache_miss gpu_sm_util]
I1113 09:19:28.373514 1 exporter.go:267] Started Kepler in 8.702580335s
The problem seems to be here:
W1113 09:19:28.263231 1 bcc_attacher.go:119] failed to attach kprobe/set_page_dirty or mark_buffer_dirty: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
W1113 09:19:28.263267 1 bcc_attacher.go:125] failed to load kprobe__mark_page_accessed: Module: unable to find kprobe__mark_page_accessed
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
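Not from this thread, but as a quick sanity check it may be worth verifying that the kprobe target functions exist in the running kernel at all; if a symbol is missing on this kernel, the attach will fail like above:

# exported kernel functions should show up as 'T' entries in kallsyms
grep -E ' (set_page_dirty|mark_buffer_dirty|mark_page_accessed)$' /proc/kallsyms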
Can you please assist?
Thank you!