sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the results as Prometheus metrics
https://sustainable-computing.io
Apache License 2.0

porting kepler to IBM s390x platform #543

Closed · jichenjc closed this issue 1 year ago

jichenjc commented 1 year ago

The goal of this issue is to summarize the IBM s390x porting work.

jichenjc commented 1 year ago

First, install bcc on RHEL 8.4+, then try some of the bundled tools; they work fine:

yum install libbpf
yum install bcc-tools

[root@m5404019 ~]# /usr/share/bcc/tools/vfsstat
TIME         READ/s  WRITE/s  FSYNC/s   OPEN/s CREATE/s
08:22:03:       169      130        0        7        0
08:22:04:       379      202        0       17        0
08:22:05:       244      175        0       10        0
08:22:06:        58       71        0        1        0
08:22:07:       296      123        0       80        0
08:22:08:       154      155        0        2        0
[root@m5404019 ~]# uname -a
Linux m5404019 4.18.0-305.el8.s390x #1 SMP Thu Apr 29 09:06:01 EDT 2021 s390x s390x s390x GNU/Linux
[root@m5404019 ~]#
jichenjc commented 1 year ago

Some steps are needed:

cp /usr/bin/gcc /usr/bin/s390x-linux-gnu-gcc ==> so we can compile the C files
CGO_ENABLED=1 make cross-build-linux-s390x ==> so Go can compile the C code

Now stuck at a missing bcc/bcc_common.h. It seems only Ubuntu might be OK; the key problem is that the files from the repo are not in sync with my current environment if I copy them over.

jichenjc commented 1 year ago

We need to update the LPAR settings to enable the perf counters in order to load them.

PERF_COUNT_HW_CPU_CYCLES and PERF_COUNT_HW_INSTRUCTIONS are defined in the basic counter set; the other events Kepler uses on the x86 platform are not available on s390x.

Currently softirq seems to have a problem:

 bcc_attacher.go:121] failed to attach perf module with options [-DNUM_CPUS=8 -DSET_GROUP_ID]: failed to load softirq_entry: Module: unable to find tracepoint__irq__softirq_entry, not able to load eBPF modules
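
To narrow down which events the LPAR actually authorizes, here is a minimal Go sketch (not Kepler code; the counter names just mirror Kepler's readers) that probes the generalized hardware events via perf_event_open from golang.org/x/sys/unix. Events outside the enabled s390x counter sets should fail to open:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// The hardware events corresponding to Kepler's perf readers.
	counters := []struct {
		name   string
		config uint64
	}{
		{"cpu_cycles", unix.PERF_COUNT_HW_CPU_CYCLES},
		{"cpu_instr", unix.PERF_COUNT_HW_INSTRUCTIONS},
		{"cache_miss", unix.PERF_COUNT_HW_CACHE_MISSES},
		{"cpu_ref_cycles", unix.PERF_COUNT_HW_REF_CPU_CYCLES},
	}
	for _, c := range counters {
		attr := unix.PerfEventAttr{
			Type:   unix.PERF_TYPE_HARDWARE,
			Config: c.config,
		}
		// pid=-1, cpu=0: count system-wide on CPU 0 (needs root).
		fd, err := unix.PerfEventOpen(&attr, -1, 0, -1, 0)
		if err != nil {
			fmt.Printf("%-14s not available: %v\n", c.name, err)
			continue
		}
		fmt.Printf("%-14s available\n", c.name)
		unix.Close(fd)
	}
}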
rootfs commented 1 year ago

@jichenjc what is the softirq_entry signature in your /proc/kallsyms?

jichenjc commented 1 year ago

@rootfs here's the output:

# cat /proc/kallsyms | grep softirq_entry
0000000605c513d0 T __traceiter_softirq_entry
0000000605d780d0 t trace_softirq_entry_callback
0000000606711848 d __tracepoint_ptr_softirq_entry
0000000606712696 d __tpstrtab_softirq_entry
0000000606b68f38 D __tracepoint_softirq_entry
0000000606b7e240 d __bpf_trace_tp_map_softirq_entry
0000000606baa828 d event_softirq_entry
0000000606baa9e8 D __SCK__tp_func_softirq_entry
0000000606d8ccf0 d __event_softirq_entry
rootfs commented 1 year ago

Can you get the output of cat /sys/kernel/debug/tracing/events/irq/softirq_entry/format?

jichenjc commented 1 year ago
# cat /sys/kernel/debug/tracing/events/irq/softirq_entry/format
name: softirq_entry
ID: 69
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:unsigned int vec; offset:8;       size:4; signed:0;

print fmt: "vec=%u [action=%s]", REC->vec, __print_symbolic(REC->vec, { 0, "HI" }, { 1, "TIMER" }, { 2, "NET_TX" }, { 3, "NET_RX" }, { 4, "BLOCK" }, { 5, "IRQ_POLL" }, { 6, "TASKLET" }, { 7, "SCHED" }, { 8, "HRTIMER" }, { 9, "RCU" })
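
As a side note, the ID in that format output can be used to sanity-check the tracepoint outside of bcc. A minimal Go sketch, assuming tracefs is mounted under /sys/kernel/debug, that opens a perf event on irq:softirq_entry by ID; if this succeeds, the tracepoint itself is usable and the failure is elsewhere in the module-loading path:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	// The id file holds the same number as "ID:" in the format file (69 here).
	raw, err := os.ReadFile("/sys/kernel/debug/tracing/events/irq/softirq_entry/id")
	if err != nil {
		panic(err)
	}
	id, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		panic(err)
	}
	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_TRACEPOINT,
		Config: id, // tracepoint ID from tracefs
	}
	// pid=-1, cpu=0: attach system-wide on CPU 0 (needs root).
	fd, err := unix.PerfEventOpen(&attr, -1, 0, -1, 0)
	if err != nil {
		fmt.Println("tracepoint not usable:", err)
		return
	}
	fmt.Println("tracepoint softirq_entry opened, fd =", fd)
	unix.Close(fd)
}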
rootfs commented 1 year ago

Odd, it should work. Can you try this bcc tool: https://github.com/iovisor/bcc/blob/master/tools/softirqs.py

Here is my test:

# python3 ./softirqs.py
Tracing soft irq event time... Hit Ctrl-C to end.
^C
SOFTIRQ          TOTAL_usecs
tasklet                    6
net_tx                    15
block                   1001
timer                  17774
rcu                    24051
net_rx                 29631
sched                 143765
jichenjc commented 1 year ago

Weird, I can run it successfully as well:

[root@kvms2p11 tools]# ./softirqs.py
Tracing soft irq event time... Hit Ctrl-C to end.
^C
SOFTIRQ          TOTAL_usecs
rcu                      165
timer                    203
sched                    436
net_rx                   543
jichenjc commented 1 year ago

OK, the above issue is solved. The cause seems to be that go-bindata was not installed, so the follow-up build scripts did not run successfully and did not pick up the change I had made to debug. Sorry for the confusion; I will submit a PR to improve this area.

jichenjc commented 1 year ago

The latest run is:

# ./kepler
I0227 02:12:41.116321    4645 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0227 02:12:41.117055    4645 exporter.go:150] Kepler running on version: v-latest-1-g49f97bb-dirty
I0227 02:12:41.117070    4645 config.go:148] using gCgroup ID in the BPF program: true
I0227 02:12:41.117098    4645 config.go:149] kernel version: 5.14
I0227 02:12:41.117119    4645 config.go:167] EnabledGPU: false
I0227 02:12:41.118677    4645 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/system.slice, this likely cause all cgroup metrics to be 0
I0227 02:12:41.118918    4645 acpi.go:77] Could not find any ACPI power meter path. Is it a VM?
cannot attach kprobe, probe entry may not exist
perf_event_open: No such file or directory
I0227 02:12:41.557034    4645 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0227 02:12:41.557069    4645 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0227 02:12:41.557091    4645 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0227 02:12:41.557099    4645 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=8 -DSET_GROUP_ID]
I0227 02:12:41.557131    4645 exporter.go:185] failed to start : failed to get response: failed to read from "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
I0227 02:12:41.557200    4645 exporter.go:210] Started Kepler in 440.155421ms
rootfs commented 1 year ago

@jichenjc sounds like good progress! Do you see any Kepler process metrics?

jichenjc commented 1 year ago

Yes, I can see something like the following. Of course, the model needs updating and there are no pods running, so there are only general Kepler metrics, and the data is incorrect:

kepler_exporter_build_info{branch="",goversion="go1.18.9",revision="",version=""} 1
# HELP kepler_node_energy_stat Several labeled node metrics
# TYPE kepler_node_energy_stat counter
kepler_node_energy_stat{cpu_architecture="unknown",node_block_devices_used="0",node_curr_bytes_read="0",node_curr_bytes_writes="0",node_curr_cache_miss="0",node_curr_container_cpu_usage_seconds_total="0",node_curr_container_memory_working_set_bytes="0",node_curr_cpu_cycles="0",node_curr_cpu_instr="0",node_curr_cpu_time="0",node_curr_energy_in_core_joule="0",node_curr_energy_in_dram_joule="0",node_curr_energy_in_gpu_joule="0",node_curr_energy_in_other_joule="0",node_curr_energy_in_pkg_joule="0",node_curr_energy_in_uncore_joule="0",node_name="kvms2p11"} 0
# HELP kepler_node_nodeInfo Labeled node information
# TYPE kepler_node_nodeInfo counter
kepler_node_nodeInfo{cpu_architecture="unknown"} 1
# HELP kepler_node_other_host_components_joules_total Aggregated RAPL value in other components (platform - package - dram) in joules
# TYPE kepler_node_other_host_components_joules_total counter
kepler_node_other_host_components_joules_total{instance="kvms2p11",mode="dynamic"} 0
kepler_node_other_host_components_joules_total{instance="kvms2p11",mode="idle"} 0
# HELP kepler_node_platform_joules_total Aggregated RAPL value in platform (entire node) in joules
# TYPE kepler_node_platform_joules_total counter
kepler_node_platform_joules_total{instance="kvms2p11",mode="dynamic",source="acpi"} 0
kepler_node_platform_joules_total{instance="kvms2p11",mode="idle",source="acpi"} 0
rootfs commented 1 year ago

@jichenjc can you enable process metrics and restart Kepler to see if it captures any?

export ENABLE_PROCESS_METRICS=true
./kepler -v 5
jichenjc commented 1 year ago
# ./kepler -v 5
I0227 21:56:31.803782    5229 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0227 21:56:31.804412    5229 exporter.go:150] Kepler running on version: v-latest-11-gbd66a13
I0227 21:56:31.804426    5229 config.go:153] using gCgroup ID in the BPF program: true
I0227 21:56:31.804447    5229 config.go:154] kernel version: 5.14
I0227 21:56:31.804469    5229 config.go:172] EnabledGPU: false
I0227 21:56:31.804490    5229 slice_handler.go:145] InitSliceHandler: &{map[] /sys/fs/cgroup/system.slice /sys/fs/cgroup/system.slice /sys/fs/cgroup/system.slice}
I0227 21:56:31.804588    5229 rapl_msr_util.go:143] failed to open path /dev/cpu/0/msr: no such file or directory
I0227 21:56:31.804602    5229 power.go:64] Not able to obtain power, use estimate method
I0227 21:56:31.804609    5229 bcc_attacher.go:144] hardeware counter metrics config true
I0227 21:56:31.804613    5229 bcc_attacher.go:162] irq counter metrics config true
I0227 21:56:31.806024    5229 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/system.slice, this likely cause all cgroup metrics to be 0
I0227 21:56:31.806256    5229 utils.go:58] Available ebpf metrics: [cpu_time irq_net_tx irq_net_rx irq_block]
I0227 21:56:31.806264    5229 utils.go:59] Available counter metrics: [cpu_instr cache_miss cpu_cycles cpu_ref_cycles]
I0227 21:56:31.806269    5229 utils.go:60] Available cgroup metrics from cgroup: [cgroupfs_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_ioread_bytes cgroupfs_iowrite_bytes]
I0227 21:56:31.806276    5229 utils.go:61] Available cgroup metrics from kubelet: []
I0227 21:56:31.806282    5229 utils.go:62] Available I/O metrics: [bytes_read bytes_writes]
I0227 21:56:31.806296    5229 model.go:85] Model Config NODE_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0227 21:56:31.806310    5229 lr.go:171] LR Model (AbsModelWeight): no config
I0227 21:56:31.806317    5229 model.go:77] Model AbsModelWeight initiated (false)
I0227 21:56:31.806322    5229 model.go:85] Model Config NODE_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0227 21:56:31.806332    5229 lr.go:164] LR Model (AbsComponentModelWeight): loadWeightFromURLorLocal(/var/lib/kepler/data/KerasCompWeightFullPipeline.json): <nil>
I0227 21:56:31.806338    5229 lr.go:173] LR Model (AbsComponentModelWeight): open /var/lib/kepler/data/KerasCompWeightFullPipeline.json: no such file or directory
I0227 21:56:31.806344    5229 model.go:77] Model AbsComponentModelWeight initiated (false)
I0227 21:56:31.806352    5229 model.go:85] Model Config CONTAINER_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0227 21:56:31.806359    5229 lr.go:171] LR Model (DynModelWeight): no config
I0227 21:56:31.806365    5229 model.go:77] Model DynModelWeight initiated (false)
I0227 21:56:31.806371    5229 model.go:85] Model Config CONTAINER_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0227 21:56:31.806381    5229 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(/var/lib/kepler/data/ScikitMixed.json): <nil>
I0227 21:56:31.806384    5229 lr.go:173] LR Model (DynComponentModelWeight): open /var/lib/kepler/data/ScikitMixed.json: no such file or directory
I0227 21:56:31.806389    5229 model.go:77] Model DynComponentModelWeight initiated (false)
I0227 21:56:31.806400    5229 acpi.go:102] Could not find any ACPI power meter path: lstat /sys/devices/LNXSYSTM:00: no such file or directory
I0227 21:56:31.806409    5229 acpi.go:77] Could not find any ACPI power meter path. Is it a VM?
cannot attach kprobe, probe entry may not exist
perf_event_open: No such file or directory
I0227 21:56:32.246355    5229 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0227 21:56:32.246566    5229 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0227 21:56:32.246588    5229 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
I0227 21:56:32.246596    5229 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=8 -DSET_GROUP_ID]
I0227 21:56:32.246618    5229 resolve_container.go:144] failed to get response: failed to read from "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
I0227 21:56:32.246654    5229 node_energy_collector.go:63] No nodeComponentsEnergy found, node components energy metrics is not exposed
F0227 21:56:32.246750    5229 acpi.go:171] open /sys/devices/system/cpu/cpufreq/: no such file or directory

It seems Kepler expects /sys/devices/system/cpu/cpufreq/, but I don't have that directory here:

[root@kvms2p11 ~]# ls /sys/devices/system/cpu
cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7  dispatching  hotplug  isolated  kernel_max  modalias  offline  online  possible  present  rescan  smt  uevent  vulnerabilities
[root@kvms2p11 ~]# ls /sys/devices/system/cpu/cpu0
address  cache  configure  crash_notes  crash_notes_size  dedicated  hotplug  idle_count  idle_time_us  node0  online  polarization  subsystem  topology  uevent

Need to check how to handle this; one possible guard is sketched below.
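For illustration, a minimal Go sketch of that guard idea, assuming the fix is simply to degrade gracefully when the cpufreq sysfs tree is absent (the function name and behavior here are hypothetical, not Kepler's actual API):

package main

import (
	"errors"
	"fmt"
	"os"
)

const cpuFreqPath = "/sys/devices/system/cpu/cpufreq/"

// cpuFreqAvailable reports whether the platform exposes cpufreq at all;
// s390x guests like the one above do not.
func cpuFreqAvailable() bool {
	_, err := os.Stat(cpuFreqPath)
	return !errors.Is(err, os.ErrNotExist)
}

func main() {
	if !cpuFreqAvailable() {
		// Degrade gracefully instead of a fatal exit: simply export no
		// CPU frequency metrics on this platform.
		fmt.Println("cpufreq not exposed; skipping CPU frequency metrics")
		return
	}
	// ... read per-policy scaling_cur_freq files here ...
}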

jichenjc commented 1 year ago

With https://github.com/sustainable-computing-io/kepler/pull/551 on top of the above, I am able to see something like:

kepler_process_cache_miss_total{command="in:imjourn",pid="14916766390781214720"} 0
kepler_process_core_joules_total{command="bash",mode="idle",pid="8797500397091553280"} 0
....

Of course, everything is 0 in my env for now, since more updates are needed.

rootfs commented 1 year ago

can you filter and sort the metrics?

get_your_metrics | grep kepler_process_core |sort -k 2 -g
jichenjc commented 1 year ago

It's all 0. I think some data is not being provided, so I'm checking how to feed the data into the model and generate the power/energy figures:

kepler_process_core_joules_total{command="bash",mode="dynamic",pid="10961198543066365952"} 0
kepler_process_core_joules_total{command="bash",mode="idle",pid="10961198543066365952"} 0
kepler_process_core_joules_total{command="gmain",mode="dynamic",pid="6631268976326344704"} 0
kepler_process_core_joules_total{command="gmain",mode="idle",pid="6631268976326344704"} 0
kepler_process_core_joules_total{command="in:imjourn",mode="dynamic",pid="14916766390781214720"} 0
kepler_process_core_joules_total{command="in:imjourn",mode="idle",pid="14916766390781214720"} 0
kepler_process_core_joules_total{command="kcompactd0",mode="dynamic",pid="4395513236313604096"} 0
kepler_process_core_joules_total{command="kcompactd0",mode="idle",pid="4395513236313604096"} 0
kepler_process_core_joules_total{command="kepler",mode="dynamic",pid="13771444710545555456"} 0
kepler_process_core_joules_total{command="kepler",mode="idle",pid="13771444710545555456"} 0
kepler_process_core_joules_total{command="khungtaskd",mode="dynamic",pid="4179340454199820288"} 0
kepler_process_core_joules_total{command="khungtaskd",mode="idle",pid="4179340454199820288"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="dynamic",pid="2089670227099910144"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="dynamic",pid="3170534137668829184"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="dynamic",pid="936748722493063168"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="idle",pid="2089670227099910144"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="idle",pid="3170534137668829184"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="idle",pid="936748722493063168"} 0
kepler_process_core_joules_total{command="kworker/0:",mode="dynamic",pid="6205115861586411520"} 0
kepler_process_core_joules_total{command="kworker/0:",mode="idle",pid="6205115861586411520"} 0
kepler_process_core_joules_total{command="kworker/1:",mode="dynamic",pid="12906753582090420224"} 0
kepler_process_core_joules_total{command="kworker/1:",mode="idle",pid="12906753582090420224"} 0
kepler_process_core_joules_total{command="kworker/2:",mode="dynamic",pid="4115727109463212032"} 0
kepler_process_core_joules_total{command="kworker/2:",mode="idle",pid="4115727109463212032"} 0
kepler_process_core_joules_total{command="kworker/3:",mode="dynamic",pid="14778562177216282624"} 0
kepler_process_core_joules_total{command="kworker/3:",mode="idle",pid="14778562177216282624"} 0
kepler_process_core_joules_total{command="kworker/4:",mode="dynamic",pid="12617960255985287168"} 0
kepler_process_core_joules_total{command="kworker/4:",mode="idle",pid="12617960255985287168"} 0
kepler_process_core_joules_total{command="kworker/5:",mode="dynamic",pid="12690580799976636416"} 0
kepler_process_core_joules_total{command="kworker/5:",mode="idle",pid="12690580799976636416"} 0
kepler_process_core_joules_total{command="kworker/6:",mode="dynamic",pid="11321486513256005632"} 0
kepler_process_core_joules_total{command="kworker/6:",mode="idle",pid="11321486513256005632"} 0
kepler_process_core_joules_total{command="kworker/7:",mode="dynamic",pid="12690017850023215104"} 0
kepler_process_core_joules_total{command="kworker/7:",mode="idle",pid="12690017850023215104"} 0
kepler_process_core_joules_total{command="kworker/u7",mode="dynamic",pid="12978811176128348160"} 0
kepler_process_core_joules_total{command="kworker/u7",mode="dynamic",pid="9447989068269879296"} 0
kepler_process_core_joules_total{command="kworker/u7",mode="idle",pid="12978811176128348160"} 0
kepler_process_core_joules_total{command="kworker/u7",mode="idle",pid="9447989068269879296"} 0
kepler_process_core_joules_total{command="lsmd",mode="dynamic",pid="14628536014629502976"} 0
kepler_process_core_joules_total{command="lsmd",mode="idle",pid="14628536014629502976"} 0
kepler_process_core_joules_total{command="NetworkMan",mode="dynamic",pid="18303473310563827712"} 0
kepler_process_core_joules_total{command="NetworkMan",mode="idle",pid="18303473310563827712"} 0
kepler_process_core_joules_total{command="rcu_sched",mode="dynamic",pid="1008806316530991104"} 0
kepler_process_core_joules_total{command="rcu_sched",mode="idle",pid="1008806316530991104"} 0
kepler_process_core_joules_total{command="sshd",mode="dynamic",pid="10889140949028438016"} 0
kepler_process_core_joules_total{command="sshd",mode="idle",pid="10889140949028438016"} 0
kepler_process_core_joules_total{command="swapper/0",mode="dynamic",pid="0"} 0
kepler_process_core_joules_total{command="swapper/0",mode="idle",pid="0"} 0
kepler_process_core_joules_total{command="systemd-jo",mode="dynamic",pid="2234629840105897984"} 0
kepler_process_core_joules_total{command="systemd-jo",mode="idle",pid="2234629840105897984"} 0
kepler_process_core_joules_total{command="systemd-lo",mode="dynamic",pid="15637342331160494080"} 0
kepler_process_core_joules_total{command="systemd-lo",mode="idle",pid="15637342331160494080"} 0
kepler_process_core_joules_total{command="systemd",mode="dynamic",pid="72057594037927936"} 0
kepler_process_core_joules_total{command="systemd",mode="idle",pid="72057594037927936"} 0
kepler_process_core_joules_total{command="xfsaild/dm",mode="dynamic",pid="12970929876780449792"} 0
kepler_process_core_joules_total{command="xfsaild/dm",mode="idle",pid="12970929876780449792"} 0
rootfs commented 1 year ago

@jichenjc can you check the other metrics, e.g. kepler_process_cpu_cpu_time?

rootfs commented 1 year ago

At the moment, the process model is based on perf counters:

{"pkg": {"All_Weights": {"Bias_Weight": 24.388564716241596, "Categorical_Variables": {}, "Numerical_Variables": {"cpu_cycles": {"mean": 16455473194.342838, "variance": 5.6028839338593524e+20, "weight": 15.858373957810427}, "cache_miss": {"mean": 19615158.772131525, "variance": 83203515341825.39, "weight": 0.0}, "cpu_instr": {"mean": 23490652312.518856, "variance": 3.0816587041591017e+21, "weight": 8.25749138735891}}}}, "dram": {"All_Weights": {"Bias_Weight": 0.8318076441807906, "Categorical_Variables": {}, "Numerical_Variables": {"cpu_cycles": {"mean": 16455473194.342838, "variance": 5.6028839338593524e+20, "weight": 0.146880994775066}, "cache_miss": {"mean": 19615158.772131525, "variance": 83203515341825.39, "weight": 0.22602678738192125}, "cpu_instr": {"mean": 23490652312.518856, "variance": 3.0816587041591017e+21, "weight": 0.0}}}}}
jichenjc commented 1 year ago

kepler_process_cpu_cpu_time_us{command="kworker/3:",pid="10310146921934618624"} 7.205759403792794e+16
kepler_process_cpu_cpu_time_us{command="bash",pid="4548354148667490304"} 2.161727821137838e+17
kepler_process_cpu_cpu_time_us{command="swapper/2",pid="0"} 3.627368024870224e+18

Those 3 have values; the others are all 0.

rootfs commented 1 year ago

@jichenjc sounds good. Can you create a config file and set the process model (follow this function) to the BPFOnly model?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.