Closed jichenjc closed 1 year ago
First, install bcc on RHEL 8.4+, then try some commands; they work fine:
yum install libbpf
yum install bcc-tools
[root@m5404019 ~]# /usr/share/bcc/tools/vfsstat
TIME READ/s WRITE/s FSYNC/s OPEN/s CREATE/s
08:22:03: 169 130 0 7 0
08:22:04: 379 202 0 17 0
08:22:05: 244 175 0 10 0
08:22:06: 58 71 0 1 0
08:22:07: 296 123 0 80 0
08:22:08: 154 155 0 2 0
[root@m5404019 ~]# uname -a
Linux m5404019 4.18.0-305.el8.s390x #1 SMP Thu Apr 29 09:06:01 EDT 2021 s390x s390x s390x GNU/Linux
[root@m5404019 ~]#
A few extra steps are needed:
cp /usr/bin/gcc /usr/bin/s390x-linux-gnu-gcc ==> so we can compile the C file
CGO_ENABLED=1 make cross-build-linux-s390x ==> so Go can compile the C code
Stuck at a missing bcc/bcc_common.h; it seems only Ubuntu might work. The key problem is that the files in the repo are not in sync with my current environment if I copy them over.
We need to update the LPAR settings to enable the performance counters in order to load them.
PERF_COUNT_HW_CPU_CYCLES and PERF_COUNT_HW_INSTRUCTIONS are defined in the basic counter set; the other events that Kepler uses on x86 are not available on s390x.
Currently softirq seems to have a problem:
bcc_attacher.go:121] failed to attach perf module with options [-DNUM_CPUS=8 -DSET_GROUP_ID]: failed to load softirq_entry: Module: unable to find tracepoint__irq__softirq_entry, not able to load eBPF modules
@jichenjc what is the softirq_entry signature in your /proc/kallsyms?
@rootfs here's the output:
# cat /proc/kallsyms | grep softirq_entry
0000000605c513d0 T __traceiter_softirq_entry
0000000605d780d0 t trace_softirq_entry_callback
0000000606711848 d __tracepoint_ptr_softirq_entry
0000000606712696 d __tpstrtab_softirq_entry
0000000606b68f38 D __tracepoint_softirq_entry
0000000606b7e240 d __bpf_trace_tp_map_softirq_entry
0000000606baa828 d event_softirq_entry
0000000606baa9e8 D __SCK__tp_func_softirq_entry
0000000606d8ccf0 d __event_softirq_entry
Can you get the output of cat /sys/kernel/debug/tracing/events/irq/softirq_entry/format?
# cat /sys/kernel/debug/tracing/events/irq/softirq_entry/format
name: softirq_entry
ID: 69
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned int vec; offset:8; size:4; signed:0;
print fmt: "vec=%u [action=%s]", REC->vec, __print_symbolic(REC->vec, { 0, "HI" }, { 1, "TIMER" }, { 2, "NET_TX" }, { 3, "NET_RX" }, { 4, "BLOCK" }, { 5, "IRQ_POLL" }, { 6, "TASKLET" }, { 7, "SCHED" }, { 8, "HRTIMER" }, { 9, "RCU" })
odd, it should work. Can you try this bcc tool https://github.com/iovisor/bcc/blob/master/tools/softirqs.py
Here is my test:
# python3 ./softirqs.py
Tracing soft irq event time... Hit Ctrl-C to end.
^C
SOFTIRQ TOTAL_usecs
tasklet 6
net_tx 15
block 1001
timer 17774
rcu 24051
net_rx 29631
sched 143765
Weird... I can run this successfully as well:
[root@kvms2p11 tools]# ./softirqs.py
Tracing soft irq event time... Hit Ctrl-C to end.
^C
SOFTIRQ TOTAL_usecs
rcu 165
timer 203
sched 436
net_rx 543
OK, the above issue is solved. The root cause seems to be that go-bindata was not installed, so the follow-up build scripts did not run successfully and did not pick up the previous change I made for debugging. Sorry for the confusion; I will submit a PR to improve this area.
The latest run is:
# ./kepler
I0227 02:12:41.116321 4645 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0227 02:12:41.117055 4645 exporter.go:150] Kepler running on version: v-latest-1-g49f97bb-dirty
I0227 02:12:41.117070 4645 config.go:148] using gCgroup ID in the BPF program: true
I0227 02:12:41.117098 4645 config.go:149] kernel version: 5.14
I0227 02:12:41.117119 4645 config.go:167] EnabledGPU: false
I0227 02:12:41.118677 4645 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/system.slice, this likely cause all cgroup metrics to be 0
I0227 02:12:41.118918 4645 acpi.go:77] Could not find any ACPI power meter path. Is it a VM?
cannot attach kprobe, probe entry may not exist
perf_event_open: No such file or directory
I0227 02:12:41.557034 4645 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0227 02:12:41.557069 4645 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0227 02:12:41.557091 4645 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0227 02:12:41.557099 4645 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=8 -DSET_GROUP_ID]
I0227 02:12:41.557131 4645 exporter.go:185] failed to start : failed to get response: failed to read from "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
I0227 02:12:41.557200 4645 exporter.go:210] Started Kepler in 440.155421ms
@jichenjc sounds like good progress! Do you see any Kepler process metrics?
Yes, I can see something like the following. Of course, the model needs updating and there are no pods running, so there are only general Kepler metrics and the data is incorrect:
kepler_exporter_build_info{branch="",goversion="go1.18.9",revision="",version=""} 1
# HELP kepler_node_energy_stat Several labeled node metrics
# TYPE kepler_node_energy_stat counter
kepler_node_energy_stat{cpu_architecture="unknown",node_block_devices_used="0",node_curr_bytes_read="0",node_curr_bytes_writes="0",node_curr_cache_miss="0",node_curr_container_cpu_usage_seconds_total="0",node_curr_container_memory_working_set_bytes="0",node_curr_cpu_cycles="0",node_curr_cpu_instr="0",node_curr_cpu_time="0",node_curr_energy_in_core_joule="0",node_curr_energy_in_dram_joule="0",node_curr_energy_in_gpu_joule="0",node_curr_energy_in_other_joule="0",node_curr_energy_in_pkg_joule="0",node_curr_energy_in_uncore_joule="0",node_name="kvms2p11"} 0
# HELP kepler_node_nodeInfo Labeled node information
# TYPE kepler_node_nodeInfo counter
kepler_node_nodeInfo{cpu_architecture="unknown"} 1
# HELP kepler_node_other_host_components_joules_total Aggregated RAPL value in other components (platform - package - dram) in joules
# TYPE kepler_node_other_host_components_joules_total counter
kepler_node_other_host_components_joules_total{instance="kvms2p11",mode="dynamic"} 0
kepler_node_other_host_components_joules_total{instance="kvms2p11",mode="idle"} 0
# HELP kepler_node_platform_joules_total Aggregated RAPL value in platform (entire node) in joules
# TYPE kepler_node_platform_joules_total counter
kepler_node_platform_joules_total{instance="kvms2p11",mode="dynamic",source="acpi"} 0
kepler_node_platform_joules_total{instance="kvms2p11",mode="idle",source="acpi"} 0
@jichenjc can you enable process metrics and restart Kepler to see if it captures any process metric?
export ENABLE_PROCESS_METRICS=true
./kepler -v 5
# ./kepler -v 5
I0227 21:56:31.803782 5229 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0227 21:56:31.804412 5229 exporter.go:150] Kepler running on version: v-latest-11-gbd66a13
I0227 21:56:31.804426 5229 config.go:153] using gCgroup ID in the BPF program: true
I0227 21:56:31.804447 5229 config.go:154] kernel version: 5.14
I0227 21:56:31.804469 5229 config.go:172] EnabledGPU: false
I0227 21:56:31.804490 5229 slice_handler.go:145] InitSliceHandler: &{map[] /sys/fs/cgroup/system.slice /sys/fs/cgroup/system.slice /sys/fs/cgroup/system.slice}
I0227 21:56:31.804588 5229 rapl_msr_util.go:143] failed to open path /dev/cpu/0/msr: no such file or directory
I0227 21:56:31.804602 5229 power.go:64] Not able to obtain power, use estimate method
I0227 21:56:31.804609 5229 bcc_attacher.go:144] hardeware counter metrics config true
I0227 21:56:31.804613 5229 bcc_attacher.go:162] irq counter metrics config true
I0227 21:56:31.806024 5229 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/system.slice, this likely cause all cgroup metrics to be 0
I0227 21:56:31.806256 5229 utils.go:58] Available ebpf metrics: [cpu_time irq_net_tx irq_net_rx irq_block]
I0227 21:56:31.806264 5229 utils.go:59] Available counter metrics: [cpu_instr cache_miss cpu_cycles cpu_ref_cycles]
I0227 21:56:31.806269 5229 utils.go:60] Available cgroup metrics from cgroup: [cgroupfs_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_ioread_bytes cgroupfs_iowrite_bytes]
I0227 21:56:31.806276 5229 utils.go:61] Available cgroup metrics from kubelet: []
I0227 21:56:31.806282 5229 utils.go:62] Available I/O metrics: [bytes_read bytes_writes]
I0227 21:56:31.806296 5229 model.go:85] Model Config NODE_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0227 21:56:31.806310 5229 lr.go:171] LR Model (AbsModelWeight): no config
I0227 21:56:31.806317 5229 model.go:77] Model AbsModelWeight initiated (false)
I0227 21:56:31.806322 5229 model.go:85] Model Config NODE_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0227 21:56:31.806332 5229 lr.go:164] LR Model (AbsComponentModelWeight): loadWeightFromURLorLocal(/var/lib/kepler/data/KerasCompWeightFullPipeline.json): <nil>
I0227 21:56:31.806338 5229 lr.go:173] LR Model (AbsComponentModelWeight): open /var/lib/kepler/data/KerasCompWeightFullPipeline.json: no such file or directory
I0227 21:56:31.806344 5229 model.go:77] Model AbsComponentModelWeight initiated (false)
I0227 21:56:31.806352 5229 model.go:85] Model Config CONTAINER_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0227 21:56:31.806359 5229 lr.go:171] LR Model (DynModelWeight): no config
I0227 21:56:31.806365 5229 model.go:77] Model DynModelWeight initiated (false)
I0227 21:56:31.806371 5229 model.go:85] Model Config CONTAINER_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0227 21:56:31.806381 5229 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(/var/lib/kepler/data/ScikitMixed.json): <nil>
I0227 21:56:31.806384 5229 lr.go:173] LR Model (DynComponentModelWeight): open /var/lib/kepler/data/ScikitMixed.json: no such file or directory
I0227 21:56:31.806389 5229 model.go:77] Model DynComponentModelWeight initiated (false)
I0227 21:56:31.806400 5229 acpi.go:102] Could not find any ACPI power meter path: lstat /sys/devices/LNXSYSTM:00: no such file or directory
I0227 21:56:31.806409 5229 acpi.go:77] Could not find any ACPI power meter path. Is it a VM?
cannot attach kprobe, probe entry may not exist
perf_event_open: No such file or directory
I0227 21:56:32.246355 5229 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0227 21:56:32.246566 5229 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0227 21:56:32.246588 5229 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
I0227 21:56:32.246596 5229 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=8 -DSET_GROUP_ID]
I0227 21:56:32.246618 5229 resolve_container.go:144] failed to get response: failed to read from "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
I0227 21:56:32.246654 5229 node_energy_collector.go:63] No nodeComponentsEnergy found, node components energy metrics is not exposed
F0227 21:56:32.246750 5229 acpi.go:171] open /sys/devices/system/cpu/cpufreq/: no such file or directory
Kepler seems to expect /sys/devices/system/cpu/cpufreq/, but I don't have this folder:
[root@kvms2p11 ~]# ls /sys/devices/system/cpu
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 dispatching hotplug isolated kernel_max modalias offline online possible present rescan smt uevent vulnerabilities
[root@kvms2p11 ~]# ls /sys/devices/system/cpu/cpu0
address cache configure crash_notes crash_notes_size dedicated hotplug idle_count idle_time_us node0 online polarization subsystem topology uevent
Need to check how to handle this.
With https://github.com/sustainable-computing-io/kepler/pull/551 applied, I am able to see something like:
kepler_process_cache_miss_total{command="in:imjourn",pid="14916766390781214720"} 0
kepler_process_core_joules_total{command="bash",mode="idle",pid="8797500397091553280"} 0
....
Of course, everything is 0 in my environment for now; more updates are needed.
can you filter and sort the metrics?
get_your_metrics | grep kepler_process_core |sort -k 2 -g
It's all 0, I think because some data is not provided, so I'm checking how to feed the data into the model to generate the power/energy values.
kepler_process_core_joules_total{command="bash",mode="dynamic",pid="10961198543066365952"} 0
kepler_process_core_joules_total{command="bash",mode="idle",pid="10961198543066365952"} 0
kepler_process_core_joules_total{command="gmain",mode="dynamic",pid="6631268976326344704"} 0
kepler_process_core_joules_total{command="gmain",mode="idle",pid="6631268976326344704"} 0
kepler_process_core_joules_total{command="in:imjourn",mode="dynamic",pid="14916766390781214720"} 0
kepler_process_core_joules_total{command="in:imjourn",mode="idle",pid="14916766390781214720"} 0
kepler_process_core_joules_total{command="kcompactd0",mode="dynamic",pid="4395513236313604096"} 0
kepler_process_core_joules_total{command="kcompactd0",mode="idle",pid="4395513236313604096"} 0
kepler_process_core_joules_total{command="kepler",mode="dynamic",pid="13771444710545555456"} 0
kepler_process_core_joules_total{command="kepler",mode="idle",pid="13771444710545555456"} 0
kepler_process_core_joules_total{command="khungtaskd",mode="dynamic",pid="4179340454199820288"} 0
kepler_process_core_joules_total{command="khungtaskd",mode="idle",pid="4179340454199820288"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="dynamic",pid="2089670227099910144"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="dynamic",pid="3170534137668829184"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="dynamic",pid="936748722493063168"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="idle",pid="2089670227099910144"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="idle",pid="3170534137668829184"} 0
kepler_process_core_joules_total{command="ksoftirqd/",mode="idle",pid="936748722493063168"} 0
kepler_process_core_joules_total{command="kworker/0:",mode="dynamic",pid="6205115861586411520"} 0
kepler_process_core_joules_total{command="kworker/0:",mode="idle",pid="6205115861586411520"} 0
kepler_process_core_joules_total{command="kworker/1:",mode="dynamic",pid="12906753582090420224"} 0
kepler_process_core_joules_total{command="kworker/1:",mode="idle",pid="12906753582090420224"} 0
kepler_process_core_joules_total{command="kworker/2:",mode="dynamic",pid="4115727109463212032"} 0
kepler_process_core_joules_total{command="kworker/2:",mode="idle",pid="4115727109463212032"} 0
kepler_process_core_joules_total{command="kworker/3:",mode="dynamic",pid="14778562177216282624"} 0
kepler_process_core_joules_total{command="kworker/3:",mode="idle",pid="14778562177216282624"} 0
kepler_process_core_joules_total{command="kworker/4:",mode="dynamic",pid="12617960255985287168"} 0
kepler_process_core_joules_total{command="kworker/4:",mode="idle",pid="12617960255985287168"} 0
kepler_process_core_joules_total{command="kworker/5:",mode="dynamic",pid="12690580799976636416"} 0
kepler_process_core_joules_total{command="kworker/5:",mode="idle",pid="12690580799976636416"} 0
kepler_process_core_joules_total{command="kworker/6:",mode="dynamic",pid="11321486513256005632"} 0
kepler_process_core_joules_total{command="kworker/6:",mode="idle",pid="11321486513256005632"} 0
kepler_process_core_joules_total{command="kworker/7:",mode="dynamic",pid="12690017850023215104"} 0
kepler_process_core_joules_total{command="kworker/7:",mode="idle",pid="12690017850023215104"} 0
kepler_process_core_joules_total{command="kworker/u7",mode="dynamic",pid="12978811176128348160"} 0
kepler_process_core_joules_total{command="kworker/u7",mode="dynamic",pid="9447989068269879296"} 0
kepler_process_core_joules_total{command="kworker/u7",mode="idle",pid="12978811176128348160"} 0
kepler_process_core_joules_total{command="kworker/u7",mode="idle",pid="9447989068269879296"} 0
kepler_process_core_joules_total{command="lsmd",mode="dynamic",pid="14628536014629502976"} 0
kepler_process_core_joules_total{command="lsmd",mode="idle",pid="14628536014629502976"} 0
kepler_process_core_joules_total{command="NetworkMan",mode="dynamic",pid="18303473310563827712"} 0
kepler_process_core_joules_total{command="NetworkMan",mode="idle",pid="18303473310563827712"} 0
kepler_process_core_joules_total{command="rcu_sched",mode="dynamic",pid="1008806316530991104"} 0
kepler_process_core_joules_total{command="rcu_sched",mode="idle",pid="1008806316530991104"} 0
kepler_process_core_joules_total{command="sshd",mode="dynamic",pid="10889140949028438016"} 0
kepler_process_core_joules_total{command="sshd",mode="idle",pid="10889140949028438016"} 0
kepler_process_core_joules_total{command="swapper/0",mode="dynamic",pid="0"} 0
kepler_process_core_joules_total{command="swapper/0",mode="idle",pid="0"} 0
kepler_process_core_joules_total{command="systemd-jo",mode="dynamic",pid="2234629840105897984"} 0
kepler_process_core_joules_total{command="systemd-jo",mode="idle",pid="2234629840105897984"} 0
kepler_process_core_joules_total{command="systemd-lo",mode="dynamic",pid="15637342331160494080"} 0
kepler_process_core_joules_total{command="systemd-lo",mode="idle",pid="15637342331160494080"} 0
kepler_process_core_joules_total{command="systemd",mode="dynamic",pid="72057594037927936"} 0
kepler_process_core_joules_total{command="systemd",mode="idle",pid="72057594037927936"} 0
kepler_process_core_joules_total{command="xfsaild/dm",mode="dynamic",pid="12970929876780449792"} 0
kepler_process_core_joules_total{command="xfsaild/dm",mode="idle",pid="12970929876780449792"} 0
@jichenjc can you check other metrics, e.g. kepler_process_cpu_cpu_time?
At the moment, the process model is based on perf counters:
{
  "pkg": {"All_Weights": {
    "Bias_Weight": 24.388564716241596,
    "Categorical_Variables": {},
    "Numerical_Variables": {
      "cpu_cycles": {"mean": 16455473194.342838, "variance": 5.6028839338593524e+20, "weight": 15.858373957810427},
      "cache_miss": {"mean": 19615158.772131525, "variance": 83203515341825.39, "weight": 0.0},
      "cpu_instr": {"mean": 23490652312.518856, "variance": 3.0816587041591017e+21, "weight": 8.25749138735891}
    }
  }},
  "dram": {"All_Weights": {
    "Bias_Weight": 0.8318076441807906,
    "Categorical_Variables": {},
    "Numerical_Variables": {
      "cpu_cycles": {"mean": 16455473194.342838, "variance": 5.6028839338593524e+20, "weight": 0.146880994775066},
      "cache_miss": {"mean": 19615158.772131525, "variance": 83203515341825.39, "weight": 0.22602678738192125},
      "cpu_instr": {"mean": 23490652312.518856, "variance": 3.0816587041591017e+21, "weight": 0.0}
    }
  }}
}
kepler_process_cpu_cpu_time_us{command="kworker/3:",pid="10310146921934618624"} 7.205759403792794e+16
kepler_process_cpu_cpu_time_us{command="bash",pid="4548354148667490304"} 2.161727821137838e+17
kepler_process_cpu_cpu_time_us{command="swapper/2",pid="0"} 3.627368024870224e+18
Those 3 have values; the others are all 0.
@jichenjc sounds good. Can you create a config file and set the process model (follow this function) to the BPFOnly model?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The goal of this issue was to summarize the IBM s390x porting work.