@rootfs could you please share the Kepler logs of the latest code?
Sharing my local testing here. Running Kepler on Fedora 40 with:
$ sudo ENABLE_PROCESS_METRICS=true ./_output/bin/linux_amd64/kepler
I see some processes that are 0, but plenty that are non-zero:
$ curl localhost:8888/metrics | grep kepler_process_bpf_cpu_time_ms_total |sort -k 2 -g
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="442204",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="prometheus",container_id="39531a28d61d45452f2861f02fc3552c2e06a7496e31962ec92202c585d2447e",pid="3574948",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="kubelet",container_id="system_processes",pid="3571585",source="bpf",vm_id=""} 0.045
kepler_process_bpf_cpu_time_ms_total{command="sshd",container_id="system_processes",pid="441843",source="bpf",vm_id=""} 0.06
kepler_process_bpf_cpu_time_ms_total{command="containerd",container_id="system_processes",pid="3569041",source="bpf",vm_id=""} 0.197
kepler_process_bpf_cpu_time_ms_total{command="kubelet",container_id="system_processes",pid="3569914",source="bpf",vm_id=""} 0.295
kepler_process_bpf_cpu_time_ms_total{command="grafana-server",container_id="6bc252d0b5144c72b556df201bff50b76d3ff9c92b1e64070cf6e1d306676e9d",pid="3575172",source="bpf",vm_id=""} 1.081
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="442190",source="bpf",vm_id=""} 1.59
kepler_process_bpf_cpu_time_ms_total{command="zsh",container_id="system_processes",pid="504330",source="bpf",vm_id=""} 4.319
kepler_process_bpf_cpu_time_ms_total{command="kworker/u64:5",container_id="kernel_processes",pid="482503",source="bpf",vm_id=""} 36736.108
EDIT: Running as a container on F40 gives similar results. I'll test on RHEL shortly.
@marceloamaral here are the log and metrics. Note that a fraction of CPU time is reported, but it is not correct:
# curl localhost:8888/metrics |grep bpf_cpu_time |sort -k 2 -g
# HELP kepler_container_bpf_cpu_time_ms_total Aggregated value in bpf_cpu_time_ms value from bpf
# HELP kepler_process_bpf_cpu_time_ms_total Aggregated value in bpf_cpu_time_ms value from bpf
# TYPE kepler_container_bpf_cpu_time_ms_total counter
# TYPE kepler_process_bpf_cpu_time_ms_total counter
kepler_container_bpf_cpu_time_ms_total{container_id="system_processes",container_name="system_processes",container_namespace="system",pod_name="system_processes",source="bpf"} 0
kepler_process_bpf_cpu_time_ms_total{command="(systemd)",container_id="system_processes",pid="1437591",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="auditd",container_id="system_processes",pid="906",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="bash",container_id="system_processes",pid="1533232",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="chronyd",container_id="system_processes",pid="940",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="code-89de5a8d4d",container_id="system_processes",pid="1533274",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="coredns",container_id="system_processes",pid="997911",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="coredns",container_id="system_processes",pid="997933",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="etcd",container_id="system_processes",pid="996366",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="grafana-server",container_id="system_processes",pid="1258",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="kindnetd",container_id="system_processes",pid="997347",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="kube-proxy",container_id="system_processes",pid="997230",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="manager",container_id="system_processes",pid="998644",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="manager",container_id="system_processes",pid="998839",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="manager",container_id="system_processes",pid="998912",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="1533315",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="1533359",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="1533370",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="1533387",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmcd",container_id="system_processes",pid="1471",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmdalinux",container_id="system_processes",pid="1478",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmdaproc",container_id="system_processes",pid="1476",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmie",container_id="system_processes",pid="1672189",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmie",container_id="system_processes",pid="1672428",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmlogger",container_id="system_processes",pid="3444157",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmlogger",container_id="system_processes",pid="3444480",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmproxy",container_id="system_processes",pid="1026281",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="redis-server",container_id="system_processes",pid="1281",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="rhsmcertd",container_id="system_processes",pid="1024",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="sshd",container_id="system_processes",pid="1533228",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="systemd-journal",container_id="system_processes",pid="761",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="systemd-logind",container_id="system_processes",pid="945",source="bpf",vm_id=""} 0
kepler_container_bpf_cpu_time_ms_total{container_id="kernel_processes",container_name="kernel_processes",container_namespace="kernel",pod_name="kernel_processes",source="bpf"} 0.435
kepler_process_bpf_cpu_time_ms_total{command="kworker/u32:2",container_id="kernel_processes",pid="461311",source="bpf",vm_id=""} 0.435
# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
af8df9491396 docker.io/kindest/node@sha256:047357ac0cfea04663786a612ba1eaba9702bef25227a794b52890dd8bcd692e 9 days ago Up 9 days 127.0.0.1:34371->6443/tcp kind-control-plane
b24fc0b33df6 quay.io/sustainable_computing_io/kepler:latest 19 hours ago Up 19 hours kepler
# podman logs kepler
I0613 15:44:12.517466 594556 gpu.go:38] Trying to initialize GPU collector using dcgm
W0613 15:44:12.517620 594556 gpu_dcgm.go:104] There is no DCGM daemon running in the host: libdcgm.so not Found
W0613 15:44:12.517644 594556 gpu_dcgm.go:108] Could not start DCGM. Error: libdcgm.so not Found
I0613 15:44:12.517650 594556 gpu.go:45] Error initializing dcgm: not able to connect to DCGM: libdcgm.so not Found
I0613 15:44:12.517653 594556 gpu.go:38] Trying to initialize GPU collector using nvidia-nvml
I0613 15:44:12.517686 594556 gpu.go:45] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0613 15:44:12.517690 594556 gpu.go:38] Trying to initialize GPU collector using dummy
I0613 15:44:12.517693 594556 gpu.go:42] Using dummy to obtain gpu power
I0613 15:44:12.518835 594556 exporter.go:100] Kepler running on version: latest
I0613 15:44:12.518861 594556 config.go:280] using gCgroup ID in the BPF program: true
I0613 15:44:12.518885 594556 config.go:282] kernel version: 5.14
I0613 15:44:12.518961 594556 config.go:307] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0613 15:44:12.518995 594556 redfish.go:169] failed to get redfish credential file path
libbpf: loading /var/lib/kepler/bpfassets/kepler.bpfel.o
libbpf: elf: section(3) tp/sched/sched_switch, size 1952, link 0, flags 6, type=1
libbpf: sec 'tp/sched/sched_switch': found program 'kepler_sched_switch_trace' at insn offset 0 (0 bytes), code size 244 insns (1952 bytes)
libbpf: elf: section(4) .reltp/sched/sched_switch, size 320, link 32, flags 40, type=9
libbpf: elf: section(5) tp/irq/softirq_entry, size 144, link 0, flags 6, type=1
libbpf: sec 'tp/irq/softirq_entry': found program 'kepler_irq_trace' at insn offset 0 (0 bytes), code size 18 insns (144 bytes)
libbpf: elf: section(6) .reltp/irq/softirq_entry, size 16, link 32, flags 40, type=9
libbpf: elf: section(7) fexit/mark_page_accessed, size 104, link 0, flags 6, type=1
libbpf: sec 'fexit/mark_page_accessed': found program 'kepler_read_page_trace' at insn offset 0 (0 bytes), code size 13 insns (104 bytes)
libbpf: elf: section(8) .relfexit/mark_page_accessed, size 16, link 32, flags 40, type=9
libbpf: elf: section(9) tp/writeback/writeback_dirty_folio, size 104, link 0, flags 6, type=1
libbpf: sec 'tp/writeback/writeback_dirty_folio': found program 'kepler_write_page_trace' at insn offset 0 (0 bytes), code size 13 insns (104 bytes)
libbpf: elf: section(10) .reltp/writeback/writeback_dirty_folio, size 16, link 32, flags 40, type=9
libbpf: elf: section(11) .rodata.config, size 4, link 0, flags 2, type=1
libbpf: elf: section(12) .bss, size 4, link 0, flags 3, type=8
libbpf: elf: section(13) .maps, size 288, link 0, flags 3, type=1
libbpf: elf: section(14) license, size 13, link 0, flags 3, type=1
libbpf: license of /var/lib/kepler/bpfassets/kepler.bpfel.o is Dual BSD/GPL
libbpf: elf: section(23) .BTF, size 5063, link 0, flags 0, type=1
libbpf: elf: section(25) .BTF.ext, size 2104, link 0, flags 0, type=1
libbpf: elf: section(32) .symtab, size 1080, link 1, flags 0, type=2
libbpf: looking for externs among 45 symbols...
libbpf: collected 0 externs total
libbpf: map 'pid_tgid_map': at sec_idx 13, offset 0.
libbpf: map 'pid_tgid_map': found type = 9.
libbpf: map 'pid_tgid_map': found key [6], sz = 4.
libbpf: map 'pid_tgid_map': found value [6], sz = 4.
libbpf: map 'pid_tgid_map': found max_entries = 32768.
libbpf: map 'processes': at sec_idx 13, offset 32.
libbpf: map 'processes': found type = 9.
libbpf: map 'processes': found key [6], sz = 4.
libbpf: map 'processes': found value [14], sz = 96.
libbpf: map 'processes': found max_entries = 32768.
libbpf: map 'pid_time_map': at sec_idx 13, offset 64.
libbpf: map 'pid_time_map': found type = 9.
libbpf: map 'pid_time_map': found key [6], sz = 4.
libbpf: map 'pid_time_map': found value [16], sz = 8.
libbpf: map 'pid_time_map': found max_entries = 32768.
libbpf: map 'cpu_cycles_event_reader': at sec_idx 13, offset 96.
libbpf: map 'cpu_cycles_event_reader': found type = 4.
libbpf: map 'cpu_cycles_event_reader': found key [2], sz = 4.
libbpf: map 'cpu_cycles_event_reader': found value [6], sz = 4.
libbpf: map 'cpu_cycles_event_reader': found max_entries = 128.
libbpf: map 'cpu_cycles': at sec_idx 13, offset 128.
libbpf: map 'cpu_cycles': found type = 2.
libbpf: map 'cpu_cycles': found key [6], sz = 4.
libbpf: map 'cpu_cycles': found value [16], sz = 8.
libbpf: map 'cpu_cycles': found max_entries = 128.
libbpf: map 'cpu_instructions_event_reader': at sec_idx 13, offset 160.
libbpf: map 'cpu_instructions_event_reader': found type = 4.
libbpf: map 'cpu_instructions_event_reader': found key [2], sz = 4.
libbpf: map 'cpu_instructions_event_reader': found value [6], sz = 4.
libbpf: map 'cpu_instructions_event_reader': found max_entries = 128.
libbpf: map 'cpu_instructions': at sec_idx 13, offset 192.
libbpf: map 'cpu_instructions': found type = 2.
libbpf: map 'cpu_instructions': found key [6], sz = 4.
libbpf: map 'cpu_instructions': found value [16], sz = 8.
libbpf: map 'cpu_instructions': found max_entries = 128.
libbpf: map 'cache_miss_event_reader': at sec_idx 13, offset 224.
libbpf: map 'cache_miss_event_reader': found type = 4.
libbpf: map 'cache_miss_event_reader': found key [2], sz = 4.
libbpf: map 'cache_miss_event_reader': found value [6], sz = 4.
libbpf: map 'cache_miss_event_reader': found max_entries = 128.
libbpf: map 'cache_miss': at sec_idx 13, offset 256.
libbpf: map 'cache_miss': found type = 2.
libbpf: map 'cache_miss': found key [6], sz = 4.
libbpf: map 'cache_miss': found value [16], sz = 8.
libbpf: map 'cache_miss': found max_entries = 128.
libbpf: map '.rodata.config' (global data): at sec_idx 11, offset 0, flags 80.
libbpf: map 9 is ".rodata.config"
libbpf: map 'kepler.bss' (global data): at sec_idx 12, offset 0, flags 400.
libbpf: map 10 is "kepler.bss"
libbpf: sec '.reltp/sched/sched_switch': collecting relocation for section(3) 'tp/sched/sched_switch'
libbpf: sec '.reltp/sched/sched_switch': relo #0: insn #21 against 'cpu_cycles_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 3 (cpu_cycles_event_reader, sec 13, off 96) for insn #21
libbpf: sec '.reltp/sched/sched_switch': relo #1: insn #33 against 'cpu_cycles'
libbpf: prog 'kepler_sched_switch_trace': found map 4 (cpu_cycles, sec 13, off 128) for insn #33
libbpf: sec '.reltp/sched/sched_switch': relo #2: insn #47 against 'cpu_cycles'
libbpf: prog 'kepler_sched_switch_trace': found map 4 (cpu_cycles, sec 13, off 128) for insn #47
libbpf: sec '.reltp/sched/sched_switch': relo #3: insn #57 against 'cpu_instructions_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 5 (cpu_instructions_event_reader, sec 13, off 160) for insn #57
libbpf: sec '.reltp/sched/sched_switch': relo #4: insn #67 against 'cpu_instructions'
libbpf: prog 'kepler_sched_switch_trace': found map 6 (cpu_instructions, sec 13, off 192) for insn #67
libbpf: sec '.reltp/sched/sched_switch': relo #5: insn #79 against 'cpu_instructions'
libbpf: prog 'kepler_sched_switch_trace': found map 6 (cpu_instructions, sec 13, off 192) for insn #79
libbpf: sec '.reltp/sched/sched_switch': relo #6: insn #90 against 'cache_miss_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 7 (cache_miss_event_reader, sec 13, off 224) for insn #90
libbpf: sec '.reltp/sched/sched_switch': relo #7: insn #101 against 'cache_miss'
libbpf: prog 'kepler_sched_switch_trace': found map 8 (cache_miss, sec 13, off 256) for insn #101
libbpf: sec '.reltp/sched/sched_switch': relo #8: insn #115 against 'cache_miss'
libbpf: prog 'kepler_sched_switch_trace': found map 8 (cache_miss, sec 13, off 256) for insn #115
libbpf: sec '.reltp/sched/sched_switch': relo #9: insn #122 against 'pid_time_map'
libbpf: prog 'kepler_sched_switch_trace': found map 2 (pid_time_map, sec 13, off 64) for insn #122
libbpf: sec '.reltp/sched/sched_switch': relo #10: insn #129 against 'pid_time_map'
libbpf: prog 'kepler_sched_switch_trace': found map 2 (pid_time_map, sec 13, off 64) for insn #129
libbpf: sec '.reltp/sched/sched_switch': relo #11: insn #141 against '.rodata.config'
libbpf: prog 'kepler_sched_switch_trace': found data map 9 (.rodata.config, sec 11, off 0) for insn 141
libbpf: sec '.reltp/sched/sched_switch': relo #12: insn #148 against 'counter_sched_switch'
libbpf: prog 'kepler_sched_switch_trace': found data map 10 (kepler.bss, sec 12, off 0) for insn 148
libbpf: sec '.reltp/sched/sched_switch': relo #13: insn #157 against '.rodata.config'
libbpf: prog 'kepler_sched_switch_trace': found data map 9 (.rodata.config, sec 11, off 0) for insn 157
libbpf: sec '.reltp/sched/sched_switch': relo #14: insn #165 against 'pid_tgid_map'
libbpf: prog 'kepler_sched_switch_trace': found map 0 (pid_tgid_map, sec 13, off 0) for insn #165
libbpf: sec '.reltp/sched/sched_switch': relo #15: insn #170 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (processes, sec 13, off 32) for insn #170
libbpf: sec '.reltp/sched/sched_switch': relo #16: insn #194 against 'pid_time_map'
libbpf: prog 'kepler_sched_switch_trace': found map 2 (pid_time_map, sec 13, off 64) for insn #194
libbpf: sec '.reltp/sched/sched_switch': relo #17: insn #204 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (processes, sec 13, off 32) for insn #204
libbpf: sec '.reltp/sched/sched_switch': relo #18: insn #230 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (processes, sec 13, off 32) for insn #230
libbpf: sec '.reltp/sched/sched_switch': relo #19: insn #237 against 'pid_tgid_map'
libbpf: prog 'kepler_sched_switch_trace': found map 0 (pid_tgid_map, sec 13, off 0) for insn #237
libbpf: sec '.reltp/irq/softirq_entry': collecting relocation for section(5) 'tp/irq/softirq_entry'
libbpf: sec '.reltp/irq/softirq_entry': relo #0: insn #6 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 1 (processes, sec 13, off 32) for insn #6
libbpf: sec '.relfexit/mark_page_accessed': collecting relocation for section(7) 'fexit/mark_page_accessed'
libbpf: sec '.relfexit/mark_page_accessed': relo #0: insn #4 against 'processes'
libbpf: prog 'kepler_read_page_trace': found map 1 (processes, sec 13, off 32) for insn #4
libbpf: sec '.reltp/writeback/writeback_dirty_folio': collecting relocation for section(9) 'tp/writeback/writeback_dirty_folio'
libbpf: sec '.reltp/writeback/writeback_dirty_folio': relo #0: insn #4 against 'processes'
libbpf: prog 'kepler_write_page_trace': found map 1 (processes, sec 13, off 32) for insn #4
I0613 15:44:12.532841 594556 exporter.go:140] 16 CPU cores detected. Resizing eBPF Perf Event Arrays
libbpf: loading kernel BTF '/sys/kernel/btf/vmlinux': 0
libbpf: map 'pid_tgid_map': created successfully, fd=8
libbpf: map 'processes': created successfully, fd=9
libbpf: map 'pid_time_map': created successfully, fd=10
libbpf: map 'cpu_cycles_event_reader': created successfully, fd=11
libbpf: map 'cpu_cycles': created successfully, fd=12
libbpf: map 'cpu_instructions_event_reader': created successfully, fd=13
libbpf: map 'cpu_instructions': created successfully, fd=14
libbpf: map 'cache_miss_event_reader': created successfully, fd=15
libbpf: map 'cache_miss': created successfully, fd=16
libbpf: map '.rodata.config': created successfully, fd=17
libbpf: map 'kepler.bss': created successfully, fd=18
I0613 15:44:12.559884 594556 exporter.go:260] Successfully load eBPF module from libbpf object
I0613 15:44:12.560438 594556 watcher.go:68] Using in cluster k8s config
I0613 15:44:12.560443 594556 watcher.go:75] failed to get config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
I0613 15:44:12.560449 594556 watcher.go:128] k8s APIserver watcher was not enabled
I0613 15:44:12.560469 594556 prometheus_collector.go:90] Registered Process Prometheus metrics
I0613 15:44:12.560488 594556 prometheus_collector.go:95] Registered Container Prometheus metrics
I0613 15:44:12.560497 594556 prometheus_collector.go:100] Registered VM Prometheus metrics
I0613 15:44:12.560507 594556 prometheus_collector.go:104] Registered Node Prometheus metrics
I0613 15:44:12.560584 594556 exporter.go:201] starting to listen on 0.0.0.0:8888
I0613 15:44:12.560791 594556 exporter.go:215] Started Kepler in 41.964788ms
In fact, most of the BPF metrics (CPU time, CPU instructions, etc.) are now 0. Only pid 0 or a single kernel pid gets all the metrics.
I checked 0.7.2 (our last version using the sched_switch tracepoint); it shows the same symptom. 0.7.3 (which used a kprobe) through 0.7.8 work correctly for me (i.e. all active processes have BPF metrics).
The root cause appears to be incorrect padding in the tracepoint context struct.
On my RHEL 9 system, here is the sched_switch format:
# cat /sys/kernel/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 317
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned char common_preempt_lazy_count; offset:8; size:1; signed:0;
field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
field:pid_t prev_pid; offset:28; size:4; signed:1;
field:int prev_prio; offset:32; size:4; signed:1;
field:long prev_state; offset:40; size:8; signed:1;
field:char next_comm[TASK_COMM_LEN]; offset:48; size:16; signed:1;
field:pid_t next_pid; offset:64; size:4; signed:1;
field:int next_prio; offset:68; size:4; signed:1;
print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, (REC->prev_state & ((((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) - 1)) ? __print_flags(REC->prev_state & ((((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) - 1), "|", { 0x0001, "S" }, { 0x0002, "D" }, { 0x0004, "T" }, { 0x0008, "t" }, { 0x0010, "X" }, { 0x0020, "Z" }, { 0x0040, "P" }, { 0x0080, "I" }) : "R", REC->prev_state & (((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio
The common fields plus alignment padding occupy 12 bytes before prev_comm, not 8.
After applying the right padding, the CPU time is back:
kepler_process_bpf_cpu_time_ms_total{command="sshd",container_id="system_processes",pid="2444904",source="bpf",vm_id=""} 0.04
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="2445013",source="bpf",vm_id=""} 0.096
kepler_process_bpf_cpu_time_ms_total{command="prometheus",container_id="9421e5dd315b98cf271ccbb060f55731a9d484d77a796d756e587746c0490983",pid="17376",source="bpf",vm_id=""} 0.097
kepler_process_bpf_cpu_time_ms_total{command="etcd",container_id="274c41da423ac675bdd6d44d1204ada79e4aaa0ea4908dfb69991c6315409585",pid="12013",source="bpf",vm_id=""} 0.101
kepler_process_bpf_cpu_time_ms_total{command="systemd",container_id="system_processes",pid="10035",source="bpf",vm_id=""} 0.218
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="2445057",source="bpf",vm_id=""} 0.556
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="2444188",source="bpf",vm_id=""} 0.56
kepler_process_bpf_cpu_time_ms_total{command="gsd-housekeepin",container_id="system_processes",pid="10492",source="bpf",vm_id=""} 1.652
kepler_process_bpf_cpu_time_ms_total{command="ps",container_id="system_processes",pid="2877291",source="bpf",vm_id=""} 10.911
kepler_process_bpf_cpu_time_ms_total{command="kubelet",container_id="system_processes",pid="2211",source="bpf",vm_id=""} 11.975
kepler_process_bpf_cpu_time_ms_total{command="adapter",container_id="55decdb48c42e4592573692fcd287a2a4755ece0ed4d3230edbedda277bd2ee4",pid="17216",source="bpf",vm_id=""} 26.682
kepler_process_bpf_cpu_time_ms_total{command="gvfsd-trash",container_id="system_processes",pid="11157",source="bpf",vm_id=""} 39.622
kepler_process_bpf_cpu_time_ms_total{command="gvfs-udisks2-vo",container_id="system_processes",pid="10414",source="bpf",vm_id=""} 56.6
kepler_process_bpf_cpu_time_ms_total{command="swapper/2",container_id="kernel_processes",pid="0",source="bpf",vm_id=""} 12480.942
Here is the diff:
# git diff
diff --git a/bpfassets/libbpf/src/kepler.bpf.c b/bpfassets/libbpf/src/kepler.bpf.c
index cac1923c..ea764931 100644
--- a/bpfassets/libbpf/src/kepler.bpf.c
+++ b/bpfassets/libbpf/src/kepler.bpf.c
@@ -197,8 +197,11 @@ static inline void collect_metrics_and_reset_counters(
// This struct is defined according to the following format file:
// /sys/kernel/tracing/events/sched/sched_switch/format
struct sched_switch_info {
- /* The first 8 bytes is not allowed to read */
- u64 pad;
+ unsigned short common_type;
+ unsigned char common_flags;
+ unsigned char common_preempt_count;
+ int common_pid;
+ unsigned char common_preempt_lazy_count;
char prev_comm[16];
pid_t prev_pid;
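For reference, nesting the common fields in their own sub-struct reproduces the format file's offsets naturally, because the int member pads the header out to 12 bytes. Here is a small, hypothetical host-side check (not part of Kepler) against the RHEL 9 format quoted above:

```c
/* offsets_check.c: hypothetical sanity check against the RHEL 9
 * sched_switch format file quoted above.
 * Build: cc -std=c11 -c offsets_check.c */
#include <assert.h>
#include <stddef.h>

/* Common header as reported on RHEL 9; sizeof is 12 because the
 * int member forces 4-byte struct alignment, padding bytes 9..11. */
struct trace_entry_rt {
	unsigned short common_type;              /* offset 0 */
	unsigned char common_flags;              /* offset 2 */
	unsigned char common_preempt_count;      /* offset 3 */
	int common_pid;                          /* offset 4 */
	unsigned char common_preempt_lazy_count; /* offset 8 */
};

struct sched_switch_info_rt {
	struct trace_entry_rt ent; /* bytes 0..11 */
	char prev_comm[16];
	int prev_pid;
	int prev_prio;
	long prev_state;
	char next_comm[16];
	int next_pid;
	int next_prio;
};

/* These mirror the offset: values in the format file. */
static_assert(offsetof(struct sched_switch_info_rt, prev_comm) == 12, "prev_comm");
static_assert(offsetof(struct sched_switch_info_rt, prev_pid) == 28, "prev_pid");
static_assert(offsetof(struct sched_switch_info_rt, prev_state) == 40, "prev_state");
static_assert(offsetof(struct sched_switch_info_rt, next_comm) == 48, "next_comm");
static_assert(offsetof(struct sched_switch_info_rt, next_pid) == 64, "next_pid");

int main(void) { return 0; }
```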
@dave-tucker @marceloamaral can you check the paddings on your env? If the paddings differ (as in this case), we should switch back to kprobe for consistency.
The point of tracepoints is that the ABI is stable. kprobes do not have a stable ABI and therefore could change between kernel versions.
Let me look into this and see what options we have.
@rootfs ok, so I've got 2 options for you:
EDIT: There is only 1 option. #1543 worked fine for the sched_switch tracepoint, but not for the softirq_entry tracepoint, which was also affected by this issue.
Interesting:
In @rootfs's RHEL 9 case, we have:
# cat /sys/kernel/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 317
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned char common_preempt_lazy_count; offset:8; size:1; signed:0;
common_type = 2 * 8 = 16 bits
common_flags = 1 * 8 = 8 bits
common_preempt_count = 1 * 8 = 8 bits
common_pid = 4 * 8 = 32 bits
common_preempt_lazy_count = 1 * 8 = 8 bits
Total size in bits = 16 + 8 + 8 + 32 + 8 = 72 bits (9 bytes, padded to a 12-byte common header before prev_comm at offset 12)
On my Ubuntu system:
sudo cat /sys/kernel/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 318
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[16]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[16]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, (REC->prev_state & ((((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) - 1)) ? __print_flags(REC->prev_state & ((((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) - 1), "|", { 0x0001, "S" }, { 0x0002, "D" }, { 0x0004, "T" }, { 0x0008, "t" }, { 0x0010, "X" }, { 0x0020, "Z" }, { 0x0040, "P" }, { 0x0080, "I" }) : "R", REC->prev_state & (((0x0000 | 0x0001 | 0x0002 | 0x0004 | 0x0008 | 0x0010 | 0x0020 | 0x0040) + 1) << 1) ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio
common_type = 2 * 8 = 16 bits
common_flags = 1 * 8 = 8 bits
common_preempt_count = 1 * 8 = 8 bits
common_pid = 4 * 8 = 32 bits
Total size in bits = 16 + 8 + 8 + 32 = 64 bits (8 bytes, matching prev_comm at offset 8)
Basically, RHEL 9 has an extra field, common_preempt_lazy_count.
....
Although tracepoints are intended to have a stable Application Binary Interface (ABI), ensuring that their definitions and the data they expose are consistent across kernel versions, we can see that the sched_switch structure differs between RHEL 9 and Ubuntu, which breaks this assumption.
On the other hand, although kprobes do not have a stable ABI, using eBPF to gather information within kprobes can provide a stable tracing mechanism across different operating systems, because the eBPF macros appear to be stable across OSes.
Also, this paper shows that eBPF (Gregg's bpftrace) has similar performance for kprobe and tracepoint entry.
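As a reference point, here is a minimal sketch of the kprobe style discussed above, assuming finish_task_switch as the attach point (a common choice for sched-switch accounting; not necessarily what Kepler 0.7.3-0.7.8 used):

```c
// kprobe_sketch.bpf.c: minimal sketch of the kprobe alternative.
// The BPF_KPROBE/CO-RE macros stay stable across kernels even though
// the probed function itself is not ABI; e.g. on some kernels the
// symbol is inlined/renamed (finish_task_switch.isra.0), which is
// exactly the kprobe caveat mentioned above.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

SEC("kprobe/finish_task_switch")
int BPF_KPROBE(kepler_finish_task_switch, struct task_struct *prev)
{
	// Read the previous task's pid via CO-RE, so task_struct layout
	// changes between kernel versions are relocated at load time.
	pid_t prev_pid = BPF_CORE_READ(prev, pid);
	u64 curr = bpf_get_current_pid_tgid();

	bpf_printk("switch: prev=%d next=%d", prev_pid, (pid_t)curr);
	return 0;
}
```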
I am wondering if we can actually detect whether the field is present using bpf_core_field_exists and typecast to an appropriate struct: https://github.com/libbpf/libbpf/blob/42065ea6627ff6e1ab4c65e51042a70fbf30ff7c/src/bpf_core_read.h#L187C9-L187C30
See: https://nakryiko.com/posts/bpf-core-reference-guide/#dealing-with-kernel-changes-and-feature-detection
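A rough sketch of how that detection could look, assuming a vmlinux.h/CO-RE build (the trace_entry___rt "flavor" below is a local definition, not a kernel type, and this is not Kepler's actual code):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

// Local "flavor" of trace_entry; the ___rt suffix is ignored when
// CO-RE matches it against the running kernel's BTF, so this compiles
// everywhere and only resolves at load time.
struct trace_entry___rt {
	unsigned char preempt_lazy_count;
} __attribute__((preserve_access_index));

SEC("tp/sched/sched_switch")
int kepler_sched_switch_trace(struct trace_event_raw_sched_switch *ctx)
{
	pid_t prev_pid, next_pid;

	// Feature detection: true only on kernels (e.g. the RHEL 9 case
	// above) whose trace_entry carries preempt_lazy_count, i.e. the
	// 12-byte common header.
	if (bpf_core_field_exists(struct trace_entry___rt, preempt_lazy_count))
		bpf_printk("lazy-preempt tracepoint layout detected");

	// With CO-RE the explicit check may not even be needed for field
	// access: BPF_CORE_READ relocates the prev_pid/next_pid offsets
	// against the kernel's BTF, handling both layouts transparently.
	prev_pid = BPF_CORE_READ(ctx, prev_pid);
	next_pid = BPF_CORE_READ(ctx, next_pid);
	bpf_printk("sched_switch %d -> %d", prev_pid, next_pid);
	return 0;
}
```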
What happened?
We need BPF CPU time to estimate dynamic power on VMs, but the metrics have recently become unstable.
In release-0.7.8, kepler_process_bpf_cpu_time_ms_total is correct.
In release-0.7.10, all the CPU time goes to pid 0.
In the latest code, all CPU time is zero.
What did you expect to happen?
BPF CPU time should be reported for all active processes, since we need it to estimate dynamic power on VMs.
How can we reproduce it (as minimally and precisely as possible)?
Start the container-kepler service on RHEL 9 and run the curl command shown above.