Also, Kepler does not export metrics for new pods. I can only see the metrics for pods that were already on the platform before Kepler's deployment. I need to re-deploy Kepler to see the metrics of such new pods.
I reported this before, but I think your version is 0d3e6ce, which is a really new version, so can you report this with more detail in another issue? @andersonandrei
From those logs I am wondering whether 1) you didn't enable cgroup v2 and 2) you didn't enable eBPF. It seems you are using an Azure VM, and I don't know whether those can be enabled there.
I0301 14:42:12.708526 1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548 1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250 1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780 1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866 1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922 1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990 1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
@andersonandrei the msr error is benign: if MSR cannot be accessed (typical on VMs), the power calculation model is switched to a linear-regression-based method.
What metrics can you see? Can you post them here? In addition, can you change the verbosity to 5 (like below) and share the log?
containers:
- args:
- /usr/bin/kepler -v=5
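One way to apply that change, as a sketch (this assumes the kepler-exporter DaemonSet name used later in this thread; the namespace may be kepler or monitoring depending on how Kepler was installed):
kubectl -n monitoring edit daemonset/kepler-exporter             # change the args entry to "/usr/bin/kepler -v=5"
kubectl -n monitoring rollout status daemonset/kepler-exporter   # wait for the pod to restart with the new flag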
From those logs I am wondering whether 1) you didn't enable cgroup v2 and 2) you didn't enable eBPF. It seems you are using an Azure VM, and I don't know whether those can be enabled there.
I0301 14:42:12.708526 1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548 1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250 1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780 1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866 1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922 1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990 1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I can see that cgroup2 is enabled:
root@server-vmss000001:/# grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
root@server-vmss000001:/# uname -a
Linux server-vmss000001 5.4.0-1091-azure #96~18.04.1-Ubuntu SMP Tue Aug 30 19:15:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
But I'm not sure how to check if eBPF is enabled. Can you help me, please?
Thanks a lot!
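For the eBPF question above, a minimal sketch of common checks (exact config names and paths vary by distro, and on managed VMs the kernel config may not be exposed at all):
grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_HAVE_EBPF_JIT=' /boot/config-$(uname -r)
zgrep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=' /proc/config.gz   # if /boot/config-* is not present
mount | grep bpf                                             # a mounted bpffs is another hint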
Also, Kepler does not export metrics for new pods. I can only see the metrics for pods that were already on the platform before Kepler's deployment. I need to re-deploy Kepler to see the metrics of such new pods.
I reported this before, but I think your version is 0d3e6ce, which is a really new version, so can you report this with more detail in another issue? @andersonandrei
Hello @jichenjc, thanks for your answer.
I can try to provide more details, but I'm not sure if I should open a new issue for that, or if I should comment on this issue. What do you suggest?
Thanks!
Let's keep this issue open and share comments here.
@andersonandrei the msr error is benign: if MSR cannot be accessed (typical on VMs), the power calculation model is switched to a linear-regression-based method. What metrics can you see? Can you post them here? In addition, can you change the verbosity to 5 (like below) and share the log?
containers:
- args:
- /usr/bin/kepler -v=5
Hello @rootfs, thanks for your answer.
I just updated the verbosity:
I0302 15:26:45.355370 1 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0302 15:26:45.356652 1 exporter.go:150] Kepler running on version: 71ef9dc
I0302 15:26:45.356831 1 config.go:153] using gCgroup ID in the BPF program: true
I0302 15:26:45.356983 1 config.go:154] kernel version: 5.4
I0302 15:26:45.357110 1 config.go:172] EnabledGPU: true
I0302 15:26:45.357253 1 slice_handler.go:145] InitSliceHandler: &{map[] /sys/fs/cgroup/cpu/system.slice /sys/fs/cgroup/memory/system.slice /sys/fs/cgroup/blkio/system.slice}
I0302 15:26:45.357503 1 rapl_msr_util.go:143] failed to open path /dev/cpu/1/msr: no such file or directory
I0302 15:26:45.357662 1 power.go:64] Not able to obtain power, use estimate method
I0302 15:26:45.357796 1 bcc_attacher.go:144] hardeware counter metrics config true
I0302 15:26:45.357868 1 bcc_attacher.go:162] irq counter metrics config true
I0302 15:26:45.360564 1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0302 15:26:45.543243 1 utils.go:58] Available ebpf metrics: [cpu_time irq_net_tx irq_net_rx irq_block]
I0302 15:26:45.543521 1 utils.go:59] Available counter metrics: [cpu_cycles cpu_ref_cycles cpu_instr cache_miss]
I0302 15:26:45.543534 1 utils.go:60] Available cgroup metrics from cgroup: [cgroupfs_kernel_memory_usage_bytes cgroupfs_tcp_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_memory_usage_bytes]
I0302 15:26:45.543560 1 utils.go:61] Available cgroup metrics from kubelet: [container_cpu_usage_seconds_total container_memory_working_set_bytes]
I0302 15:26:45.543581 1 utils.go:62] Available I/O metrics: [bytes_read bytes_writes]
I0302 15:26:45.543674 1 model.go:86] Model Config NODE_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.543696 1 lr.go:171] LR Model (AbsModelWeight): no config
I0302 15:26:45.543701 1 model.go:78] Model AbsModelWeight initiated (false)
I0302 15:26:45.543710 1 model.go:86] Model Config NODE_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.544816 1 lr.go:164] LR Model (AbsComponentModelWeight): loadWeightFromURLorLocal(/var/lib/kepler/data/KerasCompWeightFullPipeline.json): map[core:{{17.49994468688965 map[cpu_architecture:map[Alder Lake:{0.5408945679664612} Broadwell:{17.9639892578125} Cascade Lake:{-0.49166440963745117} Coffee Lake:{0.5166589617729187} Haswell:{-0.5789095163345337} Ivy Bridge:{-0.024028241634368896} Sandy Bridge:{0.5239214301109314} Sky Lake:{0.4193417429924011}]] map[cpu_cycles:{6.85713664e+09 7.560771917192364e+18 -0.11352460086345673} cpu_instr:{3.374244864e+09 8.408530291701842e+17 -0.414739191532135} cpu_time:{192019.5 5.2761312e+08 -0.06457684189081192}]}} dram:{{17.49994468688965 map[cpu_architecture:map[Alder Lake:{-0.11559933423995972} Broadwell:{16.972564697265625} Cascade Lake:{0.5505847334861755} Coffee Lake:{-0.4564790725708008} Haswell:{-0.13912856578826904} Ivy Bridge:{-0.018331050872802734} Sandy Bridge:{-0.6695247888565063} Sky Lake:{0.29698115587234497}]] map[cache_miss:{9.329199e+06 2.2145245642752e+13 -0.4680119752883911} container_memory_working_set_bytes:{253952 1.93474854912e+11 0.6805805563926697}]}}]
I0302 15:26:45.544906 1 model.go:78] Model AbsComponentModelWeight initiated (true)
I0302 15:26:45.544919 1 model.go:86] Model Config CONTAINER_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.544927 1 lr.go:171] LR Model (DynModelWeight): no config
I0302 15:26:45.544931 1 model.go:78] Model DynModelWeight initiated (false)
I0302 15:26:45.544954 1 model.go:86] Model Config CONTAINER_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json}
I0302 15:26:45.571424 1 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json): map[dram:{{0.8339537395297559 map[] map[cgroupfs_cpu_usage_us:{6.557768917682226e+06 1.0111459601975505e+14 0.30000764700549093} cgroupfs_memory_usage_bytes:{1.1568395713957231e+07 1.4831755031743084e+14 0.010739137690300415} cgroupfs_system_cpu_usage_us:{320904.6193377788 3.07640381448747e+10 0.2612100015047149} cgroupfs_user_cpu_usage_us:{6.236864298459416e+06 1.0251834339511184e+14 0}]}} pkg:{{24.603798628318298 map[] map[cgroupfs_cpu_usage_us:{6.557768917682226e+06 1.0111459601975505e+14 0} cgroupfs_memory_usage_bytes:{1.1568395713957231e+07 1.4831755031743084e+14 0} cgroupfs_system_cpu_usage_us:{320904.6193377788 3.07640381448747e+10 0} cgroupfs_user_cpu_usage_us:{6.236864298459416e+06 1.0251834339511184e+14 24.50009569917875}]}}]
I0302 15:26:45.571713 1 model.go:78] Model DynComponentModelWeight initiated (true)
I0302 15:26:45.571842 1 model.go:86] Model Config PROCESS_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.572019 1 lr.go:171] LR Model (DynModelWeight): no config
I0302 15:26:45.572119 1 model.go:78] Model DynModelWeight initiated (false)
I0302 15:26:45.572256 1 model.go:86] Model Config PROCESS_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.577089 1 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CounterOnly/ScikitMixed/ScikitMixed.json): map[dram:{{0.8318076441807906 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0.22602678738192125} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 0.146880994775066} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 0}]}} pkg:{{24.388564716241596 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 15.858373957810427} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 8.25749138735891}]}}]
I0302 15:26:45.577314 1 model.go:78] Model DynComponentModelWeight initiated (true)
I0302 15:26:45.577409 1 exporter.go:168] Initializing the GPU collector
I0302 15:26:45.580707 1 acpi.go:75] Using the ACPI power meter path: /sys/class/hwmon/hwmon2/device/
perf_event_open: No such file or directory
I0302 15:26:47.492899 1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0302 15:26:47.493069 1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0302 15:26:47.493216 1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0302 15:26:47.493336 1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0302 15:26:47.493365 1 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=2]
I0302 15:26:47.531205 1 node_energy_collector.go:60] Node components power model collection is supported
I0302 15:26:47.532142 1 exporter.go:210] Started Kepler in 2.175512593s
I0302 15:26:51.778338 1 container_cgroup_collector.go:28] overall cgroup stats &{map[:[{/sys/fs/cgroup/cpu/system.slice} {/sys/fs/cgroup/memory/system.slice} {/sys/fs/cgroup/blkio/system.slice}]] /sys/fs/cgroup/cpu/system.slice /sys/fs/cgroup/memory/system.slice /sys/fs/cgroup/blkio/system.slice}
I0302 15:26:52.094148 1 node_energy_collector.go:60] Node components power model collection is supported
I0302 15:26:52.094461 1 metric_collector.go:137] energy from pod/container (0 active processes): name: owdev-kafkaprovider-69977b75cc-hf27p/kafkaprovider namespace: monitoring
@andersonandrei the logs look like Kepler is running. Can you get the Kepler container metrics, e.g. through this command?
kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics|grep kepler_container_core|sort -k 2 -g"
# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{command="",container_name="alarmprovider",container_namespace="monitoring",mode="idle",pod_name="owdev-alarmprovider-7b6dbf84d9-5x5ls"} 0
kepler_container_core_joules_total{command="",container_name="alertmanager",container_namespace="monitoring",mode="idle",pod_name="alertmanager-prometheus-kube-prometheus-alertmanager-0"} 0
kepler_container_core_joules_total{command="",container_name="apigateway",container_namespace="monitoring",mode="idle",pod_name="owdev-apigateway-6d8b89b6c-zk8rs"} 0
kepler_container_core_joules_total{command="",container_name="autoscaler",container_namespace="kube-system",mode="idle",pod_name="coredns-autoscaler-5589fb5654-hc72l"} 0
kepler_container_core_joules_total{command="",container_name="azure-ip-masq-agent",container_namespace="kube-system",mode="idle",pod_name="azure-ip-masq-agent-5l5gm"} 0
kepler_container_core_joules_total{command="",container_name="azuredisk",container_namespace="kube-system",mode="idle",pod_name="csi-azuredisk-node-k25rk"} 0
kepler_container_core_joules_total{command="",container_name="azurefile",container_namespace="kube-system",mode="idle",pod_name="csi-azurefile-node-7k9qt"} 0
kepler_container_core_joules_total{command="",container_name="cloud-node-manager",container_namespace="kube-system",mode="idle",pod_name="cloud-node-manager-m9t7h"} 0
kepler_container_core_joules_total{command="",container_name="config-reloader",container_namespace="monitoring",mode="idle",pod_name="alertmanager-prometheus-kube-prometheus-alertmanager-0"} 0
kepler_container_core_joules_total{command="",container_name="config-reloader",container_namespace="monitoring",mode="idle",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 0
kepler_container_core_joules_total{command="",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-b4854dd98-shw8j"} 0
kepler_container_core_joules_total{command="",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-b4854dd98-wxvlr"} 0
kepler_container_core_joules_total{command="",container_name="couchdb",container_namespace="monitoring",mode="idle",pod_name="owdev-couchdb-7cf946b654-vkmk8"} 0
kepler_container_core_joules_total{command="",container_name="gen-certs",container_namespace="monitoring",mode="idle",pod_name="owdev-gen-certs-s6cp4"} 0
kepler_container_core_joules_total{command="",container_name="grafana",container_namespace="monitoring",mode="idle",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 0
kepler_container_core_joules_total{command="",container_name="grafana-sc-dashboard",container_namespace="monitoring",mode="idle",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 0
kepler_container_core_joules_total{command="",container_name="grafana-sc-datasources",container_namespace="monitoring",mode="idle",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 0
kepler_container_core_joules_total{command="",container_name="init-config-reloader",container_namespace="monitoring",mode="idle",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 0
kepler_container_core_joules_total{command="",container_name="init-couchdb",container_namespace="monitoring",mode="idle",pod_name="owdev-init-couchdb-9vrb7"} 0
kepler_container_core_joules_total{command="",container_name="init-node",container_namespace="server",mode="idle",pod_name="server-registry-cert-setup-sx25m"} 0
kepler_container_core_joules_total{command="",container_name="init-node",container_namespace="monitoring",mode="idle",pod_name="debug-wk948"} 0
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-6fdwf"} 0
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-gt7gl"} 0
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-rjksd"} 0
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-vr42v"} 0
kepler_container_core_joules_total{command="",container_name="invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-invoker-0"} 0
kepler_container_core_joules_total{command="",container_name="kafka",container_namespace="monitoring",mode="idle",pod_name="owdev-kafka-0"} 0
kepler_container_core_joules_total{command="",container_name="kafkaprovider",container_namespace="monitoring",mode="idle",pod_name="owdev-kafkaprovider-69977b75cc-hf27p"} 0
kepler_container_core_joules_total{command="",container_name="kepler-exporter",container_namespace="monitoring",mode="idle",pod_name="kepler-exporter-8x47p"} 0
kepler_container_core_joules_total{command="",container_name="konnectivity-agent",container_namespace="kube-system",mode="idle",pod_name="konnectivity-agent-6fcc478f7d-z57d2"} 0
kepler_container_core_joules_total{command="",container_name="kube-prometheus-stack",container_namespace="monitoring",mode="idle",pod_name="prometheus-kube-prometheus-operator-5fd846f56-fvjcg"} 0
kepler_container_core_joules_total{command="",container_name="kube-proxy",container_namespace="kube-system",mode="idle",pod_name="kube-proxy-t99c7"} 0
kepler_container_core_joules_total{command="",container_name="kube-proxy-bootstrap",container_namespace="kube-system",mode="idle",pod_name="kube-proxy-t99c7"} 0
kepler_container_core_joules_total{command="",container_name="kube-state-metrics",container_namespace="monitoring",mode="idle",pod_name="prometheus-kube-state-metrics-84b79bbdcf-59vj8"} 0
kepler_container_core_joules_total{command="",container_name="liveness-probe",container_namespace="kube-system",mode="idle",pod_name="csi-azuredisk-node-k25rk"} 0
kepler_container_core_joules_total{command="",container_name="liveness-probe",container_namespace="kube-system",mode="idle",pod_name="csi-azurefile-node-7k9qt"} 0
kepler_container_core_joules_total{command="",container_name="metrics-server",container_namespace="kube-system",mode="idle",pod_name="metrics-server-f77b4cd8-46qs7"} 0
kepler_container_core_joules_total{command="",container_name="metrics-server",container_namespace="kube-system",mode="idle",pod_name="metrics-server-f77b4cd8-54gt6"} 0
kepler_container_core_joules_total{command="",container_name="nginx",container_namespace="monitoring",mode="idle",pod_name="owdev-nginx-857fb7dc66-jtngj"} 0
kepler_container_core_joules_total{command="",container_name="node-driver-registrar",container_namespace="kube-system",mode="idle",pod_name="csi-azuredisk-node-k25rk"} 0
kepler_container_core_joules_total{command="",container_name="node-driver-registrar",container_namespace="kube-system",mode="idle",pod_name="csi-azurefile-node-7k9qt"} 0
kepler_container_core_joules_total{command="",container_name="prometheus",container_namespace="monitoring",mode="idle",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 0
kepler_container_core_joules_total{command="",container_name="promtail",container_namespace="monitoring",mode="idle",pod_name="loki-promtail-wlcxl"} 0
kepler_container_core_joules_total{command="",container_name="rabbitmq",container_namespace="server",mode="idle",pod_name="server-broker-0"} 0
kepler_container_core_joules_total{command="",container_name="redis",container_namespace="monitoring",mode="idle",pod_name="owdev-redis-bc89c877-tblns"} 0
kepler_container_core_joules_total{command="",container_name="redis-init",container_namespace="monitoring",mode="idle",pod_name="owdev-redis-bc89c877-tblns"} 0
kepler_container_core_joules_total{command="",container_name="server-authorization",container_namespace="server",mode="idle",pod_name="server-authorization-fbcbcdbb7-6kqnh"} 0
kepler_container_core_joules_total{command="",container_name="server-authorization-database-migration",container_namespace="server",mode="idle",pod_name="server-authorization-fbcbcdbb7-6kqnh"} 0
kepler_container_core_joules_total{command="",container_name="server-filestore",container_namespace="server",mode="idle",pod_name="server-filestore-6686ffc6-b4sft"} 0
kepler_container_core_joules_total{command="",container_name="server-front",container_namespace="server",mode="idle",pod_name="server-front-559f9b597c-h68bz"} 0
kepler_container_core_joules_total{command="",container_name="server-registry",container_namespace="server",mode="idle",pod_name="server-registry-fb6bbcd75-lcrjd"} 0
kepler_container_core_joules_total{command="",container_name="scaphandre",container_namespace="default",mode="idle",pod_name="scaphandre-slbq8"} 0
kepler_container_core_joules_total{command="",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev-invoker-00-1-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev-invoker-00-14-guest-matmul"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev-invoker-00-15-guest-matmul"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev-invoker-00-2-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev2-invoker-00-1-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev2-invoker-00-2-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_name="wait",container_namespace="server",mode="idle",pod_name="server-registry-cert-setup-sx25m"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="idle",pod_name="owdev-alarmprovider-7b6dbf84d9-5x5ls"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="idle",pod_name="owdev-invoker-0"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="idle",pod_name="owdev-kafkaprovider-69977b75cc-hf27p"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="idle",pod_name="owdev-nginx-857fb7dc66-jtngj"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-couchdb",container_namespace="monitoring",mode="idle",pod_name="owdev-controller-0"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-6fdwf"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-gt7gl"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-rjksd"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-vr42v"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-kafka",container_namespace="monitoring",mode="idle",pod_name="owdev-controller-0"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-zookeeper",container_namespace="monitoring",mode="idle",pod_name="owdev-kafka-0"} 0
kepler_container_core_joules_total{command="",container_name="wskadmin",container_namespace="monitoring",mode="idle",pod_name="owdev-wskadmin"} 0
kepler_container_core_joules_total{command="",container_name="alarmprovider",container_namespace="monitoring",mode="dynamic",pod_name="owdev-alarmprovider-7b6dbf84d9-5x5ls"} 19737.4
kepler_container_core_joules_total{command="",container_name="alertmanager",container_namespace="monitoring",mode="dynamic",pod_name="alertmanager-prometheus-kube-prometheus-alertmanager-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="apigateway",container_namespace="monitoring",mode="dynamic",pod_name="owdev-apigateway-6d8b89b6c-zk8rs"} 19737.4
kepler_container_core_joules_total{command="",container_name="autoscaler",container_namespace="kube-system",mode="dynamic",pod_name="coredns-autoscaler-5589fb5654-hc72l"} 19737.4
kepler_container_core_joules_total{command="",container_name="azure-ip-masq-agent",container_namespace="kube-system",mode="dynamic",pod_name="azure-ip-masq-agent-5l5gm"} 19737.4
kepler_container_core_joules_total{command="",container_name="azuredisk",container_namespace="kube-system",mode="dynamic",pod_name="csi-azuredisk-node-k25rk"} 19737.4
kepler_container_core_joules_total{command="",container_name="azurefile",container_namespace="kube-system",mode="dynamic",pod_name="csi-azurefile-node-7k9qt"} 19737.4
kepler_container_core_joules_total{command="",container_name="cloud-node-manager",container_namespace="kube-system",mode="dynamic",pod_name="cloud-node-manager-m9t7h"} 19737.4
kepler_container_core_joules_total{command="",container_name="config-reloader",container_namespace="monitoring",mode="dynamic",pod_name="alertmanager-prometheus-kube-prometheus-alertmanager-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="config-reloader",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-b4854dd98-shw8j"} 19737.4
kepler_container_core_joules_total{command="",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-b4854dd98-wxvlr"} 19737.4
kepler_container_core_joules_total{command="",container_name="couchdb",container_namespace="monitoring",mode="dynamic",pod_name="owdev-couchdb-7cf946b654-vkmk8"} 19737.4
kepler_container_core_joules_total{command="",container_name="gen-certs",container_namespace="monitoring",mode="dynamic",pod_name="owdev-gen-certs-s6cp4"} 19737.4
kepler_container_core_joules_total{command="",container_name="grafana",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 19737.4
kepler_container_core_joules_total{command="",container_name="grafana-sc-dashboard",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 19737.4
kepler_container_core_joules_total{command="",container_name="grafana-sc-datasources",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 19737.4
kepler_container_core_joules_total{command="",container_name="init-config-reloader",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="init-couchdb",container_namespace="monitoring",mode="dynamic",pod_name="owdev-init-couchdb-9vrb7"} 19737.4
kepler_container_core_joules_total{command="",container_name="init-node",container_namespace="server",mode="dynamic",pod_name="server-registry-cert-setup-sx25m"} 19737.4
kepler_container_core_joules_total{command="",container_name="init-node",container_namespace="monitoring",mode="dynamic",pod_name="debug-wk948"} 19737.4
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-6fdwf"} 19737.4
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-gt7gl"} 19737.4
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-rjksd"} 19737.4
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-vr42v"} 19737.4
kepler_container_core_joules_total{command="",container_name="invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-invoker-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="kafka",container_namespace="monitoring",mode="dynamic",pod_name="owdev-kafka-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="kafkaprovider",container_namespace="monitoring",mode="dynamic",pod_name="owdev-kafkaprovider-69977b75cc-hf27p"} 19737.4
kepler_container_core_joules_total{command="",container_name="kepler-exporter",container_namespace="monitoring",mode="dynamic",pod_name="kepler-exporter-8x47p"} 19737.4
kepler_container_core_joules_total{command="",container_name="konnectivity-agent",container_namespace="kube-system",mode="dynamic",pod_name="konnectivity-agent-6fcc478f7d-z57d2"} 19737.4
kepler_container_core_joules_total{command="",container_name="kube-prometheus-stack",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-kube-prometheus-operator-5fd846f56-fvjcg"} 19737.4
kepler_container_core_joules_total{command="",container_name="kube-proxy",container_namespace="kube-system",mode="dynamic",pod_name="kube-proxy-t99c7"} 19737.4
kepler_container_core_joules_total{command="",container_name="kube-proxy-bootstrap",container_namespace="kube-system",mode="dynamic",pod_name="kube-proxy-t99c7"} 19737.4
kepler_container_core_joules_total{command="",container_name="kube-state-metrics",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-kube-state-metrics-84b79bbdcf-59vj8"} 19737.4
kepler_container_core_joules_total{command="",container_name="liveness-probe",container_namespace="kube-system",mode="dynamic",pod_name="csi-azuredisk-node-k25rk"} 19737.4
kepler_container_core_joules_total{command="",container_name="liveness-probe",container_namespace="kube-system",mode="dynamic",pod_name="csi-azurefile-node-7k9qt"} 19737.4
kepler_container_core_joules_total{command="",container_name="metrics-server",container_namespace="kube-system",mode="dynamic",pod_name="metrics-server-f77b4cd8-46qs7"} 19737.4
kepler_container_core_joules_total{command="",container_name="metrics-server",container_namespace="kube-system",mode="dynamic",pod_name="metrics-server-f77b4cd8-54gt6"} 19737.4
kepler_container_core_joules_total{command="",container_name="nginx",container_namespace="monitoring",mode="dynamic",pod_name="owdev-nginx-857fb7dc66-jtngj"} 19737.4
kepler_container_core_joules_total{command="",container_name="node-driver-registrar",container_namespace="kube-system",mode="dynamic",pod_name="csi-azuredisk-node-k25rk"} 19737.4
kepler_container_core_joules_total{command="",container_name="node-driver-registrar",container_namespace="kube-system",mode="dynamic",pod_name="csi-azurefile-node-7k9qt"} 19737.4
kepler_container_core_joules_total{command="",container_name="prometheus",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="promtail",container_namespace="monitoring",mode="dynamic",pod_name="loki-promtail-wlcxl"} 19737.4
kepler_container_core_joules_total{command="",container_name="rabbitmq",container_namespace="server",mode="dynamic",pod_name="server-broker-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="redis",container_namespace="monitoring",mode="dynamic",pod_name="owdev-redis-bc89c877-tblns"} 19737.4
kepler_container_core_joules_total{command="",container_name="redis-init",container_namespace="monitoring",mode="dynamic",pod_name="owdev-redis-bc89c877-tblns"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-authorization",container_namespace="server",mode="dynamic",pod_name="server-authorization-fbcbcdbb7-6kqnh"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-authorization-database-migration",container_namespace="server",mode="dynamic",pod_name="server-authorization-fbcbcdbb7-6kqnh"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-filestore",container_namespace="server",mode="dynamic",pod_name="server-filestore-6686ffc6-b4sft"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-front",container_namespace="server",mode="dynamic",pod_name="server-front-559f9b597c-h68bz"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-registry",container_namespace="server",mode="dynamic",pod_name="server-registry-fb6bbcd75-lcrjd"} 19737.4
kepler_container_core_joules_total{command="",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev-invoker-00-1-prewarm-nodejs10"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev-invoker-00-14-guest-matmul"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev-invoker-00-15-guest-matmul"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev-invoker-00-2-prewarm-nodejs10"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev2-invoker-00-1-prewarm-nodejs10"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev2-invoker-00-2-prewarm-nodejs10"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait",container_namespace="server",mode="dynamic",pod_name="server-registry-cert-setup-sx25m"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="dynamic",pod_name="owdev-alarmprovider-7b6dbf84d9-5x5ls"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="dynamic",pod_name="owdev-invoker-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="dynamic",pod_name="owdev-kafkaprovider-69977b75cc-hf27p"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="dynamic",pod_name="owdev-nginx-857fb7dc66-jtngj"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-couchdb",container_namespace="monitoring",mode="dynamic",pod_name="owdev-controller-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-6fdwf"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-gt7gl"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-rjksd"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-vr42v"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-kafka",container_namespace="monitoring",mode="dynamic",pod_name="owdev-controller-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-zookeeper",container_namespace="monitoring",mode="dynamic",pod_name="owdev-kafka-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="wskadmin",container_namespace="monitoring",mode="dynamic",pod_name="owdev-wskadmin"} 19737.4
kepler_container_core_joules_total{command="",container_name="scaphandre",container_namespace="default",mode="dynamic",pod_name="scaphandre-slbq8"} 23204.033
@andersonandrei that's a good sign. The Kepler metrics are being created. Can you see them in Prometheus or Grafana?
@rootfs, yes, I can see those metrics in Prometheus and Grafana. But shouldn't I worry about the eBPF errors? They make me wonder whether the estimates could be affected.
I0301 14:42:12.708526 1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548 1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250 1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780 1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866 1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922 1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990 1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
One of the lines above, for example, says: "Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0".
@andersonandrei the errors are benign. Kepler uses different models based on the availability of hardware counters, RAPL, etc. On bare-metal environments where these counters and RAPL are accessible, Kepler uses a ratio-based model. On VMs with no counters or RAPL, Kepler uses regression-based models.
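As a rough way to see which of those inputs a node actually exposes, a hedged sketch (device paths differ between platforms and are often simply absent on cloud VMs):
ls /dev/cpu/*/msr 2>/dev/null || echo "no MSR devices (typical on VMs)"
ls -d /sys/class/powercap/intel-rapl* 2>/dev/null || echo "no RAPL powercap interface"
perf stat -e cycles true 2>&1 | head -n 5   # checks hardware counter access, if perf is installed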
@rootfs, thanks for the details.
About the problem of not seeing metrics for pods deployed after Kepler, do you have any idea, please? I'm running a serverless platform on top of AKS, so for each function that I execute on the platform, a new pod is created. However, Kepler does not export metrics for those pods. For instance, consider the following actions: 1) deploy Kepler, 2) create a new pod, and 3) delete Kepler and deploy it again. If I do 1) and 2), Kepler does not export the new pod's metrics, so I need to do 1), 2), and 3), and even then the problem persists for pods created after 3).
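Written out as a sketch, the sequence I follow is roughly the following (the manifest path and the wsk invocation are placeholders for my actual deployment steps):
kubectl apply -f kepler-exporter.yaml        # 1) deploy Kepler
wsk action invoke linpack --result           # 2) run a function, which creates a new pod
kubectl delete -f kepler-exporter.yaml       # 3) remove Kepler ...
kubectl apply -f kepler-exporter.yaml        #    ... and deploy it again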
Do you see the pod metrics when you query the Kepler metrics endpoint directly (i.e. using curl in the Kepler pod)?
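For example, something along these lines, reusing the port and endpoint from the earlier grep command (the namespace and the pod name are placeholders; replace them with a pod created after Kepler started):
kubectl exec -ti -n monitoring daemonset/kepler-exporter -- \
  bash -c "curl -s localhost:9102/metrics | grep <new-pod-name>"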
Kepler does delete inactive pods from time to time, though. In that case, the long-term storage in Prometheus should be the final source for Kepler metrics.
I'm using the Kepler interface through Prometheus, but I also try queries against the Kepler metrics endpoint with curl from time to time.
In both cases, I can only see the new pods' metrics after steps 1), 2), and 3).
Can you share your pod YAML?
One pod example, wskowdev-invoker-00-43-guest-linpack, is:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2023-03-06T13:48:14Z"
labels:
invoker: invoker0
name: wskowdev-invoker-00-43-guest-linpack
openwhisk/action: linpack
openwhisk/namespace: guest
release: owdev
user-action-pod: "true"
name: wskowdev-invoker-00-43-guest-linpack
namespace: monitoring
resourceVersion: "44115468"
uid: 63c7e294-c6e2-49fe-a4db-9d403fdce033
spec:
containers:
- env:
- name: __OW_API_HOST
value: https://ourserver.io:31001
- name: __OW_ALLOW_CONCURRENT
value: "false"
image: andersonandrei/python3action:linpack
imagePullPolicy: IfNotPresent
name: user-action
ports:
- containerPort: 8080
name: action
protocol: TCP
resources:
limits:
memory: 256Mi
requests:
memory: 256Mi
securityContext:
capabilities:
drop:
- NET_RAW
- NET_ADMIN
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-cvxfp
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: aks-intra-99364876-vmss000000
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
volumes:
- name: kube-api-access-cvxfp
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-03-06T13:48:14Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2023-03-06T13:48:15Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2023-03-06T13:48:15Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2023-03-06T13:48:14Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://fdf5ec08f8821455a548ba5bb6a025a7aaca26fb8806d493bdca060e45557218
image: docker.io/andersonandrei/python3action:linpack
imageID: docker.io/andersonandrei/python3action@sha256:c1292175aa3129f1fa8cec1e39017c8a00a3244cbf3900bb79b4a794a27bbe7e
lastState: {}
name: user-action
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-03-06T13:48:15Z"
hostIP: 10.224.0.4
phase: Running
podIP: 10.244.0.100
podIPs:
- ip: 10.244.0.100
qosClass: Burstable
startTime: "2023-03-06T13:48:14Z"
And the Kepler pod is:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2023-03-03T14:34:00Z"
generateName: kepler-exporter-
labels:
app: kepler-exporter-service
app.kubernetes.io/component: exporter
app.kubernetes.io/name: kepler-exporter
controller-revision-hash: 579564b6b8
pod-template-generation: "1"
sustainable-computing.io/app: kepler
name: kepler-exporter-jq9nb
namespace: monitoring
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: DaemonSet
name: kepler-exporter
uid: 4624edc0-57bf-4824-b0f5-20a1a6584a1f
resourceVersion: "43091573"
uid: 635939c6-3b0e-4902-b339-67907ff2f88d
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- aks-intra-99364876-vmss000001
containers:
- args:
- /usr/bin/kepler -v=1
command:
- /bin/sh
- -c
env:
- name: NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
image: quay.io/sustainable_computing_io/kepler:release-0.4
imagePullPolicy: Always
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: 9102
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 60
successThreshold: 1
timeoutSeconds: 10
name: kepler-exporter
ports:
- containerPort: 9102
name: http
protocol: TCP
resources:
requests:
cpu: 100m
memory: 400Mi
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /lib/modules
name: lib-modules
- mountPath: /sys
name: tracing
- mountPath: /proc
name: proc
- mountPath: /etc/config
name: cfm
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-kks8t
readOnly: true
dnsPolicy: ClusterFirstWithHostNet
enableServiceLinks: true
nodeName: aks-intra-99364876-vmss000001
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: kepler-sa
serviceAccountName: kepler-sa
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/pid-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
operator: Exists
volumes:
- hostPath:
path: /lib/modules
type: Directory
name: lib-modules
- hostPath:
path: /sys
type: Directory
name: tracing
- hostPath:
path: /proc
type: Directory
name: proc
- configMap:
defaultMode: 420
name: kepler-cfm
name: cfm
- name: kube-api-access-kks8t
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-03-03T14:34:00Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2023-03-03T14:34:02Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2023-03-03T14:34:02Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2023-03-03T14:34:00Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://fa495a86a17d5ca721517533fbbb5659d718f942d61ffbbd383a91befbf3a6de
image: quay.io/sustainable_computing_io/kepler:release-0.4
imageID: quay.io/sustainable_computing_io/kepler@sha256:67c34e1ade5f17cc444aa134f7d95b424077af6bc7c05d2ff82d536a3e0a6174
lastState: {}
name: kepler-exporter
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-03-03T14:34:02Z"
hostIP: 10.224.0.5
phase: Running
podIP: 10.244.1.89
podIPs:
- ip: 10.244.1.89
qosClass: Burstable
startTime: "2023-03-03T14:34:00Z"
Thanks!
Hello,
Please, do you have any updates about this issue? Or should I open a new one to discuss the last messages above?
In addition to the message above, I also tried different versions of Kepler by changing the tag at image: quay.io/sustainable_computing_io/kepler:release-0.4, but that did not work either. I tried latest, release-0.4, and v0.3.
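For reference, one way to switch tags, as a sketch (it assumes the DaemonSet and container names from the manifest above):
kubectl -n monitoring set image daemonset/kepler-exporter kepler-exporter=quay.io/sustainable_computing_io/kepler:latest
kubectl -n monitoring rollout status daemonset/kepler-exporter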
Thanks!
@andersonandrei yes, please open a new issue to track this. Thanks
Describe the bug: I'm working with Azure AKS (VMs) and I'm getting errors related to MSR and perf events. Also, even when I see a few metric values, they are mostly 0.
To Reproduce: steps to reproduce the behavior:
I0301 14:42:12.695339 1 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0301 14:42:12.708031 1 exporter.go:150] Kepler running on version: 0d3e6ce
I0301 14:42:12.708117 1 config.go:153] using gCgroup ID in the BPF program: true
I0301 14:42:12.708205 1 config.go:154] kernel version: 5.4
I0301 14:42:12.708271 1 config.go:172] EnabledGPU: true
I0301 14:42:12.708435 1 rapl_msr_util.go:143] failed to open path /dev/cpu/1/msr: no such file or directory
I0301 14:42:12.708526 1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548 1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250 1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780 1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866 1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922 1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990 1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0301 14:42:15.543017 1 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=2]
I0301 14:42:15.601337 1 exporter.go:210] Started Kepler in 2.89332793s