sustainable-computing-io / kepler-model-server

Model Server for Kepler
Apache License 2.0

No data when I deploy on OpenShift #97

Open feven-NIT opened 1 year ago

feven-NIT commented 1 year ago

Describe the bug

I'm trying to install Kepler on OpenShift (4.12, Kubernetes 1.25), but I get no data in Grafana, and when I look at /metrics, most of the values are zero.
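To put a number on "most of the values are zero", a minimal sketch along these lines can count zero-valued samples in a saved copy of the /metrics output. It assumes the endpoint returns the standard Prometheus text exposition format and that the metrics of interest share a `kepler_` name prefix; the function name `zero_fraction` and the sample lines in `SAMPLE` are illustrative and not taken from this cluster.

```python
def zero_fraction(metrics_text, prefix="kepler_"):
    """Return (zero_samples, total_samples) for sample lines whose name starts with prefix."""
    zero = total = 0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skip blanks, HELP/TYPE comments, and metrics outside the prefix.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        # A sample line is: name{labels} value [timestamp]
        # Label values may contain spaces, so read the value after the last "}".
        if "}" in line:
            rest = line[line.rindex("}") + 1:]
        else:
            parts = line.split(None, 1)
            rest = parts[1] if len(parts) > 1 else ""
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # not a parseable sample, ignore it
        total += 1
        if value == 0:
            zero += 1
    return zero, total


# Illustrative input only; these lines are assumptions, not real output.
SAMPLE = """\
# TYPE kepler_container_joules_total counter
kepler_container_joules_total{container_name="a"} 0
kepler_container_joules_total{container_name="b"} 12.5
kepler_node_example_total 0
go_goroutines 42
"""

if __name__ == "__main__":
    zeros, total = zero_fraction(SAMPLE)
    print(f"{zeros} of {total} samples are zero")
```

In practice the text passed to `zero_fraction` would be whatever the exporter's /metrics endpoint returned, saved to a file or fetched over HTTP.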

Here are the logs from the Kepler exporter:

I0523 10:03:56.383335       1 exporter.go:148] Kepler running on version: dbbfa38
I0523 10:03:56.383353       1 config.go:172] using gCgroup ID in the BPF program: true
I0523 10:03:56.383388       1 config.go:174] kernel version: 4.18
I0523 10:03:56.383412       1 exporter.go:161] EnabledBPFBatchDelete: true
I0523 10:03:56.383423       1 config.go:113] ENABLE_EBPF_CGROUPID: true
I0523 10:03:56.383428       1 config.go:114] ENABLE_GPU: true
I0523 10:03:56.383433       1 config.go:115] ENABLE_PROCESS_METRICS: false
I0523 10:03:56.383438       1 config.go:116] EXPOSE_HW_COUNTER_METRICS: true
I0523 10:03:56.383443       1 config.go:117] EXPOSE_CGROUP_METRICS: true
I0523 10:03:56.383448       1 config.go:118] EXPOSE_KUBELET_METRICS: true
I0523 10:03:56.383454       1 config.go:119] EXPOSE_IRQ_COUNTER_METRICS: true
I0523 10:03:56.383528       1 power.go:77] Not able to obtain power, use estimate method
I0523 10:03:56.383538       1 bcc_attacher.go:165] hardeware counter metrics config true
I0523 10:03:56.383546       1 bcc_attacher.go:183] irq counter metrics config true
I0523 10:03:56.410699       1 utils.go:56] Available ebpf metrics: [cpu_time irq_net_tx irq_net_rx irq_block]
I0523 10:03:56.410731       1 utils.go:57] Available counter metrics: [cpu_cycles cpu_ref_cycles cpu_instr cache_miss]
I0523 10:03:56.410738       1 utils.go:58] Available cgroup metrics from cgroup: [cgroupfs_memory_usage_bytes cgroupfs_kernel_memory_usage_bytes cgroupfs_tcp_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_ioread_bytes cgroupfs_iowrite_bytes block_devices_used]
I0523 10:03:56.410766       1 utils.go:59] Available cgroup metrics from kubelet: [container_cpu_usage_seconds_total container_memory_working_set_bytes]
I0523 10:03:56.410849       1 model.go:87] Model Config NODE_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0523 10:03:56.419837       1 lr.go:159] LR Model (AbsModelWeight): getWeightFromServer: <nil>
I0523 10:03:56.419854       1 lr.go:173] LR Model (AbsModelWeight): status not ok: 400 BAD REQUEST ({ [cpu_time irq_net_tx irq_net_rx irq_block cpu_cycles cpu_ref_cycles cpu_instr cache_miss cgroupfs_memory_usage_bytes cgroupfs_kernel_memory_usage_bytes cgroupfs_tcp_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_ioread_bytes cgroupfs_iowrite_bytes block_devices_used container_cpu_usage_seconds_total container_memory_working_set_bytes block_devices_used cpu_architecture]  AbsModelWeight})
I0523 10:03:56.419864       1 model.go:79] Model AbsModelWeight initiated (false)
I0523 10:03:56.419874       1 model.go:87] Model Config NODE_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0523 10:03:56.423301       1 lr.go:159] LR Model (AbsComponentModelWeight): getWeightFromServer: <nil>
I0523 10:03:56.423674       1 lr.go:164] LR Model (AbsComponentModelWeight): loadWeightFromURLorLocal(/var/lib/kepler/data/KerasCompWeightFullPipeline.json): map[core:{{17.49994468688965 map[cpu_architecture:map[Alder Lake:{0.5408945679664612} Broadwell:{17.9639892578125} Cascade Lake:{-0.49166440963745117} Coffee Lake:{0.5166589617729187} Haswell:{-0.5789095163345337} Ivy Bridge:{-0.024028241634368896} Sandy Bridge:{0.5239214301109314} Sky Lake:{0.4193417429924011}]] map[cpu_cycles:{6.85713664e+09 7.560771917192364e+18 -0.11352460086345673} cpu_instr:{3.374244864e+09 8.408530291701842e+17 -0.414739191532135} cpu_time:{192019.5 5.2761312e+08 -0.06457684189081192}]}} dram:{{17.49994468688965 map[cpu_architecture:map[Alder Lake:{-0.11559933423995972} Broadwell:{16.972564697265625} Cascade Lake:{0.5505847334861755} Coffee Lake:{-0.4564790725708008} Haswell:{-0.13912856578826904} Ivy Bridge:{-0.018331050872802734} Sandy Bridge:{-0.6695247888565063} Sky Lake:{0.29698115587234497}]] map[cache_miss:{9.329199e+06 2.2145245642752e+13 -0.4680119752883911} container_memory_working_set_bytes:{253952 1.93474854912e+11 0.6805805563926697}]}}]
I0523 10:03:56.423768       1 model.go:79] Model AbsComponentModelWeight initiated (true)
I0523 10:03:56.423782       1 model.go:87] Model Config CONTAINER_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0523 10:03:56.427059       1 lr.go:159] LR Model (DynModelWeight): getWeightFromServer: <nil>
I0523 10:03:56.427073       1 lr.go:173] LR Model (DynModelWeight): status not ok: 400 BAD REQUEST ({ [cpu_time irq_net_tx irq_net_rx irq_block cpu_cycles cpu_ref_cycles cpu_instr cache_miss cgroupfs_memory_usage_bytes cgroupfs_kernel_memory_usage_bytes cgroupfs_tcp_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_ioread_bytes cgroupfs_iowrite_bytes block_devices_used container_cpu_usage_seconds_total container_memory_working_set_bytes block_devices_used cpu_architecture]  DynModelWeight})
I0523 10:03:56.427082       1 model.go:79] Model DynModelWeight initiated (false)
I0523 10:03:56.427091       1 model.go:87] Model Config CONTAINER_COMPONENTS: {UseEstimatorSidecar:true SelectedModel: SelectFilter: InitModelURL:https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed.zip}
I0523 10:03:56.427211       1 estimate.go:101] dial error: dial unix /tmp/estimator.sock: connect: no such file or directory
I0523 10:03:56.427228       1 model.go:61] Model DynComponentPower initiated (false)
I0523 10:03:56.427236       1 model.go:87] Model Config PROCESS_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0523 10:03:56.430278       1 lr.go:159] LR Model (DynModelWeight): getWeightFromServer: <nil>
I0523 10:03:56.430290       1 lr.go:173] LR Model (DynModelWeight): status not ok: 400 BAD REQUEST ({ [cpu_time irq_net_tx irq_net_rx irq_block cpu_cycles cpu_ref_cycles cpu_instr cache_miss cgroupfs_memory_usage_bytes cgroupfs_kernel_memory_usage_bytes cgroupfs_tcp_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_ioread_bytes cgroupfs_iowrite_bytes block_devices_used container_cpu_usage_seconds_total container_memory_working_set_bytes block_devices_used cpu_architecture]  DynModelWeight})
I0523 10:03:56.430296       1 model.go:79] Model DynModelWeight initiated (false)
I0523 10:03:56.430308       1 model.go:87] Model Config PROCESS_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0523 10:03:56.433609       1 lr.go:159] LR Model (DynComponentModelWeight): getWeightFromServer: <nil>
I0523 10:03:56.719544       1 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CounterOnly/ScikitMixed/ScikitMixed.json): map[dram:{{0.8318076441807906 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0.22602678738192125} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 0.146880994775066} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 0}]}} pkg:{{24.388564716241596 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 15.858373957810427} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 8.25749138735891}]}}]
I0523 10:03:56.719647       1 model.go:79] Model DynComponentModelWeight initiated (true)
I0523 10:03:56.719660       1 model.go:87] Model Config PROCESS_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0523 10:03:56.724571       1 lr.go:159] LR Model (DynModelWeight): getWeightFromServer: <nil>
I0523 10:03:56.724628       1 lr.go:173] LR Model (DynModelWeight): status not ok: 400 BAD REQUEST ({ [cpu_time irq_net_tx irq_net_rx irq_block cpu_cycles cpu_ref_cycles cpu_instr cache_miss cgroupfs_memory_usage_bytes cgroupfs_kernel_memory_usage_bytes cgroupfs_tcp_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_ioread_bytes cgroupfs_iowrite_bytes block_devices_used container_cpu_usage_seconds_total container_memory_working_set_bytes block_devices_used cpu_architecture]  DynModelWeight})
I0523 10:03:56.724643       1 model.go:79] Model DynModelWeight initiated (false)
I0523 10:03:56.724654       1 model.go:87] Model Config PROCESS_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0523 10:03:56.728452       1 lr.go:159] LR Model (DynComponentModelWeight): getWeightFromServer: <nil>
I0523 10:03:56.736924       1 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CounterOnly/ScikitMixed/ScikitMixed.json): map[dram:{{0.8318076441807906 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0.22602678738192125} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 0.146880994775066} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 0}]}} pkg:{{24.388564716241596 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 15.858373957810427} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 8.25749138735891}]}}]
I0523 10:03:56.736959       1 model.go:79] Model DynComponentModelWeight initiated (true)
I0523 10:03:56.736970       1 exporter.go:174] Initializing the GPU collector
I0523 10:03:56.737214       1 acpi.go:75] Using the ACPI power meter path: /sys/class/hwmon/hwmon2/device/
perf_event_open: No such file or directory
I0523 10:03:57.711848       1 bcc_attacher.go:106] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0523 10:03:57.712006       1 bcc_attacher.go:106] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0523 10:03:57.712128       1 bcc_attacher.go:106] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0523 10:03:57.712268       1 bcc_attacher.go:106] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0523 10:03:57.712306       1 bcc_attacher.go:153] Successfully load eBPF module with option: [-DNUM_CPUS=4 -DSET_GROUP_ID]
I0523 10:03:57.754659       1 node_energy_collector.go:60] Node components power model collection is supported
I0523 10:03:57.754930       1 exporter.go:218] Started Kepler in 1.371618738s
I0523 10:04:00.860300       1 container_hc_collector.go:134] failed to resolve container for cGroup ID 4294981512: process is not in a kubernetes pod, set containerID=system_processes
I0523 10:04:00.993350       1 container_hc_collector.go:134] failed to resolve container for cGroup ID 4294974871: process is not in a kubernetes pod, set containerID=system_processes
I0523 10:04:01.019998       1 container_hc_collector.go:134] failed to resolve container for cGroup ID 4294975712: process is not in a kubernetes pod, set containerID=system_processes
I0523 10:04:01.102511       1 container_hc_collector.go:134] failed to resolve container for cGroup ID 4295033611: process is not in a kubernetes pod, set containerID=system_processes
I0523 10:04:01.176964       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.177757       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.177767       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.178590       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179514       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179528       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179532       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179535       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179537       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179539       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179543       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179546       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.179548       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.181371       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.185914       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.188130       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.189089       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.189100       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.189955       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.189966       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.191620       1 container_cgroup_collector.go:38] Error: could not start cgroup stat handler for PID: 309259
I0523 10:04:01.194300       1 container_cgroup_collector.go:38] Error: could not start cgroup stat handler for PID: 309271
I0523 10:04:01.195116       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.198452       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.201074       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.201085       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.201091       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.201096       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.201980       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.201991       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.201996       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.202870       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.204190       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.205115       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.205174       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.211898       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.212711       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.212722       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.218457       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.220914       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.221825       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.221836       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.221839       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.226704       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.226714       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.226717       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.226720       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.229332       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.233212       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.233225       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.233228       1 container_cgroup_collector.go:32] PID 0 does not have cgroup metrics since there is no /proc/0/cgroup
I0523 10:04:01.240719       1 container_cgroup_collector.go:62] Kubelet Read: map[openshift-apiserver/apiserver-c4db8446d-zr7qr/openshift-apiserver:132.57998 openshift-apiserver/apiserver-c4db8446d-zr7qr/openshift-apiserver-check-endpoints:21.056607 openshift-authentication-operator/authentication-operator-559895d8c-t5slk/authentication-operator:168.791428 openshift-authentication/oauth-openshift-685c7fb97c-8vwkd/oauth-openshift:38.823024 openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-8588448f96-z6222/cluster-cloud-controller-manager:3.80464 openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-8588448f96-z6222/config-sync-controllers:5.076573 openshift-cloud-credential-operator/cloud-credential-operator-7c5fbcd9f9-9ngkl/cloud-credential-operator:16.925303 openshift-cloud-credential-operator/cloud-credential-operator-7c5fbcd9f9-9ngkl/kube-rbac-proxy:1.468138 openshift-cloud-credential-operator/pod-identity-webhook-5f5ccf8564-bpl5l/pod-identity-webhook:3.460686 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/attacher-kube-rbac-proxy:1.325304 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-attacher:2.064806 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-driver:2.462634 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-liveness-probe:2.007424 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-provisioner:4.877118 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-resizer:4.996134 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-snapshotter:1.928026 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/driver-kube-rbac-proxy:1.309516 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/provisioner-kube-rbac-proxy:1.417524 
openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/resizer-kube-rbac-proxy:1.317635 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/snapshotter-kube-rbac-proxy:1.317931 openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-wlppl/csi-driver:1.305996 openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-wlppl/csi-liveness-probe:1.854794 openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-wlppl/csi-node-driver-registrar:0.21039 openshift-cluster-csi-drivers/aws-ebs-csi-driver-operator-74dbf4db75-cb2jf/aws-ebs-csi-driver-operator:40.764774 openshift-cluster-node-tuning-operator/tuned-96g9n/tuned:16.377223 openshift-cluster-samples-operator/cluster-samples-operator-9cf74f8bc-m2x9c/cluster-samples-operator:9.377946 openshift-cluster-samples-operator/cluster-samples-operator-9cf74f8bc-m2x9c/cluster-samples-operator-watch:6.05407 openshift-cluster-storage-operator/csi-snapshot-controller-56fd94fb5f-9kdld/snapshot-controller:1.479484 openshift-cluster-storage-operator/csi-snapshot-webhook-6cd6d6996-cg7zd/webhook:1.613762 openshift-console-operator/console-operator-7569c48cbc-cb5xb/console-operator:69.223751 openshift-console-operator/console-operator-7569c48cbc-cb5xb/conversion-webhook-server:11.486811 openshift-console/console-84fd89fc79-9zntv/console:16.970025 openshift-controller-manager/controller-manager-9d996b467-kwqt5/controller-manager:4.268512 openshift-dns/dns-default-dbr8c/dns:24.869909 openshift-dns/dns-default-dbr8c/kube-rbac-proxy:1.334247 openshift-dns/node-resolver-jvtfq/dns-node-resolver:2.932378 openshift-etcd-operator/etcd-operator-7765f88568-bhmbf/etcd-operator:137.630761 openshift-etcd/etcd-guard-ip-10-0-250-156.us-east-2.compute.internal/guard:0.024585 openshift-etcd/etcd-ip-10-0-250-156.us-east-2.compute.internal/etcd:921.646186 openshift-etcd/etcd-ip-10-0-250-156.us-east-2.compute.internal/etcd-metrics:109.529524 
openshift-etcd/etcd-ip-10-0-250-156.us-east-2.compute.internal/etcd-readyz:93.266598 openshift-etcd/etcd-ip-10-0-250-156.us-east-2.compute.internal/etcdctl:0.018714 openshift-image-registry/node-ca-qs6zl/node-ca:3.695663 openshift-kube-apiserver-operator/kube-apiserver-operator-5455b878cb-sqv4x/kube-apiserver-operator:232.442618 openshift-kube-apiserver/kube-apiserver-guard-ip-10-0-250-156.us-east-2.compute.internal/guard:0.024472 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver:757.854705 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver-cert-regeneration-controller:0.953418 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver-cert-syncer:2.650122 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver-check-endpoints:12.592517 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver-insecure-readyz:0.322722 openshift-kube-controller-manager/kube-controller-manager-guard-ip-10-0-250-156.us-east-2.compute.internal/guard:0.030397 openshift-kube-controller-manager/kube-controller-manager-ip-10-0-250-156.us-east-2.compute.internal/cluster-policy-controller:25.916049 openshift-kube-controller-manager/kube-controller-manager-ip-10-0-250-156.us-east-2.compute.internal/kube-controller-manager:187.308156 openshift-kube-controller-manager/kube-controller-manager-ip-10-0-250-156.us-east-2.compute.internal/kube-controller-manager-cert-syncer:8.090033 openshift-kube-controller-manager/kube-controller-manager-ip-10-0-250-156.us-east-2.compute.internal/kube-controller-manager-recovery-controller:14.495521 openshift-kube-scheduler/openshift-kube-scheduler-guard-ip-10-0-250-156.us-east-2.compute.internal/guard:0.02667 openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-250-156.us-east-2.compute.internal/kube-scheduler:48.052504 
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-250-156.us-east-2.compute.internal/kube-scheduler-cert-syncer:6.839948 openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-250-156.us-east-2.compute.internal/kube-scheduler-recovery-controller:2.17563 openshift-machine-config-operator/machine-config-controller-855d79c994-hrlcm/machine-config-controller:16.654667 openshift-machine-config-operator/machine-config-controller-855d79c994-hrlcm/oauth-proxy:3.759957 openshift-machine-config-operator/machine-config-daemon-5wrfw/machine-config-daemon:4.672102 openshift-machine-config-operator/machine-config-daemon-5wrfw/oauth-proxy:3.246898 openshift-machine-config-operator/machine-config-server-rfsqc/machine-config-server:10.492191 openshift-marketplace/community-operators-r2d27/registry-server:166.69901 openshift-marketplace/marketplace-operator-85766d6dd7-hbzq6/marketplace-operator:13.450753 openshift-marketplace/redhat-marketplace-mqsc2/registry-server:150.220435 openshift-marketplace/redhat-operators-j8rrx/registry-server:79.640585 openshift-monitoring/node-exporter-hsv74/kube-rbac-proxy:4.304995 openshift-monitoring/node-exporter-hsv74/node-exporter:50.62079 openshift-monitoring/prometheus-operator-cc85dbb9d-z4c6c/kube-rbac-proxy:3.505821 openshift-monitoring/prometheus-operator-cc85dbb9d-z4c6c/prometheus-operator:9.853576 openshift-multus/multus-additional-cni-plugins-wf7wx/kube-multus-additional-cni-plugins:0.031947 openshift-multus/multus-admission-controller-fd996498f-v9slt/kube-rbac-proxy:1.300067 openshift-multus/multus-admission-controller-fd996498f-v9slt/multus-admission-controller:4.75867 openshift-multus/multus-f78p6/kube-multus:128.749864 openshift-multus/network-metrics-daemon-rz2rv/kube-rbac-proxy:2.32394 openshift-multus/network-metrics-daemon-rz2rv/network-metrics-daemon:5.82583 openshift-network-diagnostics/network-check-target-tp48m/network-check-target-container:0.525592 
openshift-network-operator/network-operator-766956b564-rxr4c/network-operator:65.025713 openshift-oauth-apiserver/apiserver-775f4b8b55-767mt/oauth-apiserver:256.136645 openshift-operator-lifecycle-manager/packageserver-57d456d546-9g7ws/packageserver:194.012641 openshift-route-controller-manager/route-controller-manager-5f8b5f6ccd-984p7/route-controller-manager:4.038294 openshift-sdn/sdn-controller-4tbbh/kube-rbac-proxy:1.275303 openshift-sdn/sdn-controller-4tbbh/sdn-controller:1.506999 openshift-sdn/sdn-w2kws/kube-rbac-proxy:1.387604 openshift-sdn/sdn-w2kws/sdn:80.785461 openshift-service-ca/service-ca-5556ff5b86-r6nhj/service-ca-controller:39.395993 system/system_processes:5402.717384], map[openshift-apiserver/apiserver-c4db8446d-zr7qr/openshift-apiserver:2.51510784e+08 openshift-apiserver/apiserver-c4db8446d-zr7qr/openshift-apiserver-check-endpoints:6.5191936e+07 openshift-authentication-operator/authentication-operator-559895d8c-t5slk/authentication-operator:1.73248512e+08 openshift-authentication/oauth-openshift-685c7fb97c-8vwkd/oauth-openshift:6.9844992e+07 openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-8588448f96-z6222/cluster-cloud-controller-manager:4.7230976e+07 openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-8588448f96-z6222/config-sync-controllers:5.3948416e+07 openshift-cloud-credential-operator/cloud-credential-operator-7c5fbcd9f9-9ngkl/cloud-credential-operator:1.42901248e+08 openshift-cloud-credential-operator/cloud-credential-operator-7c5fbcd9f9-9ngkl/kube-rbac-proxy:1.8087936e+07 openshift-cloud-credential-operator/pod-identity-webhook-5f5ccf8564-bpl5l/pod-identity-webhook:3.0380032e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/attacher-kube-rbac-proxy:2.043904e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-attacher:4.4417024e+07 
openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-driver:2.7418624e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-liveness-probe:1.6818176e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-provisioner:5.4407168e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-resizer:6.4331776e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/csi-snapshotter:4.2754048e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/driver-kube-rbac-proxy:2.0062208e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/provisioner-kube-rbac-proxy:2.277376e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/resizer-kube-rbac-proxy:1.9222528e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-5c7d75767d-96wsv/snapshotter-kube-rbac-proxy:1.9795968e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-wlppl/csi-driver:4.0681472e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-wlppl/csi-liveness-probe:2.357248e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-wlppl/csi-node-driver-registrar:1.8018304e+07 openshift-cluster-csi-drivers/aws-ebs-csi-driver-operator-74dbf4db75-cb2jf/aws-ebs-csi-driver-operator:1.099776e+08 openshift-cluster-node-tuning-operator/tuned-96g9n/tuned:6.98368e+07 openshift-cluster-samples-operator/cluster-samples-operator-9cf74f8bc-m2x9c/cluster-samples-operator:7.0946816e+07 openshift-cluster-samples-operator/cluster-samples-operator-9cf74f8bc-m2x9c/cluster-samples-operator-watch:2.537472e+07 openshift-cluster-storage-operator/csi-snapshot-controller-56fd94fb5f-9kdld/snapshot-controller:3.3857536e+07 openshift-cluster-storage-operator/csi-snapshot-webhook-6cd6d6996-cg7zd/webhook:3.4480128e+07 openshift-console-operator/console-operator-7569c48cbc-cb5xb/console-operator:1.1073536e+08 
openshift-console-operator/console-operator-7569c48cbc-cb5xb/conversion-webhook-server:2.5858048e+07 openshift-console/console-84fd89fc79-9zntv/console:7.4952704e+07 openshift-controller-manager/controller-manager-9d996b467-kwqt5/controller-manager:5.011456e+07 openshift-dns/dns-default-dbr8c/dns:7.604224e+07 openshift-dns/dns-default-dbr8c/kube-rbac-proxy:2.0221952e+07 openshift-dns/node-resolver-jvtfq/dns-node-resolver:7.999488e+06 openshift-etcd-operator/etcd-operator-7765f88568-bhmbf/etcd-operator:1.25865984e+08 openshift-etcd/etcd-guard-ip-10-0-250-156.us-east-2.compute.internal/guard:880640 openshift-etcd/etcd-ip-10-0-250-156.us-east-2.compute.internal/etcd:1.539158016e+09 openshift-etcd/etcd-ip-10-0-250-156.us-east-2.compute.internal/etcd-metrics:4.450304e+07 openshift-etcd/etcd-ip-10-0-250-156.us-east-2.compute.internal/etcd-readyz:7.4776576e+07 openshift-etcd/etcd-ip-10-0-250-156.us-east-2.compute.internal/etcdctl:880640 openshift-image-registry/node-ca-qs6zl/node-ca:1.712128e+06 openshift-kube-apiserver-operator/kube-apiserver-operator-5455b878cb-sqv4x/kube-apiserver-operator:1.6461824e+08 openshift-kube-apiserver/kube-apiserver-guard-ip-10-0-250-156.us-east-2.compute.internal/guard:864256 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver:2.409766912e+09 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver-cert-regeneration-controller:2.584576e+07 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver-cert-syncer:3.4316288e+07 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver-check-endpoints:6.014976e+07 openshift-kube-apiserver/kube-apiserver-ip-10-0-250-156.us-east-2.compute.internal/kube-apiserver-insecure-readyz:2.695168e+07 openshift-kube-controller-manager/kube-controller-manager-guard-ip-10-0-250-156.us-east-2.compute.internal/guard:864256 
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-250-156.us-east-2.compute.internal/cluster-policy-controller:7.7406208e+07 openshift-kube-controller-manager/kube-controller-manager-ip-10-0-250-156.us-east-2.compute.internal/kube-controller-manager:3.19586304e+08 openshift-kube-controller-manager/kube-controller-manager-ip-10-0-250-156.us-east-2.compute.internal/kube-controller-manager-cert-syncer:6.1550592e+07 openshift-kube-controller-manager/kube-controller-manager-ip-10-0-250-156.us-east-2.compute.internal/kube-controller-manager-recovery-controller:5.5537664e+07 openshift-kube-scheduler/openshift-kube-scheduler-guard-ip-10-0-250-156.us-east-2.compute.internal/guard:872448 openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-250-156.us-east-2.compute.internal/kube-scheduler:8.9739264e+07 openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-250-156.us-east-2.compute.internal/kube-scheduler-cert-syncer:5.9588608e+07 openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-250-156.us-east-2.compute.internal/kube-scheduler-recovery-controller:2.8925952e+07 openshift-machine-config-operator/machine-config-controller-855d79c994-hrlcm/machine-config-controller:7.1213056e+07 openshift-machine-config-operator/machine-config-controller-855d79c994-hrlcm/oauth-proxy:2.4711168e+07 openshift-machine-config-operator/machine-config-daemon-5wrfw/machine-config-daemon:1.14659328e+08 openshift-machine-config-operator/machine-config-daemon-5wrfw/oauth-proxy:4.186112e+07 openshift-machine-config-operator/machine-config-server-rfsqc/machine-config-server:3.4230272e+07 openshift-marketplace/community-operators-r2d27/registry-server:7.4698752e+07 openshift-marketplace/marketplace-operator-85766d6dd7-hbzq6/marketplace-operator:8.1764352e+07 openshift-marketplace/redhat-marketplace-mqsc2/registry-server:1.67768064e+08 openshift-marketplace/redhat-operators-j8rrx/registry-server:4.601856e+07 openshift-monitoring/node-exporter-hsv74/kube-rbac-proxy:2.271232e+07 
openshift-monitoring/node-exporter-hsv74/node-exporter:3.3349632e+07 openshift-monitoring/prometheus-operator-cc85dbb9d-z4c6c/kube-rbac-proxy:1.9427328e+07 openshift-monitoring/prometheus-operator-cc85dbb9d-z4c6c/prometheus-operator:1.03907328e+08 openshift-multus/multus-additional-cni-plugins-wf7wx/kube-multus-additional-cni-plugins:991232 openshift-multus/multus-admission-controller-fd996498f-v9slt/kube-rbac-proxy:1.9869696e+07 openshift-multus/multus-admission-controller-fd996498f-v9slt/multus-admission-controller:4.5645824e+07 openshift-multus/multus-f78p6/kube-multus:3.3189888e+07 openshift-multus/network-metrics-daemon-rz2rv/kube-rbac-proxy:1.9529728e+07 openshift-multus/network-metrics-daemon-rz2rv/network-metrics-daemon:4.8451584e+07 openshift-network-diagnostics/network-check-target-tp48m/network-check-target-container:2.0414464e+07 openshift-network-operator/network-operator-766956b564-rxr4c/network-operator:1.86703872e+08 openshift-oauth-apiserver/apiserver-775f4b8b55-767mt/oauth-apiserver:1.12467968e+08 openshift-operator-lifecycle-manager/packageserver-57d456d546-9g7ws/packageserver:2.49126912e+08 openshift-route-controller-manager/route-controller-manager-5f8b5f6ccd-984p7/route-controller-manager:4.3249664e+07 openshift-sdn/sdn-controller-4tbbh/kube-rbac-proxy:2.7512832e+07 openshift-sdn/sdn-controller-4tbbh/sdn-controller:5.0946048e+07 openshift-sdn/sdn-w2kws/kube-rbac-proxy:3.4435072e+07 openshift-sdn/sdn-w2kws/sdn:1.28462848e+08 openshift-service-ca/service-ca-5556ff5b86-r6nhj/service-ca-controller:1.5030272e+08 system/system_processes:5.39226112e+09]
I0523 10:04:01.241490       1 node_energy_collector.go:60] Node components power model collection is supported
I0523 10:04:01.241865       1 container_power.go:105] No ContainerComponentPower Model
I0523 10:04:01.241879       1 metric_collector.go:137] energy from pod/container (0 active processes): name: etcd-guard-ip-10-0-250-156.us-east-2.compute.internal/guard namespace: openshift-etcd 
    cgrouppid: 0 pid: [] comm: 
    Dyn ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) 
    Idle ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) 
    CPUTime:  0 (0)
    NetTX IRQ: 0 (0)
    NetRX IRQ: 0 (0)
    Block IRQ: 0 (0)

Here is an example of the values in /metrics:

# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{command="",container_id="0cb78fa49d00fe8e8448f52be01f3ff0474ea036069e78a94bae3507ef43bc2f",container_name="bond-cni-plugin",container_namespace="openshift-multus",mode="dynamic",pod_name="multus-additional-cni-plugins-m94zz"} 0
kepler_container_core_joules_total{command="",container_id="0cb78fa49d00fe8e8448f52be01f3ff0474ea036069e78a94bae3507ef43bc2f",container_name="bond-cni-plugin",container_namespace="openshift-multus",mode="idle",pod_name="multus-additional-cni-plugins-m94zz"} 0
kepler_container_core_joules_total{command="",container_id="21479574acd3fcf4d41b431b3afb573877b04f58afcf50045bf7127ae3bd83ed",container_name="init-config-reloader",container_namespace="openshift-monitoring",mode="dynamic",pod_name="prometheus-k8s-1"} 0
kepler_container_core_joules_total{command="",container_id="21479574acd3fcf4d41b431b3afb573877b04f58afcf50045bf7127ae3bd83ed",container_name="init-config-reloader",container_namespace="openshift-monitoring",mode="idle",pod_name="prometheus-k8s-1"} 0
kepler_container_core_joules_total{command="",container_id="44ae0256863adbbc3e1d9d9fc9fa1abeff012b56a6562a6712fa5f5a124d7622",container_name="kube-multus-additional-cni-plugins",container_namespace="openshift-multus",mode="dynamic",pod_name="multus-additional-cni-plugins-m94zz"} 0
kepler_container_core_joules_total{command="",container_id="44ae0256863adbbc3e1d9d9fc9fa1abeff012b56a6562a6712fa5f5a124d7622",container_name="kube-multus-additional-cni-plugins",container_namespace="openshift-multus",mode="idle",pod_name="multus-additional-cni-plugins-m94zz"} 0
kepler_container_core_joules_total{command="",container_id="92315f6b50e6c78cdb7a76883e196ff004adef793196d9092d7d524fc6d710b1",container_name="egress-router-binary-copy",container_namespace="openshift-multus",mode="dynamic",pod_name="multus-additional-cni-plugins-m94zz"} 0
kepler_container_core_joules_total{command="",container_id="92315f6b50e6c78cdb7a76883e196ff004adef793196d9092d7d524fc6d710b1",container_name="egress-router-binary-copy",container_namespace="openshift-multus",mode="idle",pod_name="multus-additional-cni-plugins-m94zz"} 0
kepler_container_core_joules_total{command="",container_id="a8fdff7459dc2d84eeeff84232dc1182fffd600b5474b992c9ca2e1e8865973f",container_name="cni-plugins",container_namespace="openshift-multus",mode="dynamic",pod_name="multus-additional-cni-plugins-m94zz"} 0
kepler_container_core_joules_total{command="",container_id="a8fdff7459dc2d84eeeff84232dc1182fffd600b5474b992c9ca2e1e8865973f",container_name="cni-plugins",container_namespace="openshift-multus",mode="idle",pod_name="multus-additional-cni-plugins-m94zz"} 0
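To check how widespread the zeros are, rather than eyeballing the /metrics output, something like the following could help. It is only a rough sketch: the function name `zero_ratio_by_metric` and the `kepler_container_` prefix filter are made up for illustration, and the sample text is an abbreviated copy of the lines above (most labels dropped).

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text format:
#   name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)

def zero_ratio_by_metric(metrics_text, prefix="kepler_container_"):
    """Return {metric_name: (zero_samples, total_samples)} for matching metrics."""
    counts = defaultdict(lambda: [0, 0])
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if m is None or not m.group("name").startswith(prefix):
            continue
        try:
            value = float(m.group("value"))
        except ValueError:
            continue
        entry = counts[m.group("name")]
        entry[1] += 1
        if value == 0.0:
            entry[0] += 1
    return {name: (zeros, total) for name, (zeros, total) in counts.items()}

# Abbreviated sample (labels shortened for readability).
sample = """\
# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{container_name="cni-plugins",mode="dynamic"} 0
kepler_container_core_joules_total{container_name="cni-plugins",mode="idle"} 0
"""

print(zero_ratio_by_metric(sample))
# {'kepler_container_core_joules_total': (2, 2)}
```

If every metric reports `zeros == total`, the exporter is probably not estimating power at all, as opposed to a few idle containers legitimately reporting zero.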

To reproduce

Install Kepler on OpenShift using:

make build-manifest OPTS="ESTIMATOR_SIDECAR_DEPLOY OPENSHIFT_DEPLOY CLUSTER_PREREQ_DEPLOY MODEL_SERVER_DEPLOY"
feven-NIT commented 1 year ago

And here are the current logs from the kepler-model pod:

Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/CgroupOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed.zip to /data/models/DynComponentPower/CgroupOnly/ScikitMixed.zip
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/BPFOnly/ScikitMixed.zip to /data/models/DynComponentPower/BPFOnly/ScikitMixed.zip
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/KubeletOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/KubeletOnly/ScikitMixed/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json: 404
* Debugger is active!
* Debugger PIN: 982-944-970
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.2.36 - - [23/May/2023 10:04:01] "POST /model HTTP/1.1" 400 -
rootfs commented 1 year ago

Thank you @feven-redhat for testing this!

The Kepler log line container_power.go:105] No ContainerComponentPower Model looks interesting; it probably indicates that the model is not there.

@feven-redhat can you set EXPOSE_IRQ_COUNTER_METRICS=false in the kepler-cfm configmap and restart kepler?

@sunya-ch @KaiyiLiu1234 do the IRQ metrics cause trouble in the dynamic component power model (since they were not part of the original training)?

feven-NIT commented 1 year ago

I retried the deployment on OpenShift without the estimator (just with make build-manifest OPTS=" OPENSHIFT_DEPLOY CLUSTER_PREREQ_DEPLOY") and it works. But when I try with the estimator, or the estimator with the model server, I get the same issue.
Here are my logs for kepler-exporter.

I0601 08:04:33.668743       1 node_energy_collector.go:60] Node components power model collection is supported
I0601 08:04:33.669098       1 container_power.go:105] No ContainerComponentPower Model
I0601 08:04:33.669109       1 metric_collector.go:137] energy from pod/container (0 active processes): name: node-exporter-nst9r/init-textfile namespace: openshift-monitoring 
    cgrouppid: 0 pid: [] comm: 
    Dyn ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) 
    Idle ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) 
    CPUTime:  0 (0)
    NetTX IRQ: 0 (0)
    NetRX IRQ: 0 (0)
    Block IRQ: 0 (0)
    counters: map[cache_miss:0 (0) cpu_cycles:0 (0) cpu_instr:0 (0) cpu_ref_cycles:0 (0)]
    cgroupfs: map[block_devices_used:0 (0) cgroupfs_cpu_usage_us:0 (0) cgroupfs_ioread_bytes:0 (0) cgroupfs_iowrite_bytes:0 (0) cgroupfs_kernel_memory_usage_bytes:0 (0) cgroupfs_memory_usage_bytes:0 (0) cgroupfs_system_cpu_usage_us:0 (0) cgroupfs_tcp_memory_usage_bytes:0 (0) cgroupfs_user_cpu_usage_us:0 (0)]
    kubelets: map[container_cpu_usage_seconds_total:0 (0) container_memory_working_set_bytes:0 (0)]

I0601 08:04:33.669152       1 metric_collector.go:137] energy from pod/container (1 active processes): name: node-exporter-nst9r/kube-rbac-proxy namespace: openshift-monitoring 
    cgrouppid: 0 pid: [4990] comm: kube-rbac-proxy
    Dyn ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) 
    Idle ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) 
    CPUTime:  0 (0)
    NetTX IRQ: 0 (0)
    NetRX IRQ: 0 (0)
    Block IRQ: 0 (0)
    counters: map[cache_miss:0 (0) cpu_cycles:0 (0) cpu_instr:0 (0) cpu_ref_cycles:0 (0)]
    cgroupfs: map[block_devices_used:6 (6) cgroupfs_cpu_usage_us:0 (43600428) cgroupfs_ioread_bytes:0 (0) cgroupfs_iowrite_bytes:0 (0) cgroupfs_kernel_memory_usage_bytes:0 (737280) cgroupfs_memory_usage_bytes:0 (28160000) cgroupfs_system_cpu_usage_us:0 (15900000) cgroupfs_tcp_memory_usage_bytes:0 (0) cgroupfs_user_cpu_usage_us:0 (27690000)]
    kubelets: map[container_cpu_usage_seconds_total:0 (43) container_memory_working_set_bytes:0 (22396928)]

Here are the logs from the estimator:


2023-06-01 08:04:26.906840: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-06-01 08:04:26.906873: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

And here are the logs from the model server:

Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/BPFOnly/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/BPFOnly/KerasCompFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/KubeletOnly/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/KubeletOnly/KerasCompFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/IRQOnly/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/IRQOnly/KerasCompFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/CounterIRQCombined/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/CounterIRQCombined/KerasCompFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/Unknown/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/Unknown/KerasCompFullPipeline/metadata.json: 404
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/Full/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/Full/KerasCompWeightFullPipeline/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/Full/KerasCompWeightFullPipeline/KerasCompWeightFullPipeline.json to /data/models/AbsComponentModelWeight/Full/KerasCompWeightFullPipeline/KerasCompWeightFullPipeline.json
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/WorkloadOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/WorkloadOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/CounterOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/CounterOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/CgroupOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/CgroupOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/BPFOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/BPFOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/KubeletOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/KubeletOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/IRQOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/IRQOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/CounterIRQCombined/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/CounterIRQCombined/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/Unknown/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/Unknown/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/Full/ScikitMixed/metadata.json to /data/models/DynComponentPower/Full/ScikitMixed/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/WorkloadOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/WorkloadOnly/ScikitMixed/metadata.json: 404
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CounterOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/CounterOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CounterOnly/ScikitMixed.zip to /data/models/DynComponentPower/CounterOnly/ScikitMixed.zip
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/CgroupOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed.zip to /data/models/DynComponentPower/CgroupOnly/ScikitMixed.zip
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/BPFOnly/ScikitMixed.zip to /data/models/DynComponentPower/BPFOnly/ScikitMixed.zip
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/KubeletOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/KubeletOnly/ScikitMixed/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json: 404
* Debugger is active!
* Debugger PIN: 350-740-140
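The model server log above is long, so a short script can make it easier to see which model type / feature group combinations were actually downloaded. This is only a sketch: `summarize_model_loads` is a hypothetical helper, and it assumes the log keeps the "Successfully load ... to ..." / "Failed to load ... to ...: 404" wording and the /data/models/ destination prefix shown above.

```python
def summarize_model_loads(log_text, root="/data/models/"):
    """Group model-server download results by (model type, feature group)."""
    loaded, failed = set(), set()
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("Successfully load "):
            ok = True
        elif line.startswith("Failed to load "):
            ok = False
        else:
            continue
        # The destination path follows the last " to "; failed lines
        # append ": <HTTP status>" after the path.
        dest = line.rsplit(" to ", 1)[1].split(": ")[0]
        if not dest.startswith(root):
            continue
        parts = dest[len(root):].split("/")
        if len(parts) < 2:
            continue
        key = (parts[0], parts[1])  # e.g. ("DynComponentPower", "BPFOnly")
        (loaded if ok else failed).add(key)
    # A group counts as failed only if nothing was loaded for it.
    return {"loaded": sorted(loaded), "failed": sorted(failed - loaded)}

base = ("https://raw.githubusercontent.com/sustainable-computing-io/"
        "kepler-model-server/main/tests/test_models/")
sample_log = "\n".join([
    f"Successfully load {base}DynComponentPower/BPFOnly/ScikitMixed/metadata.json"
    " to /data/models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json",
    f"Successfully load {base}DynComponentPower/BPFOnly/ScikitMixed.zip"
    " to /data/models/DynComponentPower/BPFOnly/ScikitMixed.zip",
    f"Failed to load {base}DynComponentPower/IRQOnly/ScikitMixed/metadata.json"
    " to /data/models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json: 404",
    " * Debugger is active!",
])

summary = summarize_model_loads(sample_log)
print("loaded:", summary["loaded"])
print("failed:", summary["failed"])
```

Applied to the full log above, the loaded groups can then be compared with the feature groups the exporter reports as available.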
rootfs commented 1 year ago

container_power.go:105] No ContainerComponentPower Model indicates a missing container power model, which may result in zeros in the power estimates.

@sunya-ch @KaiyiLiu1234

sunya-ch commented 1 year ago

@rootfs One reason is that the kepler-model-server is not functional. However, in that case it should use the initial model weights. Note that the init URL points to a model with hardware counter features, which may not be available on the system.

I0523 10:03:56.736924       1 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CounterOnly/ScikitMixed/ScikitMixed.json): map[dram:{{0.8318076441807906 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0.22602678738192125} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 0.146880994775066} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 0}]}} pkg:{{24.388564716241596 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 15.858373957810427} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 8.25749138735891}]}}]
I0523 10:03:56.736959       1 model.go:79] Model DynComponentModelWeight initiated (true)

I think the problem may also come from the BPF detection, because no active process is detected in the logs above.
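One way to check this across the whole exporter log is to count how many containers are reported with zero active processes. The snippet below is a rough sketch, not part of Kepler: `active_process_summary` is a hypothetical helper and it assumes the "energy from pod/container (N active processes): name: ... namespace: ..." line format shown in the exporter logs above.

```python
import re

# Matches the header line that metric_collector.go prints for each container.
ACTIVE_RE = re.compile(
    r'energy from pod/container \((?P<n>\d+) active processes\): '
    r'name: (?P<name>\S+) namespace: (?P<ns>\S+)'
)

def active_process_summary(log_text):
    """Return (containers_without_processes, containers_with_processes)."""
    idle, active = [], []
    for m in ACTIVE_RE.finditer(log_text):
        target = active if int(m.group("n")) > 0 else idle
        target.append(f'{m.group("ns")}/{m.group("name")}')
    return idle, active

sample_log = (
    "I0601 08:04:33.669109       1 metric_collector.go:137] energy from "
    "pod/container (0 active processes): name: node-exporter-nst9r/init-textfile "
    "namespace: openshift-monitoring \n"
    "I0601 08:04:33.669152       1 metric_collector.go:137] energy from "
    "pod/container (1 active processes): name: node-exporter-nst9r/kube-rbac-proxy "
    "namespace: openshift-monitoring \n"
)

idle, active = active_process_summary(sample_log)
print(f"{len(idle)} without active processes, {len(active)} with")
for entry in idle:
    print("  no processes:", entry)
```

Note that in the log above even the kube-rbac-proxy entry, which has one active process, still reports CPUTime 0 and zero hardware counters, so a nonzero process count alone does not show that the BPF metrics are being collected.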

sunya-ch commented 1 year ago

@feven-redhat Is the problem still there with the latest version?