Open feven-NIT opened 1 year ago
And here is my current log from the kepler-model pod:
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/CgroupOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed.zip to /data/models/DynComponentPower/CgroupOnly/ScikitMixed.zip
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/BPFOnly/ScikitMixed.zip to /data/models/DynComponentPower/BPFOnly/ScikitMixed.zip
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/KubeletOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/KubeletOnly/ScikitMixed/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json: 404
* Debugger is active!
* Debugger PIN: 982-944-970
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.0.119 - - [23/May/2023 10:03:56] "POST /model HTTP/1.1" 400 -
10.129.2.36 - - [23/May/2023 10:04:01] "POST /model HTTP/1.1" 400 -
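To see at a glance which model groups were actually downloaded, the "Successfully load" / "Failed to load" lines above can be summarized per output type and feature group. This is a minimal sketch that only parses the log format shown in this thread; the function name `summarize_loads` and the result format are illustrative and not part of kepler-model-server.

```python
import re

# Matches "Successfully load <url> to <path>" and
# "Failed to load <url> to <path>: <status code>" as printed in the log above.
LOAD_LINE = re.compile(
    r"^(Successfully|Failed to) load (\S+) to (\S+?)(?::\s*(\d+))?$"
)


def summarize_loads(log_text):
    """Map (output_type, feature_group) to "ok" or "failed (<code>)"."""
    summary = {}
    for line in log_text.splitlines():
        m = LOAD_LINE.match(line.strip())
        if m is None:
            continue  # not a load line (e.g. HTTP access log, debugger banner)
        status, url, _dest, code = m.groups()
        # Everything after /test_models/ is <output_type>/<feature_group>/...
        _, sep, rel = url.partition("/test_models/")
        segs = rel.split("/")
        if not sep or len(segs) < 2:
            continue
        key = (segs[0], segs[1])
        if status == "Successfully":
            # Keep an earlier failure for the same group if there was one.
            summary.setdefault(key, "ok")
        else:
            summary[key] = f"failed ({code})"
    return summary
```

Feeding the pod log to `summarize_loads` and printing the result makes it easier to compare the groups that returned 404 with the feature group kepler requests in its `POST /model` calls.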
Thank you @feven-redhat for testing this!
The kepler log line container_power.go:105] No ContainerComponentPower Model
looks interesting; it probably indicates the model is not there.
@feven-redhat can you set EXPOSE_IRQ_COUNTER_METRICS=false
in the kepler-cfm configmap and restart kepler?
@sunya-ch @KaiyiLiu1234 do the IRQ metrics cause trouble in the dynamic component power model (since they were not in the original training)?
I have retried the deployment on OpenShift without the estimator (just with make build-manifest OPTS=" OPENSHIFT_DEPLOY CLUSTER_PREREQ_DEPLOY") and it works. But when I try with the estimator, or the estimator with the model, I get the same issue.
Here is my log for kepler-exporter.
I0601 08:04:33.668743 1 node_energy_collector.go:60] Node components power model collection is supported
I0601 08:04:33.669098 1 container_power.go:105] No ContainerComponentPower Model
I0601 08:04:33.669109 1 metric_collector.go:137] energy from pod/container (0 active processes): name: node-exporter-nst9r/init-textfile namespace: openshift-monitoring
cgrouppid: 0 pid: [] comm:
Dyn ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
Idle ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
CPUTime: 0 (0)
NetTX IRQ: 0 (0)
NetRX IRQ: 0 (0)
Block IRQ: 0 (0)
counters: map[cache_miss:0 (0) cpu_cycles:0 (0) cpu_instr:0 (0) cpu_ref_cycles:0 (0)]
cgroupfs: map[block_devices_used:0 (0) cgroupfs_cpu_usage_us:0 (0) cgroupfs_ioread_bytes:0 (0) cgroupfs_iowrite_bytes:0 (0) cgroupfs_kernel_memory_usage_bytes:0 (0) cgroupfs_memory_usage_bytes:0 (0) cgroupfs_system_cpu_usage_us:0 (0) cgroupfs_tcp_memory_usage_bytes:0 (0) cgroupfs_user_cpu_usage_us:0 (0)]
kubelets: map[container_cpu_usage_seconds_total:0 (0) container_memory_working_set_bytes:0 (0)]
I0601 08:04:33.669152 1 metric_collector.go:137] energy from pod/container (1 active processes): name: node-exporter-nst9r/kube-rbac-proxy namespace: openshift-monitoring
cgrouppid: 0 pid: [4990] comm: kube-rbac-proxy
Dyn ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
Idle ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
CPUTime: 0 (0)
NetTX IRQ: 0 (0)
NetRX IRQ: 0 (0)
Block IRQ: 0 (0)
counters: map[cache_miss:0 (0) cpu_cycles:0 (0) cpu_instr:0 (0) cpu_ref_cycles:0 (0)]
cgroupfs: map[block_devices_used:6 (6) cgroupfs_cpu_usage_us:0 (43600428) cgroupfs_ioread_bytes:0 (0) cgroupfs_iowrite_bytes:0 (0) cgroupfs_kernel_memory_usage_bytes:0 (737280) cgroupfs_memory_usage_bytes:0 (28160000) cgroupfs_system_cpu_usage_us:0 (15900000) cgroupfs_tcp_memory_usage_bytes:0 (0) cgroupfs_user_cpu_usage_us:0 (27690000)]
kubelets: map[container_cpu_usage_seconds_total:0 (43) container_memory_working_set_bytes:0 (22396928)]
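A quick way to check how many containers the exporter reports with zero active processes is to scan the `metric_collector.go` lines. The sketch below relies only on the line format visible in this log; the helper name `idle_containers` is made up for illustration.

```python
import re

# Matches the "energy from pod/container (N active processes): name: ... namespace: ..."
# lines emitted by metric_collector.go in the log above.
ENERGY_LINE = re.compile(
    r"energy from pod/container \((\d+) active processes\): "
    r"name: (\S+) namespace: (\S+)"
)


def idle_containers(log_text):
    """Return (name, namespace) pairs that were reported with 0 active processes."""
    idle = []
    for m in ENERGY_LINE.finditer(log_text):
        count, name, namespace = m.groups()
        if int(count) == 0:
            idle.append((name, namespace))
    return idle
```

Comparing the length of this list with the total number of matching lines gives a rough picture of how much of the node the exporter considers inactive.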
Here is the log from the estimator:
2023-06-01 08:04:26.906840: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-06-01 08:04:26.906873: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
And here is the log from the model:
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/BPFOnly/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/BPFOnly/KerasCompFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/KubeletOnly/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/KubeletOnly/KerasCompFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/IRQOnly/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/IRQOnly/KerasCompFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/CounterIRQCombined/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/CounterIRQCombined/KerasCompFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentPower/Unknown/KerasCompFullPipeline/metadata.json to /data/models/AbsComponentPower/Unknown/KerasCompFullPipeline/metadata.json: 404
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/Full/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/Full/KerasCompWeightFullPipeline/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/Full/KerasCompWeightFullPipeline/KerasCompWeightFullPipeline.json to /data/models/AbsComponentModelWeight/Full/KerasCompWeightFullPipeline/KerasCompWeightFullPipeline.json
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/WorkloadOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/WorkloadOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/CounterOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/CounterOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/CgroupOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/CgroupOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/BPFOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/BPFOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/KubeletOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/KubeletOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/IRQOnly/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/IRQOnly/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/CounterIRQCombined/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/CounterIRQCombined/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/AbsComponentModelWeight/Unknown/KerasCompWeightFullPipeline/metadata.json to /data/models/AbsComponentModelWeight/Unknown/KerasCompWeightFullPipeline/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/Full/ScikitMixed/metadata.json to /data/models/DynComponentPower/Full/ScikitMixed/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/WorkloadOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/WorkloadOnly/ScikitMixed/metadata.json: 404
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CounterOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/CounterOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CounterOnly/ScikitMixed.zip to /data/models/DynComponentPower/CounterOnly/ScikitMixed.zip
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/CgroupOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/CgroupOnly/ScikitMixed.zip to /data/models/DynComponentPower/CgroupOnly/ScikitMixed.zip
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/BPFOnly/ScikitMixed/metadata.json
Successfully load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/BPFOnly/ScikitMixed.zip to /data/models/DynComponentPower/BPFOnly/ScikitMixed.zip
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/KubeletOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/KubeletOnly/ScikitMixed/metadata.json: 404
Failed to load https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json to /data/models/DynComponentPower/IRQOnly/ScikitMixed/metadata.json: 404
* Debugger is active!
* Debugger PIN: 350-740-140
container_power.go:105] No ContainerComponentPower Model
indicates a missing container power model, which may result in zeros in the power estimates.
@sunya-ch @KaiyiLiu1234
@rootfs One reason is that the kepler-model-server is not functional. However, kepler should then fall back to the initial model weights. Note that the init URL points to a model with hardware counter features, which may not be available on the system.
I0523 10:03:56.736924 1 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CounterOnly/ScikitMixed/ScikitMixed.json): map[dram:{{0.8318076441807906 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0.22602678738192125} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 0.146880994775066} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 0}]}} pkg:{{24.388564716241596 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 15.858373957810427} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 8.25749138735891}]}}]
I0523 10:03:56.736959 1 model.go:79] Model DynComponentModelWeight initiated (true)
I think the problem may also come from the BPF detection, because there is no active process detected above.
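To make the weights printed by `lr.go` above easier to reason about, here is a small sketch of how such a linear model could be evaluated. It rests on two assumptions that the log does not confirm: that each triple is (mean, variance, weight), and that each feature is standardized as (x - mean) / sqrt(variance) before being weighted. The numbers are copied from the `pkg` entry of the log line; the function name `predict` is hypothetical.

```python
import math

# Copied from the "pkg" entry of the lr.go log line above.
PKG_INTERCEPT = 24.388564716241596
# ASSUMPTION: each triple is (mean, variance, weight); the log does not label them.
PKG_FEATURES = {
    "cache_miss": (1.9615158772131525e07, 8.320351534182539e13, 0.0),
    "cpu_cycles": (1.6455473194342838e10, 5.6028839338593524e20, 15.858373957810427),
    "cpu_instr": (2.3490652312518856e10, 3.0816587041591017e21, 8.25749138735891),
}


def predict(intercept, features, usage):
    """Evaluate intercept + sum(weight * standardized feature).

    ASSUMPTION: features are standardized with the logged mean and variance.
    Counters missing from `usage` are treated as 0.
    """
    total = intercept
    for name, (mean, variance, weight) in features.items():
        x = usage.get(name, 0.0)
        total += weight * (x - mean) / math.sqrt(variance)
    return total


# Example call with made-up counter values.
estimate = predict(PKG_INTERCEPT, PKG_FEATURES, {"cpu_cycles": 2.0e10, "cpu_instr": 3.0e10})
```

Under these assumptions the intercept is the prediction when every counter equals its training mean, so the result depends entirely on the hardware counter values that kepler is able to collect.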
@feven-redhat Is the problem still there with the latest version?
Describe the bug
I'm trying to install kepler on openshift (4.12 and kube 1.25), but I get no data in Grafana, and when I look at /metrics most of the values are zero.
Here is the log from the kepler exporter
Here is an example of values in the metrics
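One way to quantify "most values are zero" is to count zero-valued samples in the text that /metrics returns. The sketch below parses the Prometheus text exposition format from a string; the function name `zero_ratio`, the `kepler_` prefix filter, and the metric names in the sample are illustrative assumptions rather than output taken from this cluster.

```python
def zero_ratio(exposition_text, prefix="kepler_"):
    """Return (zero_samples, total_samples) for metrics whose name starts with prefix."""
    total = zeros = 0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blanks, "# HELP" / "# TYPE" comments, and unrelated metrics.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        # Drop the name and optional {labels}; what remains is "value [timestamp]".
        _, brace, tail = line.rpartition("}")
        rest = tail if brace else line.partition(" ")[2]
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # malformed sample line
        total += 1
        if value == 0:
            zeros += 1
    return zeros, total


# Illustrative sample only; these lines are not copied from the reporter's cluster.
SAMPLE = """\
# TYPE kepler_container_joules_total counter
kepler_container_joules_total{container_name="a",mode="dynamic"} 0
kepler_container_joules_total{container_name="b",mode="dynamic"} 12.5
kepler_example_no_labels 0
go_goroutines 42
"""
zeros, total = zero_ratio(SAMPLE)
```

The text to analyze can be saved from the exporter's /metrics endpoint and passed in as a string, which keeps the sketch independent of how the endpoint is reached.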
To reproduce
Install kepler on openshift using Using