sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, applies ML models to estimate workload energy consumption from those stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

panic when using pre-trained model #1476

Open lianhao opened 5 months ago

lianhao commented 5 months ago

What happened?

When running Kepler in Kubernetes with a pre-trained model to estimate process power, the Kepler pod panics shortly after launch.

The models were trained by following the Kepler Model Server Tekton training process, using the complete run.

The Kepler container goes into an error state just after it starts:

<omit>
I0529 02:05:36.588422  690634 exporter.go:175] starting to listen on 0.0.0.0:9102
I0529 02:05:36.588445  690634 exporter.go:181] Started Kepler in 2.243991957s
I0529 02:05:39.594488  690634 exporter.go:457] successfully get data with batch get and delete with 700 pids in 3.298332ms
I0529 02:05:39.914526  690634 estimate.go:139] estimator unmarshal error: json: cannot unmarshal array into Go struct field ComponentPowerResponse.powers of type map[string][]float64 ({"powers": [], "msg": "\"None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]\"\n"})
I0529 02:05:39.914657  690634 process_energy.go:210] Could not estimate the Process Platform Power
panic: runtime error: index out of range [0] with length 0

goroutine 33 [running]:
github.com/sustainable-computing-io/kepler/pkg/model.addEstimatedEnergy({0xc000746400, 0x3d, 0xc00075c820?}, 0x0?, 0x1)
        /workspace/pkg/model/process_energy.go:219 +0xbf0
github.com/sustainable-computing-io/kepler/pkg/model.UpdateProcessEnergy(0xc0005d4000?, 0xc000b88660?)
        /workspace/pkg/model/process_energy.go:145 +0x145
github.com/sustainable-computing-io/kepler/pkg/collector/energy.UpdateProcessEnergy(...)
        /workspace/pkg/collector/energy/process_energy_collector.go:26
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).UpdateProcessEnergyUtilizationMetrics(...)
        /workspace/pkg/collector/metric_collector.go:152
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).UpdateEnergyUtilizationMetrics(0xc0005d4000)
        /workspace/pkg/collector/metric_collector.go:139 +0x2a
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?)
        /workspace/pkg/collector/metric_collector.go:113 +0x65
github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
        /workspace/pkg/manager/manager.go:75 +0x7b
created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start in goroutine 1
        /workspace/pkg/manager/manager.go:67 +0x65
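
For context, here is a minimal standalone sketch of the failure mode visible in the log: the estimator replies with `"powers"` as a JSON array instead of the map the exporter expects, the unmarshal fails, and the empty result later panics when indexed. The struct and the `"package"` key below are simplified stand-ins, not Kepler's actual `ComponentPowerResponse` code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// componentPowerResponse is a simplified stand-in for the exporter's
// ComponentPowerResponse: "powers" is expected to be a map of component
// name to power values.
type componentPowerResponse struct {
	Powers map[string][]float64 `json:"powers"`
	Msg    string               `json:"msg"`
}

func main() {
	// The estimator replied with "powers" as an empty JSON array, not a map,
	// plus the pandas error message seen in the kepler-estimator log.
	payload := []byte(`{"powers": [], "msg": "feature mismatch"}`)

	var resp componentPowerResponse
	if err := json.Unmarshal(payload, &resp); err != nil {
		// Corresponds to the "cannot unmarshal array into Go struct field" line.
		fmt.Println("estimator unmarshal error:", err)
	}

	// resp.Powers stays empty, so indexing the first value panics with
	// "index out of range [0] with length 0", matching the stack trace above.
	pkg := resp.Powers["package"] // "package" is just a hypothetical component key
	fmt.Println(pkg[0])
}
```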

There are some errors in the kepler-estimator container too:

<omit>
failed to get model from request {"metrics":["bpf_page_cache_hit","task_clock_ms","bpf_cpu_time_ms","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_instructions","cache_miss"],"values":[[0,0,0,0,0,0,0,0,0]],"output_type":"DynPower","source":"acpi","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"GradientBoostingRegressorTrainer","filter":""}
get archived model
failed to get model from request {"metrics":["bpf_page_cache_hit","task_clock_ms","bpf_cpu_time_ms","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_instructions","cache_miss"],"values":[[0,0,0,0,0,0,0,0,0]],"output_type":"DynPower","source":"intel_rapl","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"GradientBoostingRegressorTrainer","filter":""}
get archived model
failed to get model from request {"metrics":["bpf_page_cache_hit","task_clock_ms","bpf_cpu_time_ms","bpf_net_tx_irq","bpf_net_rx_irq","bpf_block_irq","cpu_cycles","cpu_instructions","cache_miss"],"values":[[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0]],"output_type":"DynPower","source":"intel_rapl","system_features":["cpu_architecture"],"system_values":["Skylake"],"trainer_name":"GradientBoostingRegressorTrainer","filter":""}
GradientBoostingRegressorTrainer_1 fail to predict, removed: "None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]"
<omit>
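
The pandas error in the last line suggests that the feature column the model expects is not among the metric names in the request. A small diagnostic sketch (hypothetical, not part of Kepler) that reproduces the mismatch from the names visible in these logs:

```go
package main

import "fmt"

func main() {
	// Metric names sent in the failing request above (current Kepler exporter).
	requestMetrics := []string{
		"bpf_page_cache_hit", "task_clock_ms", "bpf_cpu_time_ms",
		"bpf_net_tx_irq", "bpf_net_rx_irq", "bpf_block_irq",
		"cpu_cycles", "cpu_instructions", "cache_miss",
	}
	// Feature name the trained model asks for, taken from the pandas error.
	modelFeatures := []string{"bpf_cpu_time_us"}

	available := make(map[string]bool, len(requestMetrics))
	for _, m := range requestMetrics {
		available[m] = true
	}
	for _, f := range modelFeatures {
		if !available[f] {
			fmt.Printf("model feature %q is missing from the request metrics\n", f)
		}
	}
}
```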

The complete Kepler log can be found here: kepler.log. The complete kepler-estimator log can be found here: kepler-estimator.log.

What did you expect to happen?

Kepler should run without any panics.
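
Ideally a failed estimation would surface as a logged error rather than crashing the exporter. A minimal sketch of that kind of bounds check (a hypothetical helper, not the actual `pkg/model/process_energy.go` code):

```go
package main

import (
	"fmt"
	"log"
)

// addEstimatedEnergyGuarded is a hypothetical helper, not the actual code in
// pkg/model/process_energy.go: it refuses to index into an empty power slice
// when the estimator could not produce a prediction.
func addEstimatedEnergyGuarded(powers []float64, idx int) (float64, error) {
	if idx >= len(powers) {
		return 0, fmt.Errorf("estimator returned %d power values, wanted index %d", len(powers), idx)
	}
	return powers[idx], nil
}

func main() {
	// Empty estimator result, as happens when the model features do not match.
	if _, err := addEstimatedEnergyGuarded(nil, 0); err != nil {
		log.Println("skipping process energy update:", err)
	}
}
```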

How can we reproduce it (as minimally and precisely as possible)?

Run Kepler with the Kepler deployment configuration shown below.

Anything else we need to know?

No response

Kepler image tag

kepler: quay.io/sustainable_computing_io/kepler:latest
estimator: quay.io/sustainable_computing_io/kepler_model_server:latest

Kubernetes version

```console
$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2
```

Cloud provider or bare metal

bare metal

OS version

```console
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

$ uname -a
Linux onap02 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

Using manifest

Kepler deployment config

On Kubernetes:

```console
$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl describe configmap kepler-cfm -n ${KEPLER_NAMESPACE}
Name:         kepler-cfm
Namespace:    kepler
Labels:       sustainable-computing.io/app=kepler
Annotations:

Data
====
CGROUP_METRICS:
----
*
EXPOSE_IRQ_COUNTER_METRICS:
----
true
KEPLER_LOG_LEVEL:
----
5
KEPLER_NAMESPACE:
----
kepler
MODEL_CONFIG:
----
PROCESS_COMPONENTS_ESTIMATOR=true
PROCESS_COMPONENTS_INIT_URL=http://onap01.sh.intel.com/kepler_models/CompleteTrainPipelineExample/intel_rapl/DynPower/Basic/GradientBoostingRegressorTrainer_0.zip
PROCESS_COMPONENTS_TRAINER=GradientBoostingRegressorTrainer
PROCESS_TOTAL_ESTIMATOR=true
PROCESS_TOTAL_INIT_URL=http://onap01.sh.intel.com/kepler_models/CompleteTrainPipelineExample/acpi/DynPower/Basic/GradientBoostingRegressorTrainer_0.zip
PROCESS_TOTAL_TRAINER=GradientBoostingRegressorTrainer
PROMETHEUS_SCRAPE_INTERVAL:
----
30s
CPU_ARCH_OVERRIDE:
----

ENABLE_GPU:
----
true
ENABLE_QAT:
----
false
EXPOSE_CGROUP_METRICS:
----
false
EXPOSE_HW_COUNTER_METRICS:
----
true
BIND_ADDRESS:
----
0.0.0.0:9102
ENABLE_PROCESS_METRICS:
----
false
MAX_LOOKUP_RETRY:
----
1
REDFISH_PROBE_INTERVAL_IN_SECONDS:
----
60
ENABLE_EBPF_CGROUPID:
----
true
METRIC_PATH:
----
/metrics
REDFISH_SKIP_SSL_VERIFY:
----
true

BinaryData
====

Events:

# provide kepler deployment description
$ kubectl describe daemonset kepler-exporter -n ${KEPLER_NAMESPACE}
Name:           kepler-exporter
Selector:       app.kubernetes.io/component=exporter,app.kubernetes.io/name=kepler-exporter,sustainable-computing.io/app=kepler
Node-Selector:
Labels:         sustainable-computing.io/app=kepler
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/component=exporter
                    app.kubernetes.io/name=kepler-exporter
                    sustainable-computing.io/app=kepler
  Service Account:  kepler-sa
  Containers:
   kepler-exporter:
    Image:      quay.io/sustainable_computing_io/kepler:latest
    Port:       9102/TCP
    Host Port:  0/TCP
    Command:
      /bin/sh
      -c
    Args:
      until [ -e /tmp/estimator.sock ]; do sleep 1; done && /usr/bin/kepler -v=5
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:    (v1:status.hostIP)
      NODE_NAME:  (v1:spec.nodeName)
    Mounts:
      /etc/kepler/kepler.config from cfm (ro)
      /etc/redfish from redfish (ro)
      /lib/modules from lib-modules (ro)
      /proc from proc (rw)
      /sys from tracing (ro)
      /tmp from tmp (rw)
      /usr/src from usr-src (rw)
      /var/run from var-run (rw)
   estimator:
    Image:      quay.io/sustainable_computing_io/kepler_model_server:latest
    Port:
    Host Port:
    Command:
      python3.8
    Args:
      -u
      src/estimate/estimator.py
    Environment:
    Mounts:
      /etc/kepler/kepler.config from cfm (ro)
      /mnt from mnt (rw)
      /tmp from tmp (rw)
  Volumes:
   tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
   mnt:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
   proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc-host
    HostPathType:  Directory
   usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src
    HostPathType:  Directory
   lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  Directory
   tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
   var-run:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:  Directory
   cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
   redfish:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redfish-4kh9d7bc7m
    Optional:    false
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  29m   daemonset-controller  Created pod: kepler-exporter-9wxsp
```

Container runtime (CRI) and version (if applicable)

containerd://1.7.13

Related plugins (CNI, CSI, ...) and versions (if applicable)

CNI: kindnet

sunya-ch commented 5 months ago

It seems the trained power model uses the CPU time metric exported by Kepler before v0.7 (bpf_cpu_time_us); however, the estimation is requested by the new Kepler (which exports bpf_cpu_time_ms). You may have to retrain the power model with the new Kepler version.

I0529 02:05:34.744887  690634 utils.go:86] Available ebpf counters: [bpf_page_cache_hit task_clock_ms bpf_cpu_time_ms bpf_net_tx_irq bpf_net_rx_irq bpf_block_irq cpu_cycles cpu_instructions cache_miss]
...
I0529 02:05:39.914526  690634 estimate.go:139] estimator unmarshal error: json: cannot unmarshal array into Go struct field ComponentPowerResponse.powers of type map[string][]float64 ({"powers": [], "msg": "\"None of [Index(['bpf_cpu_time_us'], dtype='object')] are in the [columns]\"\n"})