sustainable-computing-io / kepler-model-server

Model Server for Kepler
Apache License 2.0

Hidden power consumption of Kepler seems to significantly increase #359

Open sunya-ch opened 3 months ago

sunya-ch commented 3 months ago

What happened?

Data source: ec2 spot instance 5c.metal

This issue describes a significant difference between the power metrics collected in February and the power metrics collected in July. While CPU time is comparable in both, the power consumption in July seems to rise much more steeply from the beginning, even with a small load. The power of this machine seems to saturate around 450. These power numbers come directly from Intel RAPL.
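For readers unfamiliar with how a power number is derived from RAPL, here is a minimal sketch, not Kepler's actual code. It assumes energy is sampled as a cumulative microjoule counter (as the Linux powercap interface exposes in `energy_uj`) that wraps at a maximum range (`max_energy_range_uj`). The function name and the sample values are hypothetical placeholders, not measurements from this issue.

```python
def rapl_power_watts(e0_uj, e1_uj, dt_s, max_range_uj):
    """Average power in watts between two cumulative RAPL energy samples.

    e0_uj, e1_uj: counter readings in microjoules (earlier, later).
    dt_s: elapsed time between the two readings, in seconds.
    max_range_uj: value at which the counter wraps (assumed to wrap at most once).
    """
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:
        # The counter wrapped between samples; add the range back.
        delta_uj += max_range_uj
    # Convert microjoules to joules, then divide by time to get watts.
    return delta_uj / 1e6 / dt_s


# Placeholder readings one second apart (illustrative only).
power = rapl_power_watts(1_000_000, 451_000_000, 1.0, 2**32)
print(power)
```

Sampling this at a fixed interval alongside CPU time is one way to check whether the steeper curve comes from the raw counter or from later processing.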

Further investigation found that in July the CPU instruction counter is much higher than in February.

previously (around Feb 2024)

Screenshot 2024-08-09 at 14 37 26 Screenshot 2024-08-09 at 14 48 52

current (July 2024)

Screenshot 2024-08-09 at 14 42 47

Screenshot 2024-08-09 at 14 48 44

What did you expect to happen?

The increase in CPU instructions used by Kepler should be explainable. We should investigate more metrics, since CPU time alone is not enough for modeling.
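One way to make the comparison concrete is to normalize instructions and energy by CPU time for each run and look at the relative change. The sketch below is only illustrative: the `RunSample` type, the helper names, and the numbers are hypothetical placeholders, not data from the February or July collections.

```python
from dataclasses import dataclass


@dataclass
class RunSample:
    cpu_time_s: float        # CPU time attributed to the process, in seconds
    cpu_instructions: float  # retired instructions over the same window
    energy_j: float          # energy over the same window, in joules


def per_cpu_second(sample):
    """Normalize instructions and energy by CPU time."""
    if sample.cpu_time_s <= 0:
        raise ValueError("cpu_time_s must be positive")
    return {
        "instructions_per_cpu_s": sample.cpu_instructions / sample.cpu_time_s,
        "joules_per_cpu_s": sample.energy_j / sample.cpu_time_s,
    }


def relative_change(old, new):
    """Relative change of each normalized metric, e.g. 2.0 means +200%."""
    return {key: (new[key] - old[key]) / old[key] for key in old}


# Placeholder values only: same CPU time, different instructions and energy.
feb = RunSample(cpu_time_s=10.0, cpu_instructions=1e9, energy_j=50.0)
jul = RunSample(cpu_time_s=10.0, cpu_instructions=3e9, energy_j=150.0)
change = relative_change(per_cpu_second(feb), per_cpu_second(jul))
print(change)
```

If the two ratios move together while CPU time stays flat, that would support using instruction counts as an additional model feature.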

How can we reproduce it (as minimally and precisely as possible)?

Run the Kepler release from February and the Kepler release from July separately.

Anything else we need to know?

No response

Kepler image tag

0.7.0 and 0.7.11

Deployment

Kepler model server image tag if deployed

Kepler estimator image tag if deployed

Kepler online trainer image tag if deployed

Kepler offline trainer image tag if deployed

Kepler profiler image tag if deployed

Kubernetes version

```console
$ kubectl version
# paste output here
```

Install tools

Kepler deployment config

For on kubernetes:

```console
$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here

# provide kepler model server configmap if Kepler Model Server is deployed
$ kubectl get configmap kepler-model-server-cfm -n ${KEPLER_NAMESPACE}
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
```

For standalone:

# put your Kepler command argument here

sunya-ch commented 3 months ago

Feb build info:

{"kepler_exporter_build_info": [{"metric": {"name": "kepler_exporter_build_info", "container": "kepler-exporter", "endpoint": "http", "goarch": "amd64", "goos": "linux", "goversion": "go1.20.10", "instance": "kind-for-training-control-plane", "job": "kepler-exporter", "namespace": "kepler", "pod": "kepler-exporter-xbkgb", "revision": "unknown", "service": "kepler-exporter", "tags": "include_gcs,include_oss,containers_image_openpgp,gssapi,providerless,netgo,osusergo,gpu,libbpf,linux"}

July build info:

{ "kepler_exporter_build_info": [{"metric": {"name": "kepler_exporter_build_info", "arch": "amd64", "branch": "main", "container": "kepler-exporter", "endpoint": "http", "instance": "kind-for-training-control-plane", "job": "kepler-exporter", "namespace": "kepler", "os": "linux", "pod": "kepler-exporter-qp8cc", "revision": "bf1f62d8c580aa742d4ae90dedaff70044be9b78", "service": "kepler-exporter", "version": "v0.7.11"}

vimalk78 commented 3 months ago

Since this issue is about Kepler's power metrics, should it be a Kepler issue or a model-server issue?

vimalk78 commented 3 months ago

> We should further investigate more metrics since CPU time is not enough for modeling.

Can you please elaborate on this? Do we need to use more of the metrics Kepler already provides, or does Kepler itself need to produce more metrics to be used as new features in the model?