Open sunya-ch opened 3 months ago
Feb build info:
{"kepler_exporter_build_info": [{"metric": {"name": "kepler_exporter_build_info", "container": "kepler-exporter", "endpoint": "http", "goarch": "amd64", "goos": "linux", "goversion": "go1.20.10", "instance": "kind-for-training-control-plane", "job": "kepler-exporter", "namespace": "kepler", "pod": "kepler-exporter-xbkgb", "revision": "unknown", "service": "kepler-exporter", "tags": "include_gcs,include_oss,containers_image_openpgp,gssapi,providerless,netgo,osusergo,gpu,libbpf,linux"}
July build info:
{ "kepler_exporter_build_info": [{"metric": {"name": "kepler_exporter_build_info", "arch": "amd64", "branch": "main", "container": "kepler-exporter", "endpoint": "http", "instance": "kind-for-training-control-plane", "job": "kepler-exporter", "namespace": "kepler", "os": "linux", "pod": "kepler-exporter-qp8cc", "revision": "bf1f62d8c580aa742d4ae90dedaff70044be9b78", "service": "kepler-exporter", "version": "v0.7.11"}
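To see at a glance how the two build-info samples differ, their label sets can be diffed. This is only a minimal sketch: `diff_labels` is a hypothetical helper, not part of Kepler, and the two dictionaries are copied from the label values in the samples above.

```python
# Labels copied from the Feb and July build-info samples above.
feb_labels = {
    "name": "kepler_exporter_build_info",
    "container": "kepler-exporter",
    "endpoint": "http",
    "goarch": "amd64",
    "goos": "linux",
    "goversion": "go1.20.10",
    "instance": "kind-for-training-control-plane",
    "job": "kepler-exporter",
    "namespace": "kepler",
    "pod": "kepler-exporter-xbkgb",
    "revision": "unknown",
    "service": "kepler-exporter",
    "tags": "include_gcs,include_oss,containers_image_openpgp,gssapi,"
            "providerless,netgo,osusergo,gpu,libbpf,linux",
}

july_labels = {
    "name": "kepler_exporter_build_info",
    "arch": "amd64",
    "branch": "main",
    "container": "kepler-exporter",
    "endpoint": "http",
    "instance": "kind-for-training-control-plane",
    "job": "kepler-exporter",
    "namespace": "kepler",
    "os": "linux",
    "pod": "kepler-exporter-qp8cc",
    "revision": "bf1f62d8c580aa742d4ae90dedaff70044be9b78",
    "service": "kepler-exporter",
    "version": "v0.7.11",
}


def diff_labels(old: dict, new: dict) -> dict:
    """Hypothetical helper: report labels removed, added, or changed."""
    old_keys, new_keys = set(old), set(new)
    return {
        "removed": sorted(old_keys - new_keys),
        "added": sorted(new_keys - old_keys),
        "changed": {
            key: (old[key], new[key])
            for key in sorted(old_keys & new_keys)
            if old[key] != new[key]
        },
    }


if __name__ == "__main__":
    for kind, value in diff_labels(feb_labels, july_labels).items():
        print(kind, value)
```

The diff only shows that the build-info label schema changed between the two builds and that the Feb build does not report a usable revision; it does not by itself explain the difference in power or instruction counts.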
Since this issue is about Kepler's power metrics, should it be a Kepler issue or a model-server issue?
We should further investigate more metrics since CPU time is not enough for modeling.
Can you please elaborate on this? Do we need to use more of the metrics Kepler already provides, or does Kepler itself need to produce more metrics to be used as new features in the model?
What happened?
Data source: EC2 spot instance 5c.metal
This issue describes a significant difference between the power metrics collected in February and those collected in July. While CPU time looks reasonable in both, the power consumption in July increases much more steeply from the beginning, even under a small load. The power of this machine seems to saturate around 450. These power numbers come directly from Intel RAPL.
Further investigation found that the CPU instruction counter in July is much higher than in February.
previously (around Feb 2024)
current (July 2024)
What did you expect to happen?
The increase in CPU instructions used by Kepler should be explainable. We should further investigate more metrics since CPU time is not enough for modeling.
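One way to make the two collections comparable is to normalize the instruction and energy counters by CPU time over the same interval. The sketch below assumes three cumulative counters sampled at the start and end of an interval; `CounterSample`, its field names, and `interval_features` are hypothetical and do not correspond to specific Kepler metric names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Hypothetical snapshot of cumulative counters at one point in time."""
    cpu_time_ms: float       # cumulative CPU time
    cpu_instructions: float  # cumulative retired instructions
    energy_joules: float     # cumulative energy (e.g. read from RAPL)


def interval_features(start: CounterSample, end: CounterSample) -> dict:
    """Return per-CPU-millisecond rates over the interval [start, end]."""
    d_time = end.cpu_time_ms - start.cpu_time_ms
    if d_time <= 0:
        raise ValueError("CPU time must increase between the two samples")
    d_instructions = end.cpu_instructions - start.cpu_instructions
    d_energy = end.energy_joules - start.energy_joules
    return {
        "instructions_per_cpu_ms": d_instructions / d_time,
        "joules_per_cpu_ms": d_energy / d_time,
    }
```

If the instructions-per-CPU-millisecond rate differs strongly between the Feb and July collections under the same load, that would point at a change in how instructions are counted rather than in the workload itself, which is the kind of explanation this issue asks for.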
How can we reproduce it (as minimally and precisely as possible)?
Run the Kepler release from February and the Kepler release from July separately.
Anything else we need to know?
No response
Kepler image tag
Deployment
Kepler model server image tag if deployed
Kepler estimator image tag if deployed
Kepler online trainer image tag if deployed
Kepler offline trainer image tag if deployed
Kepler profiler image tag if deployed
Kubernetes version
Install tools
Kepler deployment config