sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

kepler-action v0.0.8 does not deploy clusters #1697

Open dave-tucker opened 2 months ago

dave-tucker commented 2 months ago

What happened?

Dependabot tried to upgrade to v0.0.8 but it failed to deploy the cluster.

What did you expect to happen?

It should deploy the cluster.

How can we reproduce it (as minimally and precisely as possible)?

Create a PR to bump that dependency

Anything else we need to know?

No response

Kepler image tag

Kubernetes version

```console
$ kubectl version
# paste output here
```

Cloud provider or bare metal

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Kepler deployment config

For Kubernetes:
```console
$ KEPLER_NAMESPACE=kepler
# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here
# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
```
For standalone:
# put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

dave-tucker commented 2 months ago

@SamYuan1990 can you take a look at this? You can easily reproduce the failure by opening a PR that updates the version of the kepler action used in this repo.

SamYuan1990 commented 2 months ago

> @SamYuan1990 can you take a look at this? You can easily reproduce the failure by opening a PR that updates the version of the kepler action used in this repo.

I opened such a PR, but Dependabot closed it automatically. Ref: https://github.com/sustainable-computing-io/kepler/pull/1673

SamYuan1990 commented 2 months ago

> @SamYuan1990 can you take a look at this? You can easily reproduce the failure by opening a PR that updates the version of the kepler action used in this repo.

Has the unit test failure been fixed? In the previous PR, CI was broken by an e2e test failure: https://github.com/sustainable-computing-io/kepler/actions/runs/10319483843/job/28568774501

SamYuan1990 commented 2 months ago
kepler_node_info{components_power_source="estimator",cpu_architecture="Zen 3",platform_power_source="none",source="os"} 1
=== RUN   TestE2eTest
Running Suite: E2eTest Suite - /home/runner/work/kepler/kepler/e2e/integration-test
===================================================================================
Random Seed: 1723208972

Will run 12 of 12 specs
time="2024-08-09T13:09:32Z" level=info msg="Parsing Metrics..."
••••••S
------------------------------
• [FAILED] [0.001 seconds]
Metrics check should pass Check pod level metrics for details, no zero value metric should be found [It] Entry: kepler_container_core_joules_total
/home/runner/work/kepler/kepler/e2e/integration-test/e2e_metric_test.go:310

  [FAILED] Metric kepler_container_core_joules_total should exists for pod kepler-exporter-x5j5j
  Expected
      <bool>: false
  to be true
  In [It] at: /home/runner/work/kepler/kepler/e2e/integration-test/e2e_metric_test.go:246 @ 08/09/24 13:09:32.9
------------------------------
SSSS

Summarizing 1 Failure:
  [FAIL] Metrics check should pass Check pod level metrics for details, no zero value metric should be found [It] Entry: kepler_container_core_joules_total
  /home/runner/work/kepler/kepler/e2e/integration-test/e2e_metric_test.go:246

Ran 7 of 12 Specs in 0.153 seconds
FAIL! -- 6 Passed | 1 Failed | 0 Pending | 5 Skipped
--- FAIL: TestE2eTest (0.16s)
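
For readers unfamiliar with this check, the failing assertion above appears to verify that the scraped Prometheus output contains at least one `kepler_container_core_joules_total` sample for the named pod. The following is a minimal Python sketch of that kind of check, not the repository's actual Go/Ginkgo test. The function name `metric_exists_for_pod` is hypothetical, and the `pod_name` label key is an assumption about how Kepler labels container metrics.

```python
import re

# Matches one sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# Matches individual key="value" label pairs, allowing escaped characters.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def metric_exists_for_pod(metrics_text, metric, pod, pod_label="pod_name"):
    """Return True if any sample of `metric` carries pod_label == pod.

    `pod_label` defaults to "pod_name", which is an assumption about
    Kepler's label naming; pass a different key if the exporter differs.
    """
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if labels.get(pod_label) == pod:
            return True
    return False
```

Called as `metric_exists_for_pod(text, "kepler_container_core_joules_total", "kepler-exporter-x5j5j")` against the exporter's metrics output, a `False` result would correspond to the `Expected <bool>: false to be true` failure in the log.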
dave-tucker commented 2 months ago

@SamYuan1990 Yes, that failure was fixed in #1686. I had to ignore the kepler-action dependency in the last Dependabot update because it was failing to deploy the cluster.