hubblo-org / scaphandre

⚡ Energy consumption metrology agent. Let "scaph" dive and bring back the metrics that will help you make your systems and applications more sustainable !
Apache License 2.0

Kubernetes Scaphandre Deployment reporting 0 W #353

Open eduardogomescampos1 opened 9 months ago

eduardogomescampos1 commented 9 months ago

Bug description

First of all, I would like to thank the whole Scaphandre team for this tool. It has been extremely helpful so far!

The bug is that some nodes of my local k8s cluster report 0 W of consumption. To illustrate the issue, the Screenshots section includes a screenshot of the official Scaphandre Grafana dashboard. Each color represents a node and, as you can see, 3 of them report 0 W.

What intrigues me most is that if I run Scaphandre locally, I get actual values. There is also a screenshot of the logs from a local execution of Scaphandre on one of the nodes that report 0 W in the k8s version. So Scaphandre can obtain those metrics locally, but the pods in the k8s cluster cannot.

Running "kubectl logs 'scaphandre pod'" has been of no help, since it just returns: "Scaphandre prometheus exporter Sending ⚡ metrics Press CTRL-C to stop scaphandre". Describing the pods does not return anything worth mentioning either. Note that the firewall is disabled on all cluster machines. Could you give any insights on solving this, please?

To Reproduce

  1. Create a k8s cluster using Calico CNI following its documentation
  2. Create a deployment for Grafana and Prometheus (following these tutorials: https://devopscube.com/setup-grafana-kubernetes/ and https://devopscube.com/setup-prometheus-monitoring-on-kubernetes/)
  3. Deploy Scaphandre from its Helm Chart
  4. Open Scaphandre Grafana dashboard and verify that some nodes report 0W

Expected behavior

The Grafana dashboard should report the same values as the local execution rather than 0 W.

Screenshots

image

Additional context

One interesting aspect is that all of the malfunctioning machines were formatted quite recently, so I'm guessing there might be a misconfiguration somewhere.

mmadoo commented 9 months ago

Which docker tag are you using, and what is the value of the metric scaph_self_version?

I am using the dev tag and got version 0.5. My metrics for scaph_process_power_consumption_microwatts are fine.

image
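
For anyone comparing nodes, the check described above (reading scaph_self_version and the per-process power metric) can be scripted against the text that a Prometheus exporter serves. The following is a minimal sketch, not part of Scaphandre: the helper names `metric_values` and `total_watts` are hypothetical, the sample text is made up for illustration, and the parser only handles simple `name{labels} value` lines.

```python
# Hypothetical helpers for inspecting Prometheus text exposition output.
# SAMPLE is invented data, not real output from any node in this thread.
SAMPLE = """\
# HELP scaph_self_version Version number of scaphandre (made-up help text)
scaph_self_version 0.5
scaph_process_power_consumption_microwatts{exe="a",pid="1"} 1500000
scaph_process_power_consumption_microwatts{exe="b",pid="2"} 500000
"""


def metric_values(exposition_text, metric_name):
    """Return every sample value recorded for metric_name, as floats."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: the value follows the closing brace.
            name, _, rest = line.partition("{")
            tail = rest.rpartition("}")[2]
        else:
            # Unlabelled sample: the value follows the first space.
            name, _, tail = line.partition(" ")
        fields = tail.split()
        if name.strip() == metric_name and fields:
            values.append(float(fields[0]))
    return values


def total_watts(exposition_text,
                metric_name="scaph_process_power_consumption_microwatts"):
    """Sum a microwatt metric over all samples and convert to watts."""
    return sum(metric_values(exposition_text, metric_name)) / 1_000_000


print(metric_values(SAMPLE, "scaph_self_version"))
print(total_watts(SAMPLE))
```

A node whose total comes out as 0 while scaph_self_version is still reported would match the symptom in this issue: the exporter is running, but it is not getting energy readings.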

eduardogomescampos1 commented 9 months ago

> Which docker tag are you using, and what is the value of the metric scaph_self_version?
>
> I am using the dev tag and got version 0.5. My metrics for scaph_process_power_consumption_microwatts are fine.
>
> image

All nodes return 0.5 for this metric. I installed the Helm chart from the dev branch, using the dev tag as well. I also noticed that whenever I run the quick Docker version (as in https://hubblo-org.github.io/scaphandre-documentation/tutorials/installation-linux) on one of the malfunctioning nodes, it also reports 0 W. I feel like this has something to do with the container not being allowed to access the proper files, even though I have disabled all firewalls and run chmod 777 on both /sys/class/powercap and /proc (for testing purposes). I'm wondering why only one node is able to get the measurements correctly.

image Docker quick version output

eduardogomescampos1 commented 9 months ago

Now I've tried to run the dev image locally and there is a warning: "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 2, kind: NotFound, message: "No such file or directory" }". However, as I stated before, I have run chmod -R 777 on this folder and disabled the firewall. What could be causing this?
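
To narrow down a warning like this one, it can help to check, from the same environment the exporter runs in (host or container), whether each RAPL energy counter exists and is readable. Below is a minimal sketch under stated assumptions: `probe_rapl` is a hypothetical helper, not a Scaphandre function, and it assumes the zones appear as `intel-rapl:*` entries under `/sys/class/powercap`, as in the path from the warning.

```python
from pathlib import Path


def probe_rapl(root="/sys/class/powercap"):
    """Return {zone_name: status} for each intel-rapl:* zone under root.

    status is "ok", "missing" (no energy_uj file), "denied" (file exists
    but cannot be read), or "error: ..." for anything else.
    """
    root = Path(root)
    if not root.is_dir():
        return {}  # powercap is not exposed at all in this environment
    results = {}
    for zone in sorted(root.glob("intel-rapl:*")):
        counter = zone / "energy_uj"
        try:
            int(counter.read_text().strip())  # counter is a plain integer
            results[zone.name] = "ok"
        except FileNotFoundError:
            results[zone.name] = "missing"
        except PermissionError:
            results[zone.name] = "denied"
        except (OSError, ValueError) as exc:
            results[zone.name] = f"error: {exc}"
    return results


for name, status in probe_rapl().items():
    print(f"{name}: {status}")
```

An empty result or a "missing" status would point to the powercap tree not being present in that environment, while "denied" would point to a permission problem. Running the same probe on the host and inside the container shows where the two views diverge.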

eduardogomescampos1 commented 9 months ago

It is indeed a permission issue. When I came back to the office and ran "kubectl logs 'scaphandre pod'", this time I got a warning message stating: "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"
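
The two warnings in this thread carry different OS error codes (2 and 13), which suggests two different failure modes. As a small illustration, the sketch below looks up the symbolic errno name for a code using Python's standard library; `describe_os_error` is a hypothetical helper, and the exact message text depends on the platform's C library.

```python
import errno
import os


def describe_os_error(code):
    """Map a numeric OS error code to its errno name and system message."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Codes taken from the two warnings quoted in this thread.
for code in (2, 13):
    print(code, describe_os_error(code))
```

On Linux, code 2 is ENOENT (the path does not exist from the process's point of view) and code 13 is EACCES (the path exists but access is refused), so the first warning is about a missing file and only the second is about permissions.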

eduardogomescampos1 commented 8 months ago

I think it had something to do with the containerd container runtime. In the project I'm taking part in, we decided to switch from containerd to CRI-O, and the problem was solved afterwards. All nodes report sensible values now.

bpetit commented 4 weeks ago

Hi, it seems related to https://github.com/hubblo-org/scaphandre/pull/391, which was merged into dev a few days ago.

If anyone wants to give it a try with a containerd runtime, that would be interesting.

> Now I've tried to run the dev image locally and there is a warning: "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 2, kind: NotFound, message: "No such file or directory" }". However, as I stated before, I have run chmod -R 777 on this folder and disabled the firewall. What could be causing this?

This would be related to an intel-rapl module issue, not Scaphandre itself.

> It is indeed a permission issue. When I came back to the office and ran "kubectl logs 'scaphandre pod'", this time I got a warning message stating: "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"

This is probably related to #391.