Open eduardogomescampos1 opened 9 months ago
Which docker tag are you using and what is the value of the metric scaph_self_version?
I am using the dev tag and got version 0.5. My metrics for scaph_process_power_consumption_microwatts are fine.
All nodes return 0.5 for this metric. I have installed the Helm chart from the dev branch, using the dev tag as well. Something I also noted is that whenever I run the quick docker version (as in https://hubblo-org.github.io/scaphandre-documentation/tutorials/installation-linux) I also get 0W reported on one of the malfunctioning nodes.

I feel like this has something to do with the container not being allowed to access the proper files, even though I have disabled all firewalls and used chmod 777 on both /sys/class/powercap and /proc (for testing purposes). I'm wondering why only one node is able to get the measurements correctly.
Docker quick version output
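The command behind that output is roughly the one from the linked tutorial (the exact image tag and flags may differ from my invocation):

```sh
# run the stdout exporter for ~15s, bind-mounting the host powercap and proc trees
# (paraphrased from the linked installation tutorial; tag/flags may differ)
docker run --rm -ti \
  -v /sys/class/powercap:/sys/class/powercap \
  -v /proc:/proc \
  hubblo/scaphandre:dev stdout -t 15
```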
Now I've tried to run the dev image locally and there is a warning:
"scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 2, kind: NotFound, message: "No such file or directory" }"
However, as I have stated before, I have used the chmod -R 777 command on this folder and disabled the firewall. What could be causing this?
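For reference, this is roughly how I checked the files on the host (intel-rapl:0 is just the first domain; more may exist on a given machine):

```sh
# list the powercap domains and their permissions on the host
ls -l /sys/class/powercap/
ls -l /sys/class/powercap/intel-rapl:0/energy_uj
# try reading the counter directly (usually needs root)
sudo cat /sys/class/powercap/intel-rapl:0/energy_uj
```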
It is indeed a permission issue. As I came back to the office and typed "kubectl logs 'scaphandre pod'", this time I got a warning message stating: "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"
I think it had something to do with the containerd container runtime. In the project I'm taking part in, we decided to switch from containerd to CRI-O, and the problem was solved afterwards. All nodes report sensible values now.
Hi, it seems related to https://github.com/hubblo-org/scaphandre/pull/391, which was merged into dev a few days ago.
If anyone wants to give it a try with a containerd runtime, that would be interesting.
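A quick way to check whether the pod itself can read the counters could be something like the following (pod and namespace names are placeholders, and this assumes the image ships basic coreutils):

```sh
# confirm the powercap tree is visible and readable from inside the scaphandre pod
kubectl -n <namespace> exec <scaphandre-pod> -- ls -l /sys/class/powercap/
kubectl -n <namespace> exec <scaphandre-pod> -- cat /sys/class/powercap/intel-rapl:0/energy_uj
```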
> Now I've tried to run the dev image locally and there is a warning "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 2, kind: NotFound, message: "No such file or directory" }" However, as I have stated before, I have used the chmod -R 777 command on this folder and disabled the firewall. What could be causing this?
This would be related to an intel-rapl module issue, not scaphandre itself.
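If that's the case, checking whether the RAPL kernel modules are loaded on the affected host might confirm it (module names vary by kernel version):

```sh
# check whether the RAPL kernel modules are loaded on the affected node
lsmod | grep -i rapl
# load them if missing (older kernels use intel_rapl instead of intel_rapl_common)
sudo modprobe intel_rapl_common
sudo modprobe intel_rapl_msr
```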
> It is indeed a permission issue. As I came back to the office and typed "kubectl logs 'scaphandre pod'", this time I got a warning message stating: "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"
This is probably related to #391.
Bug description
First of all, I would like to thank the whole Scaphandre team for a tool like this. It has been extremely helpful so far!

The bug consists of some nodes from my local k8s cluster reporting 0W of consumption. To illustrate the issue, there is a screenshot from the official Scaphandre Grafana dashboard in the screenshots section. Each color represents a node and, as you can see, 3 of them report 0W. The most intriguing part is that if I run Scaphandre locally on those machines, I do get actual values. There is also a screenshot of the logs from a local execution of Scaphandre on one of the nodes that reports 0W in the k8s version. As you can see, Scaphandre is able to obtain those metrics locally, but the pods from the k8s cluster cannot.

Running "kubectl logs 'scaphandre pod'" has been of no help since it just returns:
"Scaphandre prometheus exporter
Sending ⚡ metrics
Press CTRL-C to stop scaphandre"
Describing the pods does not return anything worth mentioning either. It is relevant to note that the firewall is disabled on all cluster machines. Could you give any insights on solving this, please?
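For completeness, these are roughly the inspection commands I ran (namespace and pod name are placeholders for my setup):

```sh
# see which node each scaphandre pod landed on
kubectl -n <namespace> get pods -o wide
# logs and events for a pod on a node that reports 0W
kubectl -n <namespace> logs <scaphandre-pod>
kubectl -n <namespace> describe pod <scaphandre-pod>
```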
To Reproduce
Expected behavior
The Grafana dashboard should report the same values as obtained from the local execution, rather than 0W.
Screenshots
Environment
Additional context
One interesting aspect is that all of the malfunctioning machines have been formatted quite recently, so I'm guessing there might be a misconfiguration somewhere.