CherifMZ opened 9 months ago
I am using a scrape interval of 5 minutes and a scrape timeout of 2 minutes. What is the reason to set such a short interval? The risk with such a short interval is that scaphandre uses all the CPU if you have a lot of pods.
I'm using machine learning, so I need up-to-date data.
Hi, do you have any logs from scaphandre on the nodes from around the time of the restarts?
I have successfully installed Scaphandre on my Kubernetes cluster using the documentation provided here. The installation command enables the ServiceMonitor and sets the scrape interval to 2 seconds:
```sh
helm install scaphandre helm/scaphandre --set serviceMonitor.enabled=true --set serviceMonitor.interval=2s
```
Additionally, I have set up Prometheus and adjusted its configuration to a global scraping interval of 2 seconds with a timeout of 1 second.
My objective is to monitor the energy usage metric of each node, for which I created a Python script executed in a Jupyter Notebook. The script queries Prometheus for the 'scaph_host_energy_microjoules' metric in a loop.
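A minimal sketch of such a query loop, assuming the Prometheus HTTP API is queried with the `requests` library (the endpoint URL, the 2-second sleep, and the three-node count are illustrative, not the actual notebook code):

```python
import time
import requests

# Assumed in-cluster Prometheus endpoint; adjust to your own setup.
PROMETHEUS_URL = "http://prometheus-server:9090"

while True:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "scaph_host_energy_microjoules"},
        timeout=5,
    )
    result = resp.json()["data"]["result"]
    # One sample per node is expected; indexing a sample that is missing
    # (because a target was not scraped) raises
    # "IndexError: list index out of range".
    for i in range(3):
        print(result[i]["metric"]["instance"], result[i]["value"][1])
    time.sleep(2)
```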
After running the script for approximately 40 minutes, an 'IndexError: list index out of range' occurs. This seems to indicate that Prometheus is unable to scrape metrics from all three nodes consistently. It appears that the Scaphandre pod responsible for gathering node metrics periodically goes down and then restarts, causing intermittent interruptions (sometimes, for about a second, only two of the three pods respond).
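For illustration, a defensive variant under the same assumptions: checking how many samples the query returned before indexing avoids the IndexError when one of the three scaphandre targets was not scraped in a given cycle.

```python
import requests

def read_host_energy(prometheus_url: str, expected_nodes: int = 3) -> dict:
    """Return {instance: microjoules} for the nodes that answered this cycle."""
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": "scaph_host_energy_microjoules"},
        timeout=5,
    )
    samples = resp.json()["data"]["result"]
    if len(samples) < expected_nodes:
        # Fewer samples than nodes: at least one scaphandre target was not
        # scraped in the last cycle, so don't index the list blindly.
        print(f"only {len(samples)}/{expected_nodes} nodes returned a sample")
    return {s["metric"]["instance"]: float(s["value"][1]) for s in samples}
```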
Additional details:
I suspect that the problem might be related to the scrape interval. Your insights and suggestions on resolving this issue would be greatly appreciated. Thank you in advance for your assistance.