hubblo-org / scaphandre

⚡ Energy consumption metrology agent. Let "scaph" dive and bring back the metrics that will help you make your systems and applications more sustainable!
Apache License 2.0

'IndexError: list index out of range' in Prometheus Scraping of Scaphandre Metrics #355

Open CherifMZ opened 9 months ago

CherifMZ commented 9 months ago

I have successfully installed Scaphandre on my Kubernetes cluster by following the project documentation. The installation command enables the ServiceMonitor and sets the scrape interval to 2 seconds:

helm install scaphandre helm/scaphandre --set serviceMonitor.enabled=true --set serviceMonitor.interval=2s

Additionally, I have set up Prometheus and adjusted its configuration to a global scraping interval of 2 seconds with a timeout of 1 second.
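
In case it helps to double-check that the 2s/1s settings are actually live, the running configuration can be read back from the Prometheus HTTP API; a minimal sketch, assuming Prometheus is reachable on localhost:9090 as in the script below:

import requests

prometheus = 'http://localhost:9090'

# /api/v1/status/config returns the configuration Prometheus is currently
# running with, including the global scrape_interval and scrape_timeout
response = requests.get(prometheus + '/api/v1/status/config')
print(response.json()['data']['yaml'])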

My objective is to monitor the energy usage metric of each node, for which I created a Python script executed in a Jupyter Notebook. The script queries Prometheus for the 'scaph_host_energy_microjoules' metric in a loop:

import requests
import time

prometheus = 'http://localhost:9090'  # no trailing slash, so the path below does not produce '//'

action = 0  # index of the node's series to read from the query result

while True:
    energy_query = 'scaph_host_energy_microjoules'
    response_energy = requests.get(prometheus + '/api/v1/query', params={'query': energy_query})
    result_energy = response_energy.json().get('data', {}).get('result', [])

    # The error occurs here after some runtime, when the result list holds
    # fewer series than expected
    energy_usage = float(result_energy[action]['value'][1])

    time.sleep(5)

After running the script for approximately 40 minutes, an 'IndexError: list index out of range' occurs. This suggests that Prometheus is not consistently scraping metrics from all three nodes. The Scaphandre pod responsible for gathering a node's metrics seems to periodically go down and restart, causing intermittent gaps (sometimes, for about a second, only two of the three pods respond).
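
A variant of the loop that keys the samples by their 'instance' label instead of a fixed list position would avoid the crash and show which node is missing; a minimal sketch, assuming a three-node cluster and the same endpoint and metric as above:

import requests
import time

prometheus = 'http://localhost:9090'

while True:
    response = requests.get(prometheus + '/api/v1/query',
                            params={'query': 'scaph_host_energy_microjoules'})
    results = response.json().get('data', {}).get('result', [])

    # The result list shrinks whenever a target misses a scrape, so index it
    # by the 'instance' label rather than by position
    energy_by_node = {
        series['metric'].get('instance', 'unknown'): float(series['value'][1])
        for series in results
    }
    if len(energy_by_node) < 3:
        print(f'only {len(energy_by_node)} of 3 nodes returned samples')

    time.sleep(5)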

I suspect that the problem might be related to the scrape interval. Your insights and suggestions on resolving this issue would be greatly appreciated. Thank you in advance for your assistance.

Additional details:

uname -a
Linux my_pc 6.1.0-1029-oem #29-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan  9 21:07:34 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/cpuinfo
model       : 186
model name  : 13th Gen Intel(R) Core(TM) i5-1335U

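To check the pod-restart theory, the scrape history of each target can be pulled from Prometheus, since the built-in 'up' metric is 1 when the last scrape of a target succeeded and 0 when it failed. A sketch, assuming the Scaphandre targets carry job="scaphandre" (the exact label depends on the ServiceMonitor):

import requests
import time

prometheus = 'http://localhost:9090'

# query_range returns the history of 'up' per instance over the last hour;
# zeros show exactly when each scaphandre pod stopped answering scrapes
params = {
    'query': 'up{job="scaphandre"}',  # the job label is an assumption
    'start': time.time() - 3600,
    'end': time.time(),
    'step': '2s',
}
response = requests.get(prometheus + '/api/v1/query_range', params=params)
for series in response.json()['data']['result']:
    failed = sum(1 for _, v in series['values'] if v == '0')
    print(series['metric'].get('instance'), f'{failed} failed scrapes in the last hour')
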
mmadoo commented 9 months ago

I am using a scrape interval of 5 minutes and a scrape timeout of 2 minutes. What is the reason for setting such a short interval? The risk with that timeout is that Scaphandre uses all the CPU if you have a lot of pods.

CherifMZ commented 9 months ago

> I am using a scrape interval of 5 minutes and a scrape timeout of 2 minutes. What is the reason for setting such a short interval? The risk with that timeout is that Scaphandre uses all the CPU if you have a lot of pods.

I'm using machine learning, so I need up-to-date data.
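
If fresh data is the constraint, one compromise is to let Prometheus derive power from the energy counter with rate(), so the raw scrape interval can stay moderate; a sketch, assuming scaph_host_energy_microjoules is a counter (microjoules per second then equals microwatts):

import requests

prometheus = 'http://localhost:9090'

# rate() over a counter of microjoules yields microjoules per second,
# i.e. average power in microwatts over the 1m window
query = 'rate(scaph_host_energy_microjoules[1m])'

response = requests.get(prometheus + '/api/v1/query', params={'query': query})
for series in response.json().get('data', {}).get('result', []):
    node = series['metric'].get('instance', 'unknown')
    watts = float(series['value'][1]) / 1e6
    print(f'{node}: {watts:.2f} W')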

bpetit commented 1 month ago

Hi, do you have any logs from scaphandre on the nodes, close to the restarts?