siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

[Grafana / Prometheus] pod having cpu throttling but it never hit the limit on Grafana #9770

Closed rubber-ant closed 16 hours ago

rubber-ant commented 1 day ago

Bug Report

pod having cpu throttling but it never hit the limit on Grafana

Description

On the talosctl dashboard (monitor tab), the PID for this pod shows CPU% = 100%, meaning it is fully using its allocated CPU, and that is what I expect to see in Grafana. However, Grafana shows only 40–60% CPU utilization and never reaches 100%.

I noticed a pod with a CPU limit of 1 where computations take 5 seconds. When I remove the CPU limit or raise it to 8, the computation time drops to 0.2–0.5 seconds.

Grafana shows usage of 0.4–0.6 CPU against a limit of 1.
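A plausible explanation for the gap (my assumption, not confirmed in this thread): Prometheus/Grafana plot `rate()` over a range window, which averages the counter increase across the whole window, so a short burst at 100% of one core reads as a much lower average. A minimal sketch of that effect (the `cpu_rate` helper is illustrative, mimicking `rate(container_cpu_usage_seconds_total[window])`):

```python
# Sketch (assumption): why a short 100% CPU burst shows up as a lower
# average in Grafana. rate() spreads the counter increase over the whole
# window, so a 5 s burst at 1.0 core in a 10 s window reads as 0.5 cores.

def cpu_rate(usage_seconds_total, window):
    """Average cores over the last `window` one-second samples of a
    cumulative CPU-seconds counter (illustrative helper)."""
    return (usage_seconds_total[-1] - usage_seconds_total[-1 - window]) / window

# Cumulative CPU seconds, sampled once per second:
# the pod burns a full core for 5 s, then idles for 5 s.
counter = [0.0]
for second in range(10):
    counter.append(counter[-1] + (1.0 if second < 5 else 0.0))

peak = max(b - a for a, b in zip(counter, counter[1:]))  # instantaneous peak
avg = cpu_rate(counter, 10)                              # window average
print(peak, avg)  # → 1.0 0.5
```

The talosctl dashboard samples closer to instantaneously, so it catches the 100% peak that the averaged Grafana panel flattens out.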

(screenshot: Grafana CPU usage panel)

I tried setting Prometheus with:

      prometheus:
        prometheusSpec:
          scrapeInterval: 1s
          evaluationInterval: 1s
I also set Min interval to 1s on the CPU chart in Grafana, without any luck.
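One likely reason a 1s scrape interval doesn't help (my assumption, not verified against this cluster): kubelet/cAdvisor refreshes cgroup stats on its own housekeeping interval (around 10s by default), so scraping every second just re-reads a cached value. A small simulation of that staircase effect:

```python
# Sketch (assumed ~10 s cAdvisor housekeeping): scraping a cached counter
# every 1 s yields mostly-zero short-window rates plus large jumps at each
# cache refresh, not true 1 s resolution.

HOUSEKEEPING = 10  # assumed seconds between cAdvisor cache refreshes

# True cumulative CPU seconds: a steady 1.0 core for 30 s.
true_counter = [float(t) for t in range(31)]

# What a 1 s scrape actually sees: the value frozen since the last refresh.
scraped = [true_counter[t - t % HOUSEKEEPING] for t in range(31)]

one_sec_rates = [b - a for a, b in zip(scraped, scraped[1:])]
print(one_sec_rates[:12])  # zeros with a 10.0 jump at each refresh boundary
```

So the effective resolution is bounded by the stats refresh interval, no matter how aggressively Prometheus scrapes.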

Environment

smira commented 1 day ago

Is this issue for Talos Linux or Grafana?

rubber-ant commented 1 day ago

The computation takes 3–4 seconds, but for some reason Grafana/Prometheus is not able to show the same thing as the talosctl dashboard.

smira commented 22 hours ago

So I don't quite see what you would like us to fix on the Talos side. Do you think the dashboard shows CPU usage wrong?

smira commented 22 hours ago

Also, please keep in mind that CPU usage is split into at least user/sys time; the talosctl dashboard shows the aggregate of both.
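To illustrate the aggregation point: on Linux, a process's user and system CPU time live in `/proc/<pid>/stat` as the `utime` (field 14) and `stime` (field 15) counters, and an aggregate CPU% sums both (whether talosctl reads exactly these fields is my assumption). A minimal parser over a hypothetical stat line:

```python
# Sketch (assumed data source): aggregate user + system CPU ticks from a
# /proc/<pid>/stat line, as an "aggregate of both" CPU% would use.

def user_plus_sys_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces and is wrapped in
    # parentheses, so split on the closing paren first.
    rest = stat_line.rsplit(')', 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # overall fields 14 and 15
    return utime + stime

# Hypothetical stat line: utime=700 ticks, stime=300 ticks.
sample = "1234 (my app) S 1 1234 1234 0 -1 4194560 500 0 0 0 700 300 0 0 20 0 1 0 100 0 0"
print(user_plus_sys_ticks(sample))  # → 1000
```

A Grafana panel built on a user-mode-only series would undercount relative to such an aggregate.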

rubber-ant commented 21 hours ago

I initially thought this was a kernel bug in Talos, but I no longer believe that is the case.

Anyway, what kernel does tag v1.8.2 use?

smira commented 20 hours ago

You can check yourself with kubectl get nodes -o wide ;)

I don't think it's a bug anywhere, but you need to understand metrics and the way they are reported and presented a bit more to reach the correct conclusion.