daberlin closed 9 months ago
IMHO it still makes sense to relate the load average to the number of cores. Load is the number of running tasks plus uninterruptible tasks, such as those waiting on disk reads. Since an 8-core system can run more tasks in parallel than a 1-core system, the maximum load with no tasks waiting is higher on the 8-core system. As for the TASK_UNINTERRUPTIBLE number, I agree: it does not depend on the number of cores at all. However, a load over 100% while the CPU is idle is still useful information, and if there are lots of disk reads waiting to happen, for example, the graph will quickly exceed 100% despite the number being divided by the CPU count derived from node_cpu_seconds_total.
IMHO there is no better way to display this if you want a number that can be compared between nodes with different numbers of CPUs. Just my 2 cents though.
Agreed - dividing it by the number of cores makes sense... but not multiplying it by 100 and putting a percent sign behind it - at least IMHO.
Hi,
Thanks for your suggestions, they make sense.
Could you please check whether this query is close to what you have in mind? In any case, it would be nice to have a better way to count the CPU cores:
scalar(avg_over_time(node_load5{instance="$node",job="$job"}[$__rate_interval])) / count(node_cpu_seconds_total{instance="$node",job="$job",mode="system"})
Looks good
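On the wish for a better way to count the CPU cores: one commonly used idiom is to group node_cpu_seconds_total by its cpu label and count the groups, so the result does not depend on picking one particular mode. This is only a sketch, assuming the same $node and $job dashboard variables as the query above and a node_exporter version that exposes the cpu label:

```promql
# Per-core load: 5m load average divided by the number of distinct cpu labels.
# Assumption: $node and $job are the dashboard variables used elsewhere in this thread.
node_load5{instance="$node",job="$job"}
  / scalar(count(count by (cpu) (node_cpu_seconds_total{instance="$node",job="$job"})))
```

A result of 1 would then mean roughly one runnable or uninterruptible task per core, without multiplying by 100 or adding a percent sign.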
Hi. The two gauges "Sys Load (5m avg)" and "Sys Load (15m avg)" in the first row use a calculation where the load value is multiplied by 100 and divided by node_cpu_seconds_total. This does not make sense, as the load is not directly related to the CPU; it is a dimensionless value indicating how many processes are running or queued waiting for resources (not only CPU cycles).
IMHO stat tiles using "node_load1{}[1h]" etc. make more sense...
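A range selector such as node_load1{}[1h] cannot be displayed in a stat tile as-is, because a panel query has to return an instant vector or scalar. One possible reading of the suggestion, assuming the same $node and $job dashboard variables used elsewhere in this thread and that an average over the hour is what is wanted:

```promql
# Average of the 1m load over the last hour, shown as a plain dimensionless number.
# Assumption: averaging is the intended reduction; max_over_time would show the peak instead.
avg_over_time(node_load1{instance="$node",job="$job"}[1h])
```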