rivosinc / prometheus-slurm-exporter

Export select slurm metrics to prometheus
Apache License 2.0

CPU utilization more than 100% #55

Closed. KasperSkytte closed this issue 4 months ago.

KasperSkytte commented 5 months ago

Hi. I'm using your template Grafana dashboard and the exporter (version 3264648b54ba9e2c626da1277482517ff1202982) and observing an average CPU utilization higher than 100%. Should this even be possible, and how should I interpret it? The cluster has ~1500 CPUs across 8 beefy nodes, if that's relevant.

[Screenshot from 2024-01-30 14-50-48: Grafana panel showing average CPU utilization above 100%]

abhinavDhulipala commented 5 months ago

Hi, thanks for filing an issue again. Are you using the JSON parser or the CLI fallback? Also, this panel is based on CPU load, which Slurm pulls straight from Linux. If your CPUs are hyperthreaded, you could potentially see a CPU load of up to 2x: for example, a hyperthreaded 1-core machine could show a load slightly greater than 200% for that single CPU. It could also mean that all your machines are chronically oversubscribed. If you run top on some of your nodes, or run node_exporter on them, the exporter's cpu_load should map directly to what you see there. For us, it isn't uncommon to stress our CPUs to 110% of the number of cores available at times. Still, 150% does seem pretty high. Are you getting any error logs? Are the exporter's error counters still zeroed?
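If you do have node_exporter scraped on those nodes, a rough per-node sanity check would be something like the following (just a sketch using the standard node_exporter metrics `node_load1` and `node_cpu_seconds_total`; it isn't part of this exporter or the dashboard):

```promql
# 1-minute load average divided by the number of logical CPUs Linux exposes
# (threads, not physical cores), per node.
node_load1
  / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
```

Values near or above 1 mean the node really is running more runnable threads than it has logical CPUs, independent of anything Slurm reports.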

KasperSkytte commented 5 months ago

I'm using the CLI fallback, and the template Grafana dashboard you've made (thank you!). It would make sense if hyperthreading were to blame, but in that case, since pretty much all x86 CPUs today have it, is 200% the theoretical max, or is it more complicated than that? There is no Slurm oversubscription configured currently. I occasionally get some `job fetch error: signal: killed` and `node fetch error: signal: killed` errors, but none around the time it peaked over 150% on the graph above.

abhinavDhulipala commented 5 months ago

So `signal: killed` means the timeout was triggered while executing the command, so our runners had to shut it down and it returned no data; it's unrelated. I'm assuming the query you are running is `avg(slurm_partition_cpu_load / slurm_partition_total_cpus)`. We collect `slurm_partition_total_cpus` by summing the number of CPUs in a partition. This comes from the Slurm config and is typically set by the administrator (at least until the most recent versions of Slurm). Now, this query is actually a bit misleading, because nodes can be in multiple partitions, so certain nodes can bias the result. A better label for this graph would be something like "Average Utilization per Partition", but that's beside the point. On the other hand, `slurm_partition_cpu_load` comes straight from Linux. As such, I'd check top vs Slurm's allocatable CPUs to see if there is a discrepancy. To make a more useful graph I will change the panel to `slurm_cpu_load / slurm_cpu_total`.
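For reference, the two expressions side by side (metric names as above; treat this as a sketch of the panel queries rather than the exact dashboard JSON):

```promql
# Current panel: per-partition average. Nodes that sit in several partitions
# are counted once per partition, which can bias the average.
avg(slurm_partition_cpu_load / slurm_partition_total_cpus)

# Cluster-wide alternative: one load value over one CPU total, so
# multi-partition nodes are not double counted.
slurm_cpu_load / slurm_cpu_total
```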

TL;DR: I think the problem is that you have fewer allocatable CPUs per node than top says, and either your jobs aren't cgroup-limited or you've configured oversubscription such that jobs can consume more CPUs than they have allocated. This matters because otherwise Slurm wouldn't have scheduled more work than it knows can fit, and you end up seeing a load that doesn't make sense. Let me know if this helps.
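One way to check that discrepancy from Prometheus itself, assuming node_exporter is scraped on every compute node (a sketch, cluster-wide rather than per node):

```promql
# Slurm's total allocatable CPUs divided by the logical CPU count Linux
# reports via node_exporter, summed over all nodes. A value noticeably
# below 1 means Slurm sees fewer CPUs than the OS exposes, so load
# measured against Slurm's total can exceed 100%.
slurm_cpu_total
  / scalar(sum(count by (instance) (node_cpu_seconds_total{mode="idle"})))
```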

abhinavDhulipala commented 4 months ago

Marking this stale and assuming my diagnosis above is correct. Please reopen if you're still experiencing problems

KasperSkytte commented 4 months ago

Sorry, I forgot about this. Thanks for the explanation; I think you're right that threads are likely to blame. That said, I do have a 1:1 ratio of logical CPUs to allocatable CPUs, and no nodes span multiple partitions. No oversubscription either, but the nodes are VMs, though again 1:1 with the number of logical CPUs.