rivosinc / prometheus-slurm-exporter

Export select slurm metrics to prometheus
Apache License 2.0

User Metrics has wrong value #60

Closed liu-shaobo closed 5 months ago

liu-shaobo commented 6 months ago
image

Hi, does "slurm_user_cpu_alloc" refer to the number of cores? My cluster doesn't have that many cores.

abhinavDhulipala commented 6 months ago

Hi! Thanks again for filing an issue. What mode are you running the exporter in? --cli-fallback or json?

abhinavDhulipala commented 6 months ago

User CPU alloc is calculated from the number of CPUs Slurm has allocated for that user. If you've configured your cluster in a way that allows Slurm to oversubscribe the number of cores per machine, it will show more allocated CPUs than physical cores available in the cluster. The logic can be found here

abhinavDhulipala commented 6 months ago

Closing due to inactivity

liu-shaobo commented 5 months ago

I'm running with the --cli-fallback parameter and not using oversubscribe. It seems that pending and running jobs are added together.

abhinavDhulipala commented 5 months ago

I see, I don't think I have enough information. Do you mind doing the following:

abhinavDhulipala commented 4 months ago

I think this has been solved after adding state to slurm_user_cpu_alloc. The problem here was that we were summing CPUs across all job states, so the value could end up above the cluster total.
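For anyone landing here later, a rough illustration of the difference described above. This is only a sketch in Python, not the exporter's actual code: the job records, field names, and both helper functions are hypothetical. It shows why summing a user's CPUs over every job state can exceed the cluster size, and how keying the totals by state keeps running and pending jobs separate.

```python
from collections import defaultdict

# Hypothetical job records; the field names are illustrative only
# and do not reflect the exporter's real schema.
jobs = [
    {"user": "alice", "state": "RUNNING", "cpus": 16},
    {"user": "alice", "state": "PENDING", "cpus": 64},
    {"user": "bob", "state": "RUNNING", "cpus": 8},
]


def cpus_per_user(jobs):
    """Sum CPUs per user across all job states (the behavior described as the problem)."""
    totals = defaultdict(int)
    for job in jobs:
        totals[job["user"]] += job["cpus"]
    return dict(totals)


def cpus_per_user_and_state(jobs):
    """Sum CPUs per (user, state) pair, so pending jobs no longer inflate the running total."""
    totals = defaultdict(int)
    for job in jobs:
        totals[(job["user"], job["state"])] += job["cpus"]
    return dict(totals)


if __name__ == "__main__":
    print(cpus_per_user(jobs))
    print(cpus_per_user_and_state(jobs))
```

With the sample data, the first function reports 80 CPUs for alice even though only 16 belong to a running job; the second reports the two states separately, so the running figure can be compared against the cluster's core count.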