ubccr / slurm-exporter

Slurm Exporter for Prometheus
GNU General Public License v3.0

Add per-partition metrics and fix how node states are pulled #13

Closed treydock closed 2 years ago

treydock commented 2 years ago

Fixes #11

Also shuffled some variable names so that the GPU metrics carry a gpu prefix and it is clear which metrics are averages and which are medians.

The per-partition metrics need a lot of initialization code so that every metric reports 0 for every partition (that is not filtered), rather than sometimes having no value and being expired by Prometheus.
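
To illustrate the zero-initialization idea, here is a minimal standard-library sketch. It is not the code from this PR and does not use the Prometheus client library; `partitionMetrics`, `initPartitionMetrics`, the partition names, and the ignore set are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// partitionMetrics is a hypothetical stand-in for the per-partition gauges.
type partitionMetrics struct {
	JobsRunning float64
	JobsPending float64
}

// initPartitionMetrics creates a zeroed entry for every partition that is
// not filtered out, so a partition with no activity still reports 0
// instead of having no series at all.
func initPartitionMetrics(partitions []string, ignore map[string]bool) map[string]*partitionMetrics {
	out := make(map[string]*partitionMetrics, len(partitions))
	for _, p := range partitions {
		if ignore[p] {
			continue // filtered partitions get no metrics
		}
		out[p] = &partitionMetrics{}
	}
	return out
}

func main() {
	// Assumed sample data: three partitions, one of them filtered.
	metrics := initPartitionMetrics(
		[]string{"general", "debug", "gpu"},
		map[string]bool{"debug": true},
	)

	// Only "general" has activity in this sample; "gpu" stays at 0.
	metrics["general"].JobsRunning = 12

	names := make([]string, 0, len(metrics))
	for name := range metrics {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		m := metrics[name]
		fmt.Printf("%s running=%v pending=%v\n", name, m.JobsRunning, m.JobsPending)
	}
}
```

In a real exporter the same pattern would apply to each labelled gauge: touch every expected label combination once with a value of 0 before filling in the observed values.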

The metrics for waiting/pending jobs also carry the pending reason as a label, which we find very useful when looking back at which pending reasons are accumulating on the clusters. The median and average are computed per partition, per reason.
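
The per-partition, per-reason grouping can be sketched roughly as follows. This is an illustrative standard-library example, not the exporter's implementation; `pendingJob`, `waitKey`, `waitStats`, `summarizeWaits`, and the sample wait times are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// pendingJob is a hypothetical record for one pending job.
type pendingJob struct {
	Partition   string
	Reason      string
	WaitSeconds float64
}

// waitKey groups jobs by partition and pending reason.
type waitKey struct {
	Partition string
	Reason    string
}

// waitStats holds the two aggregates kept per group.
type waitStats struct {
	Average float64
	Median  float64
}

// median returns the middle value of a copy of values (0 for empty input).
func median(values []float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	n := len(sorted)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}

// summarizeWaits computes average and median wait per (partition, reason).
func summarizeWaits(jobs []pendingJob) map[waitKey]waitStats {
	grouped := make(map[waitKey][]float64)
	for _, j := range jobs {
		k := waitKey{Partition: j.Partition, Reason: j.Reason}
		grouped[k] = append(grouped[k], j.WaitSeconds)
	}
	out := make(map[waitKey]waitStats, len(grouped))
	for k, waits := range grouped {
		sum := 0.0
		for _, w := range waits {
			sum += w
		}
		out[k] = waitStats{Average: sum / float64(len(waits)), Median: median(waits)}
	}
	return out
}

func main() {
	// Assumed sample data; the wait times are made up.
	jobs := []pendingJob{
		{"general", "Priority", 60},
		{"general", "Priority", 120},
		{"general", "Priority", 600},
		{"general", "Resources", 30},
		{"gpu", "Resources", 300},
	}
	for k, s := range summarizeWaits(jobs) {
		fmt.Printf("partition=%s reason=%s avg=%v median=%v\n", k.Partition, k.Reason, s.Average, s.Median)
	}
}
```

Keeping both aggregates per group shows why the naming matters: for the `general`/`Priority` sample a single long-waiting job pulls the average well above the median.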