Also shuffled some variable names so that GPU metrics carry a `gpu` prefix and so that it is clear which metrics are averages and which are medians.
The per-partition metrics come with a lot of initialization code, so that every metric reports 0 for all partitions (that are not filtered out) rather than sometimes having no values at all and being expired by Prometheus.
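
As a rough illustration of the zero-initialization idea (not the code in this PR), here is a minimal Python sketch. The function names, the metric name `gpu_jobs_running`, and the partition names are all hypothetical, and plain dictionaries stand in for real Prometheus gauges.

```python
def init_partition_metrics(partitions, metric_names, excluded=()):
    """Build {metric: {partition: 0.0}} for every partition that is not filtered out.

    All names here are hypothetical; dicts stand in for Prometheus gauges.
    """
    kept = [p for p in partitions if p not in excluded]
    return {name: {p: 0.0 for p in kept} for name in metric_names}


def apply_observations(metrics, observations):
    """Overwrite the zero defaults with observed (metric, partition, value) triples."""
    for name, partition, value in observations:
        # Observations for filtered-out partitions or unknown metrics are ignored.
        if partition in metrics.get(name, {}):
            metrics[name][partition] = value
    return metrics


metrics = init_partition_metrics(
    partitions=["cpu", "gpu", "debug"],
    metric_names=["gpu_jobs_running"],
    excluded={"debug"},
)
apply_observations(metrics, [("gpu_jobs_running", "gpu", 4.0)])
```

After this runs, the `cpu` partition still has an explicit 0 for the metric instead of no sample at all, and the filtered `debug` partition is absent.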
The metrics for waiting/pending jobs also carry the pending reason as a label, which we find very useful when looking back at which kinds of pending reasons are accumulating on the clusters. The median/average is computed per partition and per reason.
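
To make the per-partition, per-reason grouping concrete, here is a small Python sketch. It is only an illustration under assumptions: the function `wait_stats`, the tuple layout, and the sample wait times (taken to be seconds) are hypothetical and not taken from the exporter code.

```python
from collections import defaultdict
from statistics import mean, median


def wait_stats(pending_jobs):
    """Group wait times by (partition, reason) and compute average and median.

    pending_jobs is an iterable of (partition, reason, wait_seconds) tuples;
    this layout is an assumption made for the example.
    """
    groups = defaultdict(list)
    for partition, reason, wait_seconds in pending_jobs:
        groups[(partition, reason)].append(wait_seconds)
    return {
        key: {"average": mean(waits), "median": median(waits)}
        for key, waits in groups.items()
    }


jobs = [
    ("gpu", "Priority", 10),
    ("gpu", "Priority", 20),
    ("gpu", "Priority", 90),
    ("gpu", "Resources", 30),
    ("cpu", "Priority", 5),
]
stats = wait_stats(jobs)
```

Each `(partition, reason)` key would map to one labeled time series, so a single long-waiting job skews the average for its own group only, while the median stays close to the typical wait.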
Fixes #11