vpenso / prometheus-slurm-exporter

Prometheus exporter for performance metrics from Slurm.
GNU General Public License v3.0

Drain reason from sinfo #33

Closed: msf1t closed this issue 4 years ago

msf1t commented 4 years ago

This exporter is fantastic, and we're hoping to get a bit more out of it. I've been looking at the code for node status, and I'd really like to track our drain reasons. I think this would help us spot trends.

Where you gather the sinfo output (https://github.com/vpenso/prometheus-slurm-exporter/blob/master/nodes.go#L113), could you add %E to the format string and grab the reason? What else would need to change to print it out properly?

Thanks

mtds commented 4 years ago

Unless you have an extremely determined group of sysadmins who consistently put a proper string into the 'reason' field, you may end up with something like the following (extracted from one of our clusters):

50,idle,none
6,reserved,none
1,drained*,TTS#202010[...]
2,draining,Kill task failed
1,draining,NHC: Watchdog timer unable to terminate hung NHC process 4312.
11,down*,reboot timed out
1,down*,TTS#202010[...]
1,down*,HW problem: node offline
[...]

IMHO, adding the reason will increase the size of the timeseries without giving you much benefit. Unless you let Slurm do the job (e.g. Kill task failed), you risk ending up with a fragmented view of the status. Plus, I am very skeptical you'll be able to spot trends from a bunch of strings. This could be a job better suited to accounting/reporting than to a live dashboard (but in that case you will not need an exporter, just the sinfo/sreport utilities from Slurm itself).
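For reference, output of the shape shown above (node count, state, reason, comma-separated) could be parsed roughly as follows. This is only a sketch, not code from the exporter: the `NodeGroup` type and the `ParseSinfoReasons` function are hypothetical names, and it assumes the reason is the last field on each line, so any commas inside the reason text are kept.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// NodeGroup is a hypothetical record for one line of sinfo output
// of the assumed form "count,state,reason".
type NodeGroup struct {
	Count  int
	State  string
	Reason string
}

// ParseSinfoReasons parses newline-separated "count,state,reason" lines.
// The reason is free text, so the line is split into at most three
// fields and everything after the second comma is treated as the reason.
func ParseSinfoReasons(output string) ([]NodeGroup, error) {
	var groups []NodeGroup
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue // skip blank lines, e.g. the trailing newline
		}
		parts := strings.SplitN(line, ",", 3)
		if len(parts) != 3 {
			return nil, fmt.Errorf("unexpected sinfo line: %q", line)
		}
		count, err := strconv.Atoi(parts[0])
		if err != nil {
			return nil, fmt.Errorf("bad node count in %q: %w", line, err)
		}
		groups = append(groups, NodeGroup{Count: count, State: parts[1], Reason: parts[2]})
	}
	return groups, nil
}

func main() {
	// Sample lines taken from the output quoted in this thread.
	sample := "50,idle,none\n2,draining,Kill task failed\n11,down*,reboot timed out\n"
	groups, err := ParseSinfoReasons(sample)
	if err != nil {
		panic(err)
	}
	for _, g := range groups {
		fmt.Printf("count=%d state=%s reason=%q\n", g.Count, g.State, g.Reason)
	}
}
```

Exposing `Reason` directly as a label would create one timeseries per distinct string, which is the cardinality concern raised above.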

philmod-h commented 2 months ago

I understand your point about not cluttering the metric with random strings, but in our case it would be very useful to tell the difference between a node drained by Slurm and a node drained by an operator, as we do not want to alert on the latter.
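One way to get this without exporting free-form strings might be to map each reason onto a small, fixed set of categories and use the category as the label. The sketch below is an assumption about how that could look, not something the exporter does: `ClassifyReason` and `slurmReasonPrefixes` are hypothetical names, and the prefix list contains only the one Slurm-generated reason mentioned in this thread, so a site would have to maintain its own list.

```go
package main

import (
	"fmt"
	"strings"
)

// slurmReasonPrefixes is an assumed, site-maintained list of reason
// prefixes that Slurm sets on its own. Only "Kill task failed" comes
// from this thread; extend it for your cluster.
var slurmReasonPrefixes = []string{
	"Kill task failed",
}

// ClassifyReason maps a free-form reason string to one of three fixed
// categories, so the label cardinality stays bounded:
//   - "none":  no reason set
//   - "slurm": reason matches a known Slurm-generated prefix
//   - "other": anything else (operator- or script-supplied text)
func ClassifyReason(reason string) string {
	r := strings.TrimSpace(reason)
	if r == "" || strings.EqualFold(r, "none") {
		return "none"
	}
	for _, prefix := range slurmReasonPrefixes {
		if strings.HasPrefix(r, prefix) {
			return "slurm"
		}
	}
	return "other"
}

func main() {
	// Reasons taken from the sinfo output quoted earlier in this thread.
	for _, r := range []string{"none", "Kill task failed", "HW problem: node offline"} {
		fmt.Printf("%q -> %s\n", r, ClassifyReason(r))
	}
}
```

An alert rule could then fire only on the "slurm" category and ignore nodes drained with an operator-supplied reason.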