rivosinc / prometheus-slurm-exporter

Export select slurm metrics to prometheus
Apache License 2.0

[sdiag] collect slurm daemon stats #29

Closed abhinavDhulipala closed 10 months ago

abhinavDhulipala commented 11 months ago

Collect a baseline set of RPC stats from sdiag for now, so admins can track load and the effects of config changes.

resolves #28

abhinavDhulipala commented 11 months ago

I'll update the example dashboard as well

EDIT: Dashboard improved

abhinavDhulipala commented 10 months ago

We don't need to report the sdiag avg time. We can simply query the total_time and count counters:

(rate(slurm_rpc_user_total_time{instance="$instance"}[$__rate_interval]) / on(user) rate(slurm_rpc_user_count{instance="$instance"}[$__rate_interval])) > 100

This query is pretty much just as performant as reporting avg_time, so I'm removing the metric.

EDIT: slurm_rpc_msg_type displays better with the sdiag-reported avg, which is not the case for the user-reported totals, so we are keeping the avg time metric for RPC type.
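
For readers unfamiliar with the rate-over-rate pattern in the per-user query above, here is a minimal Python sketch of the arithmetic it performs. The function names and sample values are made up for illustration and are not part of the exporter. It also ignores counter resets and the extrapolation that Prometheus `rate()` applies, so treat it as an approximation of the idea rather than of PromQL itself.

```python
from typing import Optional, Tuple


def rate(start: float, end: float, window_seconds: float) -> float:
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (end - start) / window_seconds


def avg_time_per_rpc(
    total_time: Tuple[float, float],
    count: Tuple[float, float],
    window_seconds: float,
) -> Optional[float]:
    """Average time per RPC over the window: rate(total_time) / rate(count).

    Each tuple holds (value at window start, value at window end) for one user.
    Returns None when no new RPCs were counted in the window.
    """
    count_rate = rate(count[0], count[1], window_seconds)
    if count_rate <= 0:
        return None
    return rate(total_time[0], total_time[1], window_seconds) / count_rate


# Hypothetical samples for one user, taken 10 seconds apart.
avg = avg_time_per_rpc(total_time=(1000.0, 4000.0), count=(10.0, 30.0), window_seconds=10.0)

# Mirrors the "> 100" filter at the end of the query.
is_slow = avg is not None and avg > 100
```

The window length cancels out in the division, so the result is simply the increase in total time divided by the increase in call count. That is why the two counters are enough to recover a per-interval average without exporting a separate avg metric.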

abhinavDhulipala commented 10 months ago

Example visualization (in the published dashboard)

image