Closed by abhinavDhulipala 10 months ago
I'll update the example dashboard as well
EDIT: Dashboard improved
We don't need to report the sdiag average time. We can compute it from the total_time and count counters instead:
(rate(slurm_rpc_user_total_time{instance="$instance"}[$__rate_interval]) / on(user) rate(slurm_rpc_user_count{instance="$instance"}[$__rate_interval])) > 100
This query is about as performant as reporting avg_time, so I'm removing the metric.
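To make the query's arithmetic concrete, here is a rough Python sketch of what it computes for one window. It assumes two scrapes of the per-user counters; the `avg_rpc_time_by_user` helper and the sample values are hypothetical and not part of the exporter. Dividing the two `rate()` results cancels the per-second normalization, so the average is just the increase in total_time divided by the increase in count, matched per user as `on(user)` does, then filtered by the `> 100` threshold. Unlike PromQL's `rate()`, this sketch does not handle counter resets.

```python
from typing import Dict, Tuple

# Hypothetical per-user counter samples: user -> (total_time, count).
# total_time is in whatever unit the exporter reports.
Sample = Dict[str, Tuple[float, float]]


def avg_rpc_time_by_user(earlier: Sample, later: Sample, threshold: float) -> Dict[str, float]:
    """Approximate rate(total_time) / on(user) rate(count) > threshold.

    Simplified sketch: assumes no counter resets between the two samples.
    """
    result: Dict[str, float] = {}
    # on(user): only users present in both samples are matched.
    for user in earlier.keys() & later.keys():
        t0, c0 = earlier[user]
        t1, c1 = later[user]
        delta_count = c1 - c0
        if delta_count <= 0:
            continue  # idle user (or reset): no meaningful average in this window
        # rate() divides both deltas by the same window length, so it cancels here.
        avg = (t1 - t0) / delta_count
        if avg > threshold:  # mirrors the `> 100` filter in the query
            result[user] = avg
    return result


earlier = {"alice": (1000.0, 10.0), "bob": (500.0, 50.0)}
later = {"alice": (61000.0, 110.0), "bob": (2500.0, 150.0)}
print(avg_rpc_time_by_user(earlier, later, 100))  # {'alice': 600.0}
```

Here alice averages 600 per RPC and passes the threshold, while bob averages 20 and is filtered out.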
EDIT: For slurm_rpc_msg_type, the sdiag-reported average displays better. That is not the case for the per-user totals, so we are keeping the avg time metric for RPC type.
Example visualization (in the published dashboard)
Collect some baseline RPC stats for now so admins can track load and the effects of config changes.
resolves #28