uabrc / devops-docs

https://docs.rc.uab.edu/devops-docs/
Apache License 2.0

What to do when there are reports of Slurm outages or unexpected behavior? #21

Open wwarriner opened 1 year ago

wwarriner commented 1 year ago

Resolve the issue first. Ops generally has a good handle on these; often the fix is simply restarting slurmctld or whichever process is causing the problem.

Tools to investigate causes:

  1. sdiag output for remote procedure call (RPC) statistics
  2. sacct for recent job information
  3. Grafana for node data (https://grafana.ops.rc.uab.edu/grafana/?orgId=1, requires VPN or on-campus)
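
As a rough sketch of how the first two tools could be gathered in one pass, the script below runs a few read-only Slurm commands and collects their output. The one-hour look-back window, the choice of `sacct` fields, the 60-second timeout, the addition of `scontrol ping` as a quick controller check, and the helper names `triage_commands` and `run_triage` are all assumptions for illustration, not an established Ops procedure. Grafana is left out because it is checked in the browser.

```python
import shutil
import subprocess

# Assumed look-back window; sacct accepts relative times such as "now-1hours".
LOOKBACK = "now-1hours"


def triage_commands(lookback=LOOKBACK):
    """Return the read-only commands to run, in order (hypothetical selection)."""
    return [
        # Quick check that the controller responds (an addition, not from the list above).
        ["scontrol", "ping"],
        # Scheduler statistics, including remote procedure call counts.
        ["sdiag"],
        # Recent jobs for all users, one line per allocation.
        [
            "sacct",
            "--allusers",
            "--allocations",
            f"--starttime={lookback}",
            "--format=JobID,User,Partition,State,ExitCode,Start,End,NodeList",
        ],
    ]


def run_triage(commands):
    """Run each command if its binary is on PATH and collect the text output."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            results[key] = None  # command is not installed on this host
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        # Keep stderr when the command fails so the failure itself is visible.
        results[key] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results


if __name__ == "__main__":
    for name, output in run_triage(triage_commands()).items():
        print(f"== {name} ==")
        print(output if output is not None else "(command not found)")
```

Run it from a host where the Slurm client tools are installed. Seeing other users' jobs through `--allusers` may require operator or administrator privileges, depending on how the cluster is configured.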