pcchen / scopion

Scopion cluster

scopion down #7

Open aronton opened 1 year ago

aronton commented 1 year ago

Some nodes (e.g. scopion101, scopion102, ...) have gone into the "down" state and are not running. Maybe we should fix this. I have checked the documentation, and the reason given is "Not responding". I am not sure whether I have permission to fix it, since the issue is related to slurm.conf.

The solution I found in the Slurm documentation:

If the reason is "Not responding", then check communications between the control machine and the DOWN node using the command "ping <address>" being sure to specify the NodeAddr values configured in slurm.conf. If ping fails, then fix the network or addresses in slurm.conf.

https://slurm.schedmd.com/troubleshoot.html
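
For whoever has access to the control machine, here is a rough sketch of how the check quoted above could be run. It is not tested on this cluster. The helper names `node_addr` and `check_node` are made up for illustration, it assumes `scontrol show node` prints a `NodeAddr=...` field, and `ping -c` assumes a Linux-style ping. The node name is just the one from this issue.

```shell
#!/bin/sh
# Sketch only: meant to be run on the Slurm control machine.
# node_addr and check_node are hypothetical helper names, not Slurm commands.

# Print the first NodeAddr value found in `scontrol show node` output on stdin.
node_addr() {
    grep -o 'NodeAddr=[^ ]*' | head -n 1 | cut -d= -f2
}

# Look up a node's NodeAddr and ping it, reporting whether it answers.
check_node() {
    node="$1"
    addr=$(scontrol show node "$node" | node_addr)
    if [ -z "$addr" ]; then
        echo "$node: no NodeAddr found"
        return 1
    fi
    if ping -c 3 "$addr" >/dev/null 2>&1; then
        echo "$node ($addr): ping OK"
    else
        echo "$node ($addr): ping FAILED"
        return 1
    fi
}

# Example calls, commented out so that sourcing this file has no side effects:
# sinfo -R                  # list down/drained nodes together with their reasons
# check_node scopion101     # ping the address Slurm has on record for the node
# Admin only, once the node responds again:
# scontrol update NodeName=scopion101 State=RESUME
```

If the ping fails, the quoted advice applies: fix the network or the addresses in slurm.conf. If it succeeds, the node probably still needs to be returned to service by someone with admin rights, which is what the last commented line is for.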