Some nodes (e.g. scopion101, 102, ...) have gone into the "down" state and are not running. Maybe we should fix this. I have checked the documentation, and the reason given is "Not responding". I am not sure whether I have permission to fix it, since the issue is related to slurm.conf.
The solution I found in the Slurm documentation:
If the reason is "Not responding", then check communications between the control machine and the DOWN node using the command "ping <address>" being sure to specify the NodeAddr values configured in slurm.conf. If ping fails, then fix the network or addresses in slurm.conf.
https://slurm.schedmd.com/troubleshoot.html