lyulyul / shine-cluster

Simple High performance Infrastructure for Neural network Experiments
GNU General Public License v3.0

Check fan speed and temperature #174

Open gqqnbig opened 1 year ago

gqqnbig commented 1 year ago

In our last server room visit, we noticed a server was super loud, and we rebooted it.

The loud noise may have something to do with the drain status of Slurm or the inability to kill a task. Even if not, you don't want to see a server meltdown during your reign.

This is not supposed to be a one-time check, nor a promise like "I assure you I will check it every day, if I remember and I have time."

Try this command:

IPMICFG -sdr

luoyuqi-lab commented 1 year ago

It seems our nodes do not have IPMICFG installed right now; #87 may guide us through the installation.

luoyuqi-lab commented 1 year ago

ipmicfg is in /shine-cluster/supermicro/. Run sudo ./ipmicfg -sdr from that directory to check fan speed and temperature.
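
Since the request above is for a recurring check rather than a one-off, here is a minimal sketch of how the command could be wrapped so that a scheduled job only reports sensors that look wrong. It is not the project's implementation. The full binary path is an assumption pieced together from the comment above, and the parser assumes that `ipmicfg -sdr` prints pipe-separated rows of the form `Status | (#)Sensor | Reading | Low Limit | High Limit |`; check the real output on a node before relying on it. The function names `read_sdr` and `abnormal_sensors` are hypothetical.

```python
import subprocess

# Assumed location, joined from the directory and binary name mentioned above.
IPMICFG = "/shine-cluster/supermicro/ipmicfg"


def read_sdr():
    """Run `ipmicfg -sdr` with sudo and return its raw text output."""
    result = subprocess.run(
        ["sudo", IPMICFG, "-sdr"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def abnormal_sensors(sdr_text):
    """Return (status, sensor, reading) for every row whose status is not OK.

    Assumes pipe-separated rows: Status | (#)Sensor | Reading | Low | High |
    Rows with an empty status cell are skipped (assumed to be absent sensors).
    """
    rows = []
    for line in sdr_text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Skip lines without enough columns, the header row,
        # rows with no status, and the dashed separator row.
        if len(cells) < 3 or cells[0] in ("", "Status") or set(cells[0]) == {"-"}:
            continue
        status, sensor, reading = cells[0], cells[1], cells[2]
        if status.upper() != "OK":
            rows.append((status, sensor, reading))
    return rows
```

Calling `abnormal_sensors(read_sdr())` from a small script scheduled with cron or a systemd timer, and mailing or logging any non-empty result, would turn the manual check into a routine one. Parsing is kept separate from running the command so the parsing logic can be tested against saved output without root access.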