lyulyul / shine-cluster

Simple High performance Infrastructure for Neural network Experiments
GNU General Public License v3.0

Rebooting dagobah leaves its slurmd unable to run. #172

Closed luoyuqi-lab closed 4 months ago

luoyuqi-lab commented 1 year ago

After a reboot, dagobah cannot run its slurmd daemon. I think the reason is that one GPU is lost after the reboot.

image

Only 7 GPUs in the figure.

luoyuqi-lab commented 1 year ago

image

luoyuqi-lab commented 1 year ago

image image image Found zombie processes with: ps axo stat,ppid,pid,comm | grep -w defunct
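A minimal sketch of the same zombie check as a reusable filter. The function name list_zombies is hypothetical, and it assumes the column order produced by the ps command above (STAT, PPID, PID, COMMAND). It keeps the parent PID next to each zombie, because a zombie is only cleared once its parent reaps it or exits.

```shell
#!/bin/sh
# Sketch: read `ps axo stat,ppid,pid,comm` output on stdin and print
# "PPID PID COMMAND" for every process whose STAT column starts with Z
# (zombie / defunct). Only the first word of COMMAND is printed.
list_zombies() {
    awk '$1 ~ /^Z/ { print $2, $3, $4 }'
}
```

Usage on a node would be: ps axo stat,ppid,pid,comm | list_zombies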

gqqnbig commented 1 year ago

Is it fixed now?

luoyuqi-lab commented 1 year ago

image image

Again, we lost 1 GPU, which then stopped slurmd from running, on both dagobah and hoth. I rebooted the machines with sudo shutdown -r on dagobah and sudo reboot on hoth.

There must be some unknown bug in GPU control (maybe in the software, maybe in the physical machines). I have no idea what it is yet.

But I confirm that a physical reboot can make these two nodes become normal.

luoyuqi-lab commented 1 year ago

The lost GPU problem occurred again. On tatooine only 3 GPUs are visible, but Slurm is still alive; after applying the bug-gpu, the other GPUs can be used normally. Running lspci | grep -i nvidia on hoth shows a device with rev ff, which means it is down, so the problem must be in GPU 1c. image Tatooine seems normal, so why is 1 GPU lost? image Dagobah's problem may be in GPU 1a. image

These may be clues to why every remote reboot loses 1 GPU.
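A minimal sketch of the lspci check above as a filter, so it can be run on each node after a reboot. The function name find_dead_gpus is hypothetical, and it relies on the observation in this comment that a down GPU shows up as rev ff.

```shell
#!/bin/sh
# Sketch: read `lspci` output on stdin and print the PCI address (first
# column) of every NVIDIA device reported with "(rev ff)".
# Non-NVIDIA devices are ignored even if they also report rev ff.
find_dead_gpus() {
    awk 'tolower($0) ~ /nvidia/ && /\(rev ff\)/ { print $1 }'
}
```

Usage on a node would be: lspci | find_dead_gpus. Empty output means no NVIDIA device is reporting rev ff.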

luoyuqi-lab commented 1 year ago

Whenever I apply for a dagobah or hoth GPU, I am always given the bad one. So now I'm temporarily applying for the bad GPU to make sure other users can apply for a good card to run their tasks. Temporarily!

luoyuqi-lab commented 1 year ago

This looks very similar: one GPU is lost. The error also seems to be the same. I am going to check the nvidia log in detail. image

luoyuqi-lab commented 1 year ago

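On dagobah, the faulty card is GPU 0000:1a. I found something suspicious: May 26 19:01:41 dagobah kernel: NVRM: GPU 0000:1a:00.0: GPU has fallen off the bus. image

Some people say this is caused by the power supply, overheating, a broken PCIe slot, the BIOS, or something similar. GPUs dropping off the bus seems to be a very common problem.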
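A minimal sketch for pulling the affected PCI addresses out of the kernel log, assuming the message format matches the NVRM line quoted above. The function name fallen_gpus is hypothetical.

```shell
#!/bin/sh
# Sketch: read kernel log lines on stdin and print each distinct PCI
# address that appears in an NVRM "GPU has fallen off the bus" message.
fallen_gpus() {
    sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): GPU has fallen off the bus.*/\1/p' | sort -u
}
```

Usage on a node would be: journalctl -k | fallen_gpus, or dmesg | fallen_gpus. Running it on dagobah, hoth, and tatooine would show whether the same slot is involved each time.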

luoyuqi-lab commented 1 year ago

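After a physical reboot (remote shutdown -h now, then pressing the power button in the server room), all GPUs came back and none were lost: 8 each on hoth and dagobah, 4 on tatooine. But the reboot itself ran into problems. On dagobah all fans were spinning and the machine did not power off. On hoth, as shown in figure no. 1, the GPU persistence mode daemon could not be stopped and the fans were not spinning, so I powered it off with the button in the server room. tatooine rebooted normally. hoth and dagobah definitely have underlying problems, either in the system or in the hardware. image image image image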

luoyuqi-lab commented 10 months ago

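I saw the same problem again. I looked through nvidia-bug-report.log.gz but have not found any clues so far. For tatooine, I plan to ask Lambda customer support for help.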

gqqnbig commented 4 months ago

But I confirm that a physical reboot can make these two nodes become normal.

Rather than a physical reboot, try https://manpages.ubuntu.com/manpages/trusty/man8/nvram-wakeup.8.html

gqqnbig commented 4 months ago

The so-called 'physical reboot' is the same thing as a cold reboot.

Never underestimate a concept learned in your teens!

lyulyul commented 4 months ago

dagobah frequently has problems with draining and with dropping a GPU. There are two speculated causes: one is that the GPU that keeps dropping is itself faulty; the other may be chassis intrusion. Instead of going to the server room to force a shutdown, you can control the power remotely. Install IPMIView on the client, and you can power dagobah off and back on remotely to recover it temporarily. IPMIView can also show the server's status, temperature, and power remotely, with visualization. It is a very good tool. See the following URL: https://github.com/luoyuqi-lab/shine-cluster/wiki/Power-Cycle-Server
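For a command-line alternative to the IPMIView GUI, here is a minimal sketch using ipmitool instead. This is a swapped-in tool, not what the wiki page describes. It assumes ipmitool is installed on the client, the node's BMC is reachable over the network, and the BMC password is exported as IPMI_PASSWORD (which ipmitool reads with -E). The function name bmc_power and the host and user names in the usage line are hypothetical. By default it only prints the command.

```shell
#!/bin/sh
# Sketch: build an ipmitool chassis power command for a node's BMC.
# Arguments: BMC host, BMC user, action (status | off | on | cycle).
# With DRY_RUN=1 (the default) the command is printed, not executed.
bmc_power() {
    bmc_host=$1
    bmc_user=$2
    action=$3
    case "$action" in
        status|off|on|cycle) ;;
        *) echo "unsupported action: $action" >&2; return 2 ;;
    esac
    # Reuse the positional parameters to hold the command line.
    set -- ipmitool -I lanplus -H "$bmc_host" -U "$bmc_user" -E chassis power "$action"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Usage: bmc_power dagobah-bmc.example admin status prints the command; set DRY_RUN=0 to actually run it. Note that the cycle action cuts power without a clean OS shutdown, so it is closer to pressing the button than to sudo reboot. Whether it restores a lost GPU the same way is not confirmed in this thread.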

lyulyul commented 4 months ago

Dagobah is like a weathered but fighting old dog; we need to take more care of his health. image

lyulyul commented 4 months ago

Please check the wiki page for how to 'physically reboot' a failed machine: https://github.com/luoyuqi-lab/shine-cluster/wiki/Power-Cycle-Server

With all due respect, I recommend closing this old, stubborn issue. If you have separate concerns, please create a new issue.