luoyuqi-lab closed this issue 9 months ago
To find zombie processes, run: ps axo stat,ppid,pid,comm | grep -w defunct
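A small sketch of the same check that also prints each zombie's parent PID, since the parent is the process that has not reaped it. The function name list_zombies is made up for illustration. It only assumes the ps column order used above (stat, ppid, pid, comm).

```shell
# List zombie processes from `ps axo stat,ppid,pid,comm` output on stdin.
# A zombie has a STAT field that starts with "Z"; PPID is the parent that
# has not yet reaped it. The header line is skipped because "STAT" does
# not start with "Z".
list_zombies() {
    awk '$1 ~ /^Z/ { printf "zombie pid=%s ppid=%s comm=%s\n", $3, $2, $4 }'
}

# Usage on a live system:
#   ps axo stat,ppid,pid,comm | list_zombies
```

Matching on the STAT column rather than the word "defunct" avoids false hits on a command that happens to be named defunct.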
Is it fixed now?
I ran sudo shutdown -r on dagobah and sudo reboot on hoth. The lost-GPU problem occurred again. On tatooine only 3 GPUs are visible, but Slurm is still alive; after applying the bug-gpu, the other GPUs can be used normally.
Running lspci | grep -i nvidia on hoth shows rev ff for one device, which means it is down, so there must be a problem with GPU 1c.
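A rough sketch for picking out the affected devices automatically. The function name find_rev_ff is hypothetical. It assumes the usual lspci line layout, where the first field is the PCI address and the revision appears as "(rev xx)" at the end. As far as I know, rev ff usually shows up when the device no longer answers config-space reads.

```shell
# Print the PCI address of every NVIDIA device that reports "(rev ff)".
# Reads `lspci` output on stdin; the first field of each line is the address.
find_rev_ff() {
    grep -i nvidia | awk '/\(rev ff\)/ { print $1 }'
}

# Usage on a live system:
#   lspci | find_rev_ff
```

On hoth this should print the 1c device if the observation above holds.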
Tatooine seems normal, so why is 1 GPU lost?
Dagobah's problem may be in GPU 1a.
These may be clues to why every remote reboot loses 1 GPU.
Whenever I apply for a dagobah or hoth GPU, I am always given the bad one. So now I'm temporarily applying for the bad GPU to make sure other users can apply for a good card to run their tasks. Temporarily!
It looks very similar: one GPU is lost. The error also seems to be the same. I am going to check the nvidia log in detail.
On dagobah, the faulty GPU is the one at 0000:1a. I found something suspicious: May 26 19:01:41 dagobah kernel: NVRM: GPU 0000:1a:00.0: GPU has fallen off the bus.
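To check the other nodes for the same message, something like the sketch below could help. The function name fallen_gpus is made up. It assumes the kernel log lines have the same "NVRM: GPU <address>: GPU has fallen off the bus" shape as the line quoted above.

```shell
# Extract the PCI addresses of GPUs that the NVIDIA driver reported as
# "fallen off the bus". Reads kernel log lines on stdin and prints each
# address once.
fallen_gpus() {
    grep 'GPU has fallen off the bus' \
        | sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): GPU has fallen off the bus.*/\1/p' \
        | sort -u
}

# Usage on a live system (either source of kernel messages works):
#   journalctl -k | fallen_gpus
#   dmesg | fallen_gpus
```

For the dagobah line above it prints 0000:1a:00.0.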
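Some people say this is caused by the power supply, overheating, a broken PCIe slot, the BIOS, or something similar. Dropping a GPU seems to be a very common problem.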
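After a physical reboot (remote shutdown -h now, then pressing the power button in the server room), all GPUs recovered and none were lost: 8 each on hoth and dagobah, 4 on tatooine. But the reboot itself ran into problems. On dagobah all the fans kept spinning and the machine did not power off. On hoth, as shown in figure no. 1, the GPU persistence mode daemon could not be stopped and the fans were not spinning, so I powered it off with the button in the server room. tatooine rebooted normally. hoth and dagobah definitely have an underlying problem, either in the system or in the hardware.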
I saw the same problem. I checked nvidia-bug-report.log.gz but have not found any clues so far. For tatooine, I plan to ask Lambda customer support for help.
But I can confirm that a physical reboot returns these two nodes to normal.
Rather than a physical reboot, try https://manpages.ubuntu.com/manpages/trusty/man8/nvram-wakeup.8.html
The so-called 'physical reboot' is the same thing as a cold reboot.
Never underestimate a concept learned in your teens!
dagobah frequently has problems with draining and with dropping a GPU. There are two speculated causes: one is that the GPU that keeps dropping is itself faulty, the other is chassis intrusion. Instead of going to the server room to force a shutdown, you can control the power remotely. Install IPMIView on the client, and dagobah can be powered off and back on remotely to recover it temporarily. IPMIView can also show the server's status, temperature, and power graphically; it is a very good tool. See the following URL: https://github.com/luoyuqi-lab/shine-cluster/wiki/Power-Cycle-Server
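For anyone who prefers a terminal over the IPMIView GUI, the same kind of remote power control can usually be done with ipmitool. This is a different tool from the one described above, and the sketch is untested on these nodes. The wrapper name bmc, the BMC hostname dagobah-bmc, and the BMC_USER variable are all hypothetical placeholders. It assumes the BMC is reachable over the network and that the password is supplied through the IPMI_PASSWORD environment variable, which ipmitool reads when given -E.

```shell
# Build (and optionally run) an ipmitool command against a node's BMC.
# The first argument is the BMC hostname (hypothetical); the rest is the
# ipmitool subcommand. BMC_USER must be set by the caller. ipmitool reads
# the password from the IPMI_PASSWORD environment variable because of -E.
# Set DRY_RUN=1 to print the command instead of running it.
bmc() {
    host=$1
    shift
    set -- ipmitool -I lanplus -H "$host" -U "${BMC_USER:?set BMC_USER first}" -E "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage (hostname is a placeholder):
#   bmc dagobah-bmc chassis power status      # is the node powered on?
#   bmc dagobah-bmc sdr type Temperature      # temperature sensor readings
#   bmc dagobah-bmc chassis power cycle       # cold reboot without a trip to the server room
```

Running with DRY_RUN=1 first is a cheap way to confirm the command line before actually cutting power to a node.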
Dagobah is like a weathered but fighting old dog, we need to take more care of his health.
Please check the wiki page on how to 'physically reboot' a failed machine: https://github.com/luoyuqi-lab/shine-cluster/wiki/Power-Cycle-Server
With all due respect, I recommend closing this old, stubborn issue. If you have separate concerns, please create a new issue.
After a reboot, dagobah cannot run its slurmd daemon. I think the reason is that one GPU is lost after the reboot.
Only 7 GPUs in the figure.
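A quick check that could be run after each reboot, before starting slurmd, to see whether every GPU is visible. The function name check_gpu_count is made up. It assumes nvidia-smi -L prints one "GPU N: ..." line per device. The expected count (8 for dagobah and hoth, 4 for tatooine) comes from the earlier comments.

```shell
# Compare the number of GPUs listed by `nvidia-smi -L` (read from stdin,
# one "GPU N: ..." line per device) against the expected count for the node.
# Returns non-zero when the counts differ.
check_gpu_count() {
    expected=$1
    found=$(grep -c '^GPU [0-9]' || true)
    if [ "$found" -eq "$expected" ]; then
        echo "ok: $found/$expected GPUs visible"
    else
        echo "missing: $found/$expected GPUs visible"
        return 1
    fi
}

# Usage on a live system:
#   nvidia-smi -L | check_gpu_count 8
```

If this reports 7/8 on dagobah, a cold reboot is probably needed before slurmd will come up with all GPUs.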