wi11Iam5-mpu opened 1 year ago
Is this question related to #166 ?
> I am empty-handed... Now, I am at my wit's end.
I like these expressions though 😋
> Is this question related to #166 ?
Maybe, but the only thing we can confirm is that some experiment code causes trouble (perhaps in GPU memory, virtual memory, ...), and then killing the task fails. Slurm wants to let us know there are bugs, so it puts the node into the drain state.
It looks like users keep blaming the drain status of Slurm.
If users were allowed to use the GPUs directly, without going through Slurm, could they run their jobs smoothly?
Many processes are in the 'D' state, and I cannot even kill them manually. Is this user code the cause of the drain? I also noticed that SWP (swap) is full; what does that mean? See #166.
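One quick thing to check for the "SWP is full" observation: swap usage can be read straight from `free`. A minimal sketch (the 90% warning threshold here is an arbitrary choice of mine, not anything Slurm uses):

```shell
# Print swap usage as a percentage and warn when it is nearly full.
# Parses `free -b`: on the "Swap:" line, field 2 is total bytes, field 3 is used.
free -b | awk '/^Swap:/ {
    pct = ($2 > 0) ? int(100 * $3 / $2) : 0
    if (pct > 90)
        print "WARNING: swap is " pct "% full"
    else
        print "swap usage: " pct "%"
}'
```

A box whose swap is pinned near 100% while processes sit in 'D' state is a strong hint the machine is thrashing.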
I referred to this website, and the reason for this process state may be that network or disk I/O problems cause the process to sleep. In this state, the process cannot be killed normally. Given our situation, I suspect it may be related to memory overflow during multi-card training tasks.
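For reference, the stuck 'D'-state processes can be listed together with the kernel function they are blocked in (wchan), which often points at NFS or block I/O. A minimal sketch using standard procps `ps` fields:

```shell
# Show the header plus every process whose state starts with 'D'
# (uninterruptible sleep); wchan names the kernel function it waits in.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```

If wchan shows something like an NFS or I/O wait routine for all the stuck PIDs, that supports the disk/network I/O theory over plain memory overflow.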
Test your code, dude. Stop running buggy code on the cluster!
> I referred to this website, and the reason for this process state may be that network or disk I/O problems cause the process to sleep. In this state, the process cannot be terminated normally. Considering the situation of our problem, I suspect that it may be related to memory overflow during multi-card training tasks.
Another possible reason has to do with the DDP timeout mechanism (noting it here as a memo).
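On that memo: PyTorch has environment variables that make hung NCCL collectives abort instead of leaving all ranks blocked forever after one worker dies. The names below are from older 1.x releases (newer versions renamed them with a `TORCH_NCCL_` prefix), so treat them as assumptions to verify against the installed version:

```shell
# Make NCCL collectives error out on timeout instead of hanging every rank
# when one worker dies mid-allreduce.
export NCCL_ASYNC_ERROR_HANDLING=1
# Alternative: block and surface the timeout passed to init_process_group().
# export NCCL_BLOCKING_WAIT=1
```

With these set, a DDP job that would otherwise leave unkillable stragglers should instead crash cleanly, which Slurm can reap.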
I remember that a conda command can also put a process into the D state.
I think ruling out conda is relatively easy: it is written by professional programmers, so it should have fewer bugs. Then again, maybe conda is what broke the server 😝
Unexpected failure of multi-card (4) model training (user: wangcui)
Hoth entered the 'drain' state
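Before resuming, it is worth capturing the reason Slurm itself recorded for the drain (node name `hoth` as above; guarded so the snippet is a no-op on machines without Slurm):

```shell
# List drained/down nodes together with the Reason Slurm recorded.
if command -v sinfo >/dev/null; then
    sinfo -R
fi
# Full detail for one node, filtered to its state and drain reason.
if command -v scontrol >/dev/null; then
    scontrol show node hoth | grep -iE 'state|reason'
fi
```

A reason like "Kill task failed" would match the theory above that the epilog could not kill the stuck job.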
Resuming Hoth, I found residual GPU memory and 'Unknown Error' reports from the GPUs I had used. I ran
sudo scontrol update nodename=hoth state=resume
and Hoth's status changed to mix. But those GPUs still reported errors as follows:
What I did
To ensure those graphics cards were alive, I checked their PCI status with the command
lspci | grep -i nvidia
Since the output of nvidia-smi does not show any relevant process PID, I tried fuser -v /dev/nvidia* to find other residual processes, and then checked each PID with ll /proc/[pid]/fd.
But I came up empty-handed: these processes have nothing to do with the faulty cards. Now I am at my wit's end. Maybe a reboot will fix it, or maybe it won't. Consider other possibilities if necessary (low probability, I think): driver, power supply, overheating, ...
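For anyone repeating this hunt, the fuser-then-fd check above can be done in one sweep. A sketch (needs root to see other users' processes; harmless no-op on a box without NVIDIA devices):

```shell
# For every PID still holding a /dev/nvidia* file, show its command name
# and which nvidia device files it has open.
for pid in $(fuser /dev/nvidia* 2>/dev/null); do
    echo "== PID $pid ($(cat /proc/"$pid"/comm 2>/dev/null)) =="
    ls -l /proc/"$pid"/fd 2>/dev/null | grep nvidia || true
done
```

If the sweep prints nothing while nvidia-smi still shows residual memory, the allocation is likely held by the driver itself, and a reboot (or driver reload) is about the only remaining lever.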