lyulyul / shine-cluster

Simple High performance Infrastructure for Neural network Experiments
GNU General Public License v3.0

Unexpected failure of user code, causing Hoth to enter 'drain', residual GPU memory, and unknown GPU errors #171

Open wi11Iam5-mpu opened 1 year ago

wi11Iam5-mpu commented 1 year ago

Unexpected failure of a multi-card (4-GPU) model training job (user: wangcui)

[E ProcessGroupNCCL.cpp:737] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=86645, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805320 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=86645, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804609 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=86645, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804483 milliseconds before timing out.
Traceback (most recent call last):
  ... ... 
  File "/home/shared/wangcui/STGTrack/H-TransTrack_test2/models/deformable_detrtrack_train_hybrid_branch.py", line 637, in forward
    torch.distributed.all_reduce(num_boxes)
  File "/home/wangcui/shared/.conda/envs/htranstrack/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 2.  Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=86645, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805320 milliseconds before timing out.
Traceback (most recent call last):
  ... ...
  File "/home/shared/wangcui/STGTrack/H-TransTrack_test2/models/deformable_detrtrack_train_hybrid_branch.py", line 637, in forward
    torch.distributed.all_reduce(num_boxes)
  File "/home/wangcui/shared/.conda/envs/htranstrack/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 3.  Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=86645, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804609 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3763588 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3763590 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3763591 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 3763590 via 15, forcefully exitting via 9
client_loop: send disconnect: Connection reset
[Process exited with code 255 (0x000000ff)]

Hoth enters 'drain' status

wangcui@aha:~$ sinfoF
PARTITION AVAIL STATE TIMELIMIT  GRES   GRES_USED NODELIST  CPUS(A/I/O/T)
... ...
speedy*   up   mix   6:00:00    gpu:8  gpu:8     hoth      18/46/0/64
speedy*   up   mix   6:00:00    gpu:4  gpu:1     tatooine  2/18/0/20
... ...
normal    up   drain 5-00:00:00 gpu:8  gpu:0     dagobah   0/0/64/64
normal    up   mix   5-00:00:00 gpu:8  gpu:8     hoth      18/46/0/64
normal    up   mix   5-00:00:00 gpu:4  gpu:1     tatooine  2/18/0/20

Resuming Hoth: residual GPU memory and unknown GPU errors remain. I used sudo scontrol update nodename=hoth state=resume and Hoth's status changed to mix, but the GPUs then reported the following:

(htranstrack) wangcui@hoth:~/shared/STGTrack/H-TransTrack_test2$ nvidia-smi
Unable to determine the device handle for GPU 0000:1C:00.0: Unknown Error

and

(htranstrack) wangcui@hoth:~/shared/STGTrack/H-TransTrack_test2$ nvidia-smi
Thu May  4 10:34:41 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:1D:00.0 Off |                  Off |
| 33%   36C    P2    57W / 260W |  41202MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:1E:00.0 Off |                  Off |
| 33%   36C    P2    53W / 260W |  43516MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     On   | 00000000:40:00.0 Off |                  Off |
| 33%   37C    P2    59W / 260W |  38966MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
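
As a side note, the "Unable to determine the device handle" message can be reproduced per card by querying NVML directly. Below is a minimal sketch that probes each device handle one by one; it assumes the nvidia-ml-py (pynvml) package is available on the node, and it only identifies which index fails, it does not fix anything.

import pynvml

# Probe every GPU handle individually, so the one card nvidia-smi cannot
# get a handle for (0000:1C:00.0 here) is reported without aborting the
# whole listing. Assumes the nvidia-ml-py (pynvml) package is installed.
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            pci = pynvml.nvmlDeviceGetPciInfo(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            busid = pci.busId.decode() if isinstance(pci.busId, bytes) else pci.busId
            print(f"GPU {i} {busid}: {mem.used / 2**20:.0f} MiB used")
        except pynvml.NVMLError as err:
            # e.g. 'Unknown Error' or 'GPU is lost' for a card that fell off the bus
            print(f"GPU {i}: cannot query ({err})")
finally:
    pynvml.nvmlShutdown()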

What I did: to make sure those cards are still alive, I checked their PCI status with lspci | grep -i nvidia.

1a:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
1a:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
1a:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
1a:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev ff)
1c:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev ff)
1c:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev ff)
1c:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev ff)
1d:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
1d:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
1d:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
1d:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
1e:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
1e:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
1e:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
3e:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
3e:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
3e:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
3e:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
3f:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
3f:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
3f:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
3f:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
40:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
40:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
40:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
40:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
41:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
41:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
41:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)

Since the nvidia-smi output does not show any relevant process PIDs, I tried fuser -v /dev/nvidia* to find other residual processes and then checked each PID with ll /proc/[pid]/fd. But I came up empty-handed: these processes have nothing to do with the faulty cards. Now I am at my wit's end. Maybe a reboot will fix it, or maybe it won't. Other things to rule out if necessary (low probability, I think): driver; power supply; overheating ...

                     USER        PID ACCESS COMMAND
/dev/nvidia0:        wangcui   3528040 F...m python
                     wangcui   3765182 F...m python
                     wangcui   3765183 F...m python
/dev/nvidia4:        wangcui   3527196 F...m python3
                     wangcui   3766184 F...m python3
                     wangcui   3766185 F...m python3
/dev/nvidiactl:      wangcui   3527196 F...m python3
                     wangcui   3528040 F...m python
                     wangcui   3765182 F...m python
                     wangcui   3765183 F...m python
                     wangcui   3766184 F...m python3
                     wangcui   3766185 F...m python3
/dev/nvidia-uvm:     wangcui   3527196 F...m python3
                     wangcui   3528040 F...m python
                     wangcui   3765182 F...m python
                     wangcui   3765183 F...m python
                     wangcui   3766184 F...m python3
                     wangcui   3766185 F...m python3
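
A rough sketch of the cross-check described above: take the PIDs that fuser reports against /dev/nvidia* and list which device nodes each PID actually has open via /proc/<pid>/fd. It assumes fuser (from psmisc) is available and that the /proc entries are readable; running it as the process owner or root may be needed.

import os
import re
import subprocess

def pids_holding_nvidia_devices():
    # fuser prints the device names to stderr and the PIDs to stdout.
    out = subprocess.run("fuser /dev/nvidia* 2>/dev/null", shell=True,
                         capture_output=True, text=True).stdout
    return sorted({int(p) for p in re.findall(r"\d+", out)})

def open_nvidia_fds(pid):
    # List which /dev/nvidia* nodes this PID has open, via /proc/<pid>/fd.
    fd_dir = f"/proc/{pid}/fd"
    links = []
    try:
        fds = os.listdir(fd_dir)
    except (FileNotFoundError, PermissionError):
        return links  # process already exited, or we lack permission
    for fd in fds:
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd closed between listdir and readlink
        if target.startswith("/dev/nvidia"):
            links.append(target)
    return links

if __name__ == "__main__":
    for pid in pids_holding_nvidia_devices():
        print(pid, open_nvidia_fds(pid))
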
wi11Iam5-mpu commented 1 year ago

Is this question related to #166?

gqqnbig commented 1 year ago

I am empty-handed... Now, I am at my wit's end.

I like these expressions though 😋

luoyuqi-lab commented 1 year ago

Is this question related to #166?

Maybe, but the only thing we can confirm is that some experiment code causes trouble (maybe in GPU memory, or virtual memory, or ...), and then killing the task fails. Slurm wants to let us know something is wrong, so it puts the node into the drain state.

gqqnbig commented 1 year ago

It looks like users keep blaming the drain status of Slurm.

If users were allowed to use the GPUs directly, without going through Slurm, could they run their jobs smoothly?

luoyuqi-lab commented 1 year ago

[image] Many processes are in the 'D' state, and I cannot even kill them manually. Could this user code be the cause of the drain? I also notice that SWP is full; what does that mean? See #166.

wi11Iam5-mpu commented 1 year ago

I referred to this website; this process state may be caused by network or disk I/O problems that put the process into an uninterruptible sleep, and in that state the process cannot be terminated normally. Given our situation, I suspect it may be related to memory overflow during the multi-card training tasks.
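
For what it's worth, the 'D' (uninterruptible sleep) processes can be listed directly from /proc. A minimal sketch (generic, not specific to Hoth):

import os

def d_state_processes():
    # Scan /proc for processes whose State line starts with 'D'
    # (uninterruptible sleep, usually waiting on disk or network I/O);
    # these are the ones that cannot be killed from user space.
    hits = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                fields = dict(line.split(":\t", 1) for line in f if ":\t" in line)
        except OSError:
            continue  # process exited while we were scanning
        if fields.get("State", "").startswith("D"):
            hits.append((int(pid), fields.get("Name", "?").strip()))
    return hits

if __name__ == "__main__":
    for pid, name in d_state_processes():
        print(f"{pid:>8}  {name}")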

gqqnbig commented 1 year ago

Test your code, dude. Stop running buggy code on the cluster!

wi11Iam5-mpu commented 1 year ago

#172 needs to be resolved before testing.

wi11Iam5-mpu commented 1 year ago

I referred to this website; this process state may be caused by network or disk I/O problems that put the process into an uninterruptible sleep, and in that state the process cannot be terminated normally. Given our situation, I suspect it may be related to memory overflow during the multi-card training tasks.

Another possible reason has to do with the DDP timeout mechanism (putting it here as a memo).
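
For the memo: the 1800000 ms in the log is the default 30-minute NCCL collective timeout, which can be raised when the process group is created, and asynchronous error handling makes a timed-out collective abort the process instead of leaving it wedged. A minimal sketch follows; the values and launch setup here are illustrative assumptions, not the settings used in the failing job.

import os
from datetime import timedelta

import torch.distributed as dist

# Must be set before init_process_group; tells the NCCL watchdog to tear the
# process down when a collective errors out or times out (the env var name
# may differ across PyTorch versions).
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Assumes the job is launched with torchrun, which provides RANK, WORLD_SIZE,
# MASTER_ADDR and MASTER_PORT. The 2-hour timeout is an illustrative value,
# replacing the default 30 minutes seen in the log (Timeout(ms)=1800000).
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))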

gqqnbig commented 1 year ago

I remember that a certain conda command can also put a process into the D state.

I think the conda case is easier to deal with: it is written by professional programmers, so it should have fewer bugs. Who knows, maybe it is conda that broke the server 😝