Open xu19971109 opened 1 year ago
Hi, for this line, could you check whether this modification helps:
tmpdir = dir_tensor.numpy().tobytes().decode('utf-8').rstrip()
If it does, we'll enhance this part of the code ASAP.
It doesn't work.
The issue for me was that the rank 0 GPU was not returning from the forward pass, and for whatever reason the broadcast wasn't blocking, so the returned tmpdir was a bunch of garbage characters, some of which couldn't be decoded.
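A minimal sketch of the tmpdir round trip and of why a broadcast that never completes yields undecodable bytes. The fixed buffer length and space padding are assumptions based on the mmengine source, not a verbatim copy; the garbage byte 0xb9 is taken from the traceback in this issue:

```python
import numpy as np

# Sketch of the tmpdir encode/broadcast/decode round trip in
# collect_results_cpu (buffer size and space padding are assumptions).
MAX_LEN = 512
tmpdir = "/tmp/dist_test_abc"

buf = np.full(MAX_LEN, 32, dtype=np.uint8)   # 32 == ASCII space padding
encoded = np.frombuffer(tmpdir.encode("utf-8"), dtype=np.uint8)
buf[: len(encoded)] = encoded
# The decode path from the traceback works when the broadcast
# actually delivered the encoded path into the buffer.
assert buf.tobytes().decode("utf-8").rstrip() == tmpdir

# If the broadcast never completes, the receiving rank decodes whatever
# garbage is in its buffer; 0xb9 is not a valid UTF-8 start byte, which
# is exactly the UnicodeDecodeError reported here.
garbage = np.full(MAX_LEN, 0xB9, dtype=np.uint8)
try:
    garbage.tobytes().decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```

So the decode failure is a symptom: no variant of `.decode(...)` on that buffer can succeed if the broadcast never filled it.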
Prerequisite
Environment
PyTorch comes from the official NVIDIA Docker image (nvcr.io/nvidia/pytorch:20.04-py3, torch_version='1.12.0a0+bd13bc6').
environment:
Reproduces the problem - code sample
It comes from the "Training and testing on multiple GPUs and multiple machines" tutorial.
Reproduces the problem - command or script
It showed up during validation, at iteration 900/4630. The start script is:
Reproduces the problem - error message
File "/data1/xuxin/code2/mmengine/mmengine/dist/dist.py", line 981, in collect_results_cpu
    tmpdir = dir_tensor.numpy().tobytes().decode().rstrip()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 0: invalid start byte
ubuntu:5119:5150 [0] NCCL INFO comm 0x7f2480009010 rank 8 nranks 16 cudaDev 0 busId 1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:464] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=90544, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805629 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:464] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=90544, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801084 milliseconds before timing out.
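For what it's worth, the watchdog timeout in this log is the signature of a collective that some rank never joins. A toy single-machine simulation of that shape, with a `threading.Barrier` standing in for the NCCL broadcast (no GPUs or torch involved, purely illustrative):

```python
import threading

# A collective (here a Barrier) only completes when every participant
# arrives. If "rank 0" is stuck in the forward pass and never joins,
# the other rank waits until its timeout fires -- the same pattern as
# the NCCL watchdog timeout above (Timeout(ms)=1800000 in the log).
barrier = threading.Barrier(parties=2)

def worker_rank_1():
    try:
        barrier.wait(timeout=0.5)  # stand-in for the 30-minute NCCL timeout
    except threading.BrokenBarrierError:
        print("rank 1: collective timed out, aborting")

t = threading.Thread(target=worker_rank_1)
t.start()
# "rank 0" never calls barrier.wait(), so rank 1 times out.
t.join()
```

This is why the fix belongs in whatever keeps rank 0 from reaching the broadcast, not in the decode call.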
Additional information