open-mmlab / mmengine

OpenMMLab Foundational Library for Training Deep Learning Models
https://mmengine.readthedocs.io/
Apache License 2.0

'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte in return collect_results_cpu(results, size, tmpdir) #1178

Open xu19971109 opened 1 year ago

xu19971109 commented 1 year ago

Prerequisite

Environment

PyTorch comes from the official NVIDIA Docker image (nvcr.io/nvidia/pytorch:20.04-py3, torch_version='1.12.0a0+bd13bc6').

environment:

OrderedDict([
('sys.platform', 'linux'), 
('Python', '3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]'), 
('CUDA available', True), 
('numpy_random_seed', 2147483648), 
('GPU 0,1,2,3,4,5,6,7', 'NVIDIA GeForce RTX 3090'), 
('CUDA_HOME', '/usr/local/cuda'), 
('NVCC', 'Cuda compilation tools, release 11.6, V11.6.124'), 
('GCC', 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0'), 
('PyTorch', '1.12.0a0+bd13bc6'), 
('PyTorch compiling details', 'PyTorch built with:\n  
     - GCC 9.4\n  
     - C++ Version: 201402\n  
     - Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications\n  
     - Intel(R) MKL-DNN v2.5.2 (Git Hash N/A)\n  
     - OpenMP 201511 (a.k.a. OpenMP 4.5)\n  
     - LAPACK is enabled (usually provided by MKL)\n  
     - NNPACK is enabled\n  
     - CPU capability usage: AVX2\n  
     - CUDA Runtime 11.6\n  
     - NVCC architecture flags: 
         -gencode;arch=compute_52,code=sm_52;
         -gencode;arch=compute_60,code=sm_60;
         -gencode;arch=compute_61,code=sm_61;
         -gencode;arch=compute_70,code=sm_70;
         -gencode;arch=compute_75,code=sm_75;
         -gencode;arch=compute_80,code=sm_80;
         -gencode;arch=compute_86,code=sm_86;
         -gencode;arch=compute_86,code=compute_86\n  
   - CuDNN 8.4\n  
   - Magma 2.5.2\n  
   - Build settings: - BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.4.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), 
   ('TorchVision', '0.13.0a0'), 
   ('OpenCV', '3.4.11'), 
   ('MMEngine', '0.7.3')])

Reproduces the problem - code sample

It comes up when training and testing on multiple GPUs and multiple machines:

tmpdir = dir_tensor.numpy().tobytes().decode().rstrip()
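
For context, the tmpdir is created on rank 0 and shared to the other ranks as a padded byte tensor before that decode. A simplified sketch of the handshake (the buffer size, padding value and helper name here are assumptions, not the exact mmengine code):

import tempfile

import torch
import torch.distributed as dist


def _broadcast_tmpdir(max_len: int = 512) -> str:
    """Rank 0 picks a temp dir and broadcasts its path as raw bytes."""
    # Buffer padded with spaces (0x20) so rstrip() can trim it after decoding.
    dir_tensor = torch.full((max_len,), 32, dtype=torch.uint8)
    if dist.get_rank() == 0:
        tmpdir = tempfile.mkdtemp()
        encoded = torch.tensor(bytearray(tmpdir.encode('utf-8')), dtype=torch.uint8)
        dir_tensor[:len(encoded)] = encoded
    # If this broadcast is skipped, times out, or a rank falls out of sync,
    # the non-zero ranks decode whatever bytes happen to sit in the buffer,
    # which is exactly the UnicodeDecodeError reported here.
    dist.broadcast(dir_tensor, src=0)
    return dir_tensor.numpy().tobytes().decode().rstrip()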

Reproduces the problem - command or script

It showed up during validation at iteration 900/4630. The start script is:

export NCCL_SOCKET_IFNAME=xxxxxxx
export GLOO_SOCKET_IFNAME=xxxxxxx
export NCCL_DEBUG=INFO

export NCCL_IB_DISABLE=1
export WORLD_SIZE=2

export CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7'
python -m torch.distributed.launch \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr='ip_addr' \
    --nproc_per_node=8 \
    --master_port=29700 \
    ./tools/train.py \
    configs/_freespace_/pidnet-l.py \
    --resume \
    --launcher pytorch ${@:3}

Reproduces the problem - error message

File "/data1/xuxin/code2/mmengine/mmengine/dist/dist.py", line 981, in collect_results_cpu tmpdir = dir_tensor.numpy().tobytes().decode().rstrip() UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 0: invalid start byte ubuntu:5119:5150 [0] NCCL INFO comm 0x7f2480009010 rank 8 nranks 16 cudaDev 0 busId 1000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:464] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down. terminate called after throwing an instance of 'std::runtime_error' what(): [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=90544, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805629 milliseconds before timing out. [E ProcessGroupNCCL.cpp:464] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down. terminate called after throwing an instance of 'std::runtime_error' what(): [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=90544, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801084 milliseconds before timing out.

Additional information

  1. I used my own dataset of about 1.5M images and trained with mmsegmentation 1.0.0.
  2. Training on a single machine with multiple GPUs works fine.
  3. Another problem is that the total number of validation iterations stayed the same after switching to training and testing on multiple GPUs and multiple machines, even though GPU utilization looked normal on both machines.

HAOCHENYE commented 1 year ago

Hi, at this line, will this modification help you:

tmpdir = dir_tensor.numpy().tobytes().decode('utf-8').rstrip()

If it does, we'll enhance this part of the code ASAP.
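
As an aside, bytes.decode() already defaults to UTF-8, so the explicit argument alone behaves the same as the original line. A more diagnostic variant (a sketch reusing the dir_tensor from the line above, not the proposed fix) that never raises and shows what bytes actually arrived:

raw = dir_tensor.numpy().tobytes()
# errors='replace' keeps going past undecodable bytes so the garbage is visible.
tmpdir = raw.decode('utf-8', errors='replace').rstrip()
print(f'received {len(raw)} bytes, decoded tmpdir = {tmpdir!r}')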

xu19971109 commented 1 year ago

It doesn't work.

collinmccarthy commented 1 month ago

The issue here for me was that the rank 0 GPU was not returning from the forward pass, and for whatever reason the broadcast wasn't blocking. So the returned tmpdir was a bunch of garbage characters, some of which couldn't be decoded.
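
If that is the failure mode, one way to sidestep the fragile broadcast is to synchronise the ranks first and pass an explicit tmpdir on storage shared by both machines, so no path needs to be broadcast at all. A sketch, assuming the tmpdir argument visible in the traceback bypasses the broadcast the way the mmcv-style implementation does; the path below is a placeholder:

import torch.distributed as dist

from mmengine.dist import collect_results_cpu


def collect_with_fixed_tmpdir(results, size, tmpdir='/path/to/shared/.dist_eval'):
    # Make sure every rank has left the forward pass before collecting;
    # a rank stuck in forward is what produced the garbage broadcast here.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    # With an explicit tmpdir every rank already knows where to write its
    # partial results, so the byte-tensor handshake is not needed.
    return collect_results_cpu(results, size, tmpdir=tmpdir)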