open-mmlab / mmtracking

OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), Video Instance Segmentation (VIS) with a unified framework.
https://mmtracking.readthedocs.io/en/latest/
Apache License 2.0
3.52k stars 591 forks source link

RuntimeError: Broken pipe #321

Closed zpyi closed 2 years ago

zpyi commented 2 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

Reproduction

  1. What command or script did you run?
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh configs/vid/fgfa/fgfa_faster_rcnn_r50_dc5_1x_imagenetvid.py 4
  1. Did you make any modifications on the code or config? Did you understand what you have modified? I haven't change the original code or config.

  2. What dataset did you use and what task did you run? ImageNet VID & DET,video object detection. Environment

  3. Please run python mmtrack/utils/collect_env.py to collect necessary environment information and paste it here. sys.platform: linux Python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0] CUDA available: True GPU 0,1,2,3: GeForce RTX 3090 CUDA_HOME: /usr/local/cuda NVCC: Build cuda_11.1.TC455_06.29069683_0 GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 PyTorch: 1.7.1 PyTorch compiling details: PyTorch built with:

    • GCC 7.3
    • C++ Version: 201402
    • Intel(R) oneAPI Math Kernel Library Version 2021.3-Product Build 20210617 for Intel(R) 64 architecture applications
    • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
    • OpenMP 201511 (a.k.a. OpenMP 4.5)
    • NNPACK is enabled
    • CPU capability usage: AVX2
    • CUDA Runtime 11.0
    • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_37,code=compute_37
    • CuDNN 8.0.5
    • Magma 2.5.2
    • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.2 OpenCV: 4.5.3 MMCV: 1.3.11 MMCV Compiler: GCC 7.3 MMCV CUDA Compiler: 11.0 MMTracking: 0.8.0+bab1abe

  1. You may add addition that may be helpful for locating the problem, such as

    • How you installed PyTorch [e.g., pip, conda, source]

    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback Traceback (most recent call last): File "./tools/train.py", line 175, in main() File "./tools/train.py", line 109, in main init_dist(args.launcher, cfg.dist_params) File "/home/zh/miniconda3/envs/mmtrack/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 20, in init_dist _init_dist_pytorch(backend, kwargs) File "/home/zh/miniconda3/envs/mmtrack/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 34, in _init_dist_pytorch dist.init_process_group(backend=backend, **kwargs) File "/home/zh/miniconda3/envs/mmtrack/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group barrier() File "/home/zh/miniconda3/envs/mmtrack/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier work = _default_pg.barrier() RuntimeError: Broken pipe

A placeholder for trackback.

Bug fix If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

GT9505 commented 2 years ago

Please update the version of CUDA (>= 11.1), you can try the latest 11.3 version.

zpyi commented 2 years ago

Please update the version of CUDA (>= 11.1), you can try the latest 11.3 version.

Thanks for your reply! This problem has been solved by updating the version of CUDA to 11.1 and mmcv.