open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

A bug about Single Node Multi-GPU Cards Training in windows10 #2873

Open SXQ-STUDY opened 1 year ago

SXQ-STUDY commented 1 year ago

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug The following output appeared when I ran single-node multi-GPU training on Windows 10:

$ sh tools/dist_train.sh local_configs\bisenetv2\bisenetv2_fcn_1xb32-amp-160k_cityscapes-512x1024.py 8
NOTE: Redirects are currently not supported in Windows or MacOs.
D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
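The FutureWarning above is only advisory: torchrun sets each worker's rank through the LOCAL_RANK environment variable rather than passing a --local_rank argument. A minimal sketch of the env-based pattern the warning points to (the helper name is illustrative, not part of mmsegmentation):

```python
import os

def get_local_rank() -> int:
    """Read the per-process rank the way torchrun expects.

    torchrun exports LOCAL_RANK for every worker it spawns;
    default to 0 so single-process runs still work.
    """
    return int(os.environ.get("LOCAL_RANK", "0"))
```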

Reproduction

  1. What command or script did you run?

      sh tools/dist_train.sh local_configs\bisenetv2\bisenetv2_fcn_1xb32-amp-160k_cityscapes-512x1024.py 8
  2. Did you make any modifications to the code or config? Did you understand what you modified? No changes were made.
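For anyone reproducing this, a hedged sketch of the equivalent direct torchrun invocation (assuming the usual mmsegmentation layout, where dist_train.sh wraps tools/train.py with --launcher pytorch; flag names follow the PyTorch launcher docs, and forward slashes avoid the shell mangling backslash paths):

```shell
torchrun --nproc_per_node=8 --master_port=29500 \
    tools/train.py \
    local_configs/bisenetv2/bisenetv2_fcn_1xb32-amp-160k_cityscapes-512x1024.py \
    --launcher pytorch
```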

Environment

sys.platform: win32
Python: 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-40GB
CUDA_HOME: D:\Anaconda3\envs\jbnight
NVCC: Cuda compilation tools, release 11.7, V11.7.99
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.34.31942 for x64
GCC: n/a
PyTorch: 1.13.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.14.0
OpenCV: 4.7.0
MMEngine: 0.7.0
MMSegmentation: 1.0.0rc6+478e28a

SXQ-STUDY commented 1 year ago

$ sh tools/dist_train.sh local_configs\bisenetv2\bisenetv2_fcn_1xb32-amp-160k_cityscapes-512x1024.py 8
NOTE: Redirects are currently not supported in Windows or MacOs.
D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************

[E C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:860] [c10d] The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).
Traceback (most recent call last):
  File "D:\Anaconda3\envs\jbnight\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\Anaconda3\envs\jbnight\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py", line 195, in <module>
    main()
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py", line 191, in main
    launch(args)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py", line 176, in launch
    run(args)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\run.py", line 753, in run
    elastic_launch(
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launcher\api.py", line 237, in launch_agent
    result = agent.run()
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 709, in run
    result = self._invoke_run(role)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 844, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 678, in _initialize_workers
    self._rendezvous(worker_group)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 538, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
TimeoutError: The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).
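Not a fix from the thread, but the TimeoutError means the worker processes never reached the rendezvous TCPStore on 127.0.0.1:29500 (torch.distributed's default master port). A quick stdlib sketch to rule out another process already holding that port (the helper name is illustrative):

```python
import socket

def port_is_free(host: str = "127.0.0.1", port: int = 29500) -> bool:
    """Return True if we can bind the given port.

    If this returns False, something else already owns the
    rendezvous port, and the TCPStore connect will stall
    until it hits the 900s timeout seen above.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

Passing port=0 asks the OS for any free ephemeral port, which is a handy sanity check that binding works at all on the machine.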

wujiang0156 commented 1 year ago

I have the same issue. Can anyone help?