open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

A bug about Single Node Multi-GPU Cards Training in windows10 #2873

Open SXQ-STUDY opened 1 year ago

SXQ-STUDY commented 1 year ago

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug The following output appeared when I ran single-node multi-GPU training on Windows 10:

$ sh tools/dist_train.sh local_configs\bisenetv2\bisenetv2_fcn_1xb32-amp-160k_cityscapes-512x1024.py 8
NOTE: Redirects are currently not supported in Windows or MacOs.
D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
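The FutureWarning above is only advisory: torchrun sets each worker's rank through the LOCAL_RANK environment variable rather than passing a --local_rank argument. A minimal sketch of the env-based pattern the warning points to (the helper name is illustrative, not part of mmsegmentation):

```python
import os

def get_local_rank() -> int:
    """Read the per-process rank the way torchrun expects.

    torchrun exports LOCAL_RANK for every worker it spawns;
    default to 0 so single-process runs still work.
    """
    return int(os.environ.get("LOCAL_RANK", "0"))
```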

Reproduction

  1. What command or script did you run?

      sh tools/dist_train.sh local_configs\bisenetv2\bisenetv2_fcn_1xb32-amp-160k_cityscapes-512x1024.py 8
  2. Did you make any modifications to the code or config? Did you understand what you modified? No changes were made.
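For anyone reproducing this, a hedged sketch of the equivalent direct torchrun invocation (assuming the usual mmsegmentation layout, where dist_train.sh wraps tools/train.py with --launcher pytorch; flag names follow the PyTorch launcher docs, and forward slashes avoid the shell mangling backslash paths):

```shell
torchrun --nproc_per_node=8 --master_port=29500 \
    tools/train.py \
    local_configs/bisenetv2/bisenetv2_fcn_1xb32-amp-160k_cityscapes-512x1024.py \
    --launcher pytorch
```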

Environment

sys.platform: win32
Python: 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-40GB
CUDA_HOME: D:\Anaconda3\envs\jbnight
NVCC: Cuda compilation tools, release 11.7, V11.7.99
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.34.31942 for x64
GCC: n/a
PyTorch: 1.13.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.14.0
OpenCV: 4.7.0
MMEngine: 0.7.0
MMSegmentation: 1.0.0rc6+478e28a

SXQ-STUDY commented 1 year ago

$ sh tools/dist_train.sh local_configs\bisenetv2\bisenetv2_fcn_1xb32-amp-160k_cityscapes-512x1024.py 8
NOTE: Redirects are currently not supported in Windows or MacOs.
D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************

[E C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:860] [c10d] The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).
Traceback (most recent call last):
  File "D:\Anaconda3\envs\jbnight\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\Anaconda3\envs\jbnight\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py", line 195, in <module>
    main()
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py", line 191, in main
    launch(args)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launch.py", line 176, in launch
    run(args)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\run.py", line 753, in run
    elastic_launch(
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\launcher\api.py", line 237, in launch_agent
    result = agent.run()
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 709, in run
    result = self._invoke_run(role)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 844, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 678, in _initialize_workers
    self._rendezvous(worker_group)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 538, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "D:\Anaconda3\envs\jbnight\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
TimeoutError: The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).
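Not a fix from the thread, but the TimeoutError means the worker processes never reached the rendezvous TCPStore on 127.0.0.1:29500 (torch.distributed's default master port). A quick stdlib sketch to rule out another process already holding that port (the helper name is illustrative):

```python
import socket

def port_is_free(host: str = "127.0.0.1", port: int = 29500) -> bool:
    """Return True if we can bind the given port.

    If this returns False, something else already owns the
    rendezvous port, and the TCPStore connect will stall
    until it hits the 900s timeout seen above.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

Passing port=0 asks the OS for any free ephemeral port, which is a handy sanity check that binding works at all on the machine.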

wujiang0156 commented 1 year ago

I have the same issue. Can anyone help?