huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate

Distributed training on multi-node of containers failed. #2107

Status: Closed (jimmysue closed this issue 11 months ago)

jimmysue commented 1 year ago

I have two Docker containers acting as two training nodes, and I am using the diffusers/text_to_image example to run multi-node distributed training. The containers' hosts are on the same network, and I map port 9001 from each container to its host.
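Roughly, each container was started along the lines below (the image name and volume mount are placeholders, not my exact command); the only port published to the host is 9001:

# Start a training container and publish the main-process port to the host.
# <training-image> and the volume mount are placeholders.
docker run -d --gpus all \
  -p 9001:9001 \
  -v /path/to/workspace:/workspace \
  --name trainer \
  <training-image> \
  sleep infinity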

I configured the main node as below:

In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 172.30.2.38
What is the port you will use to communicate with the main process? 9001
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]:
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:8
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:

and configured the other one as below:

In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine?
1
What is the IP address of the machine that will host the main process? 172.30.2.38
What is the port you will use to communicate with the main process? 9001
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]:
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:8
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:

The only difference between the two configs is the machine rank.
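For reference, this corresponds to a non-interactive launch roughly like the following on the main node (the training arguments after the script name are placeholders); on the second node only --machine_rank changes to 1:

accelerate launch \
  --multi_gpu \
  --num_machines 2 \
  --num_processes 8 \
  --machine_rank 0 \
  --main_process_ip 172.30.2.38 \
  --main_process_port 9001 \
  train_text_to_image.py <training args>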

When I launch the training script, it gets stuck on the main node and prints the messages below:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.                                                                                                                              
11/01/2023 15:42:14 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl                                                                                                                                                                                                                                  
Num processes: 8
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

11/01/2023 15:42:14 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 8
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

{'prediction_type', 'timestep_spacing', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'sample_max_value', 'thresholding'} was not found in config. Values will be initialized to default values.
11/01/2023 15:42:14 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 8
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

11/01/2023 15:42:14 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 8
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

{'norm_num_groups', 'force_upcast'} was not found in config. Values will be initialized to default values.
{'transformer_layers_per_block', 'attention_type', 'num_class_embeds', 'addition_time_embed_dim', 'reverse_transformer_layers_per_block', 'time_cond_proj_dim', 'time_embedding_type', 'time_embedding_act_fn', 'cross_attention_norm', 'only_cross_attention', 'num_attention_heads', 'mid_block_type', 'projection_class_embeddings_input_dim', 'conv_out_kernel', 'upcast_attention', 'timestep_post_act', 'addition_embed_type_num_heads', 'encoder_hid_dim', 'mid_block_only_cross_attention', 'encoder_hid_dim_type', 'dual_cross_attention', 'conv_in_kernel', 'dropout', 'resnet_out_scale_factor', 'resnet_time_scale_shift', 'time_embedding_dim', 'resnet_skip_time_act', 'class_embeddings_concat', 'addition_embed_type', 'class_embed_type', 'use_linear_projection'} was not found in config. Values will be initialized to default values.
{'transformer_layers_per_block', 'attention_type', 'num_class_embeds', 'addition_time_embed_dim', 'reverse_transformer_layers_per_block', 'time_cond_proj_dim', 'time_embedding_type', 'time_embedding_act_fn', 'cross_attention_norm', 'only_cross_attention', 'num_attention_heads', 'mid_block_type', 'projection_class_embeddings_input_dim', 'conv_out_kernel', 'upcast_attention', 'timestep_post_act', 'addition_embed_type_num_heads', 'encoder_hid_dim', 'mid_block_only_cross_attention', 'encoder_hid_dim_type', 'dual_cross_attention', 'conv_in_kernel', 'dropout', 'resnet_out_scale_factor', 'resnet_time_scale_shift', 'time_embedding_dim', 'resnet_skip_time_act', 'class_embeddings_concat', 'addition_embed_type', 'class_embed_type', 'use_linear_projection'} was not found in config. Values will be initialized to default values.
desktop_sjz:396298:396298 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
desktop_sjz:396298:396298 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
desktop_sjz:396298:396298 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
desktop_sjz:396298:396298 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.18.1+cuda12.1
desktop_sjz:396303:396303 [3] NCCL INFO cudaDriverVersion 12000
desktop_sjz:396300:396300 [1] NCCL INFO cudaDriverVersion 12000
desktop_sjz:396300:396300 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
desktop_sjz:396303:396303 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
desktop_sjz:396300:396300 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
desktop_sjz:396300:396300 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
desktop_sjz:396301:396301 [2] NCCL INFO cudaDriverVersion 12000
desktop_sjz:396303:396303 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
desktop_sjz:396303:396303 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
desktop_sjz:396301:396301 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
desktop_sjz:396301:396301 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
desktop_sjz:396301:396301 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
desktop_sjz:396298:396687 [0] NCCL INFO Failed to open libibverbs.so[.1]
desktop_sjz:396300:396689 [1] NCCL INFO Failed to open libibverbs.so[.1]
desktop_sjz:396298:396687 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
desktop_sjz:396298:396687 [0] NCCL INFO Using network Socket
desktop_sjz:396300:396689 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
desktop_sjz:396300:396689 [1] NCCL INFO Using network Socket
desktop_sjz:396301:396690 [2] NCCL INFO Failed to open libibverbs.so[.1]
desktop_sjz:396301:396690 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
desktop_sjz:396301:396690 [2] NCCL INFO Using network Socket
desktop_sjz:396303:396688 [3] NCCL INFO Failed to open libibverbs.so[.1]
desktop_sjz:396303:396688 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
desktop_sjz:396303:396688 [3] NCCL INFO Using network Socket

The other node fails, and each of its four ranks prints the errors below:

11/01/2023 15:41:41 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl                                                                                                                                                                     
Num processes: 8
Process index: 7
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

11/01/2023 15:41:42 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 8
Process index: 4
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

{'thresholding', 'variance_type', 'clip_sample_range', 'prediction_type', 'dynamic_thresholding_ratio', 'sample_max_value', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
11/01/2023 15:41:42 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 8
Process index: 6
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

11/01/2023 15:41:42 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 8
Process index: 5
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

{'norm_num_groups', 'force_upcast'} was not found in config. Values will be initialized to default values.
{'transformer_layers_per_block', 'class_embed_type', 'mid_block_type', 'encoder_hid_dim', 'projection_class_embeddings_input_dim', 'timestep_post_act', 'addition_embed_type', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'time_embedding_type', 'reverse_transformer_layers_per_block', 'addition_embed_type_num_heads', 'dual_cross_attention', 'attention_type', 'cross_attention_norm', 'mid_block_only_cross_attention', 'dropout', 'upcast_attention', 'num_class_embeds', 'resnet_skip_time_act', 'class_embeddings_concat', 'resnet_time_scale_shift', 'only_cross_attention', 'time_embedding_act_fn', 'use_linear_projection', 'encoder_hid_dim_type', 'conv_in_kernel', 'conv_out_kernel', 'time_embedding_dim', 'time_cond_proj_dim', 'num_attention_heads'} was not found in config. Values will be initialized to default values.
{'transformer_layers_per_block', 'class_embed_type', 'mid_block_type', 'encoder_hid_dim', 'projection_class_embeddings_input_dim', 'timestep_post_act', 'addition_embed_type', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'time_embedding_type', 'reverse_transformer_layers_per_block', 'addition_embed_type_num_heads', 'dual_cross_attention', 'attention_type', 'cross_attention_norm', 'mid_block_only_cross_attention', 'dropout', 'upcast_attention', 'num_class_embeds', 'resnet_skip_time_act', 'class_embeddings_concat', 'resnet_time_scale_shift', 'only_cross_attention', 'time_embedding_act_fn', 'use_linear_projection', 'encoder_hid_dim_type', 'conv_in_kernel', 'conv_out_kernel', 'time_embedding_dim', 'time_cond_proj_dim', 'num_attention_heads'} was not found in config. Values will be initialized to default values.
desktop_sjz:339252:339252 [1] NCCL INFO cudaDriverVersion 12000
desktop_sjz:339251:339251 [0] NCCL INFO cudaDriverVersion 12000
desktop_sjz:339252:339252 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
desktop_sjz:339251:339251 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
desktop_sjz:339252:339252 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
desktop_sjz:339252:339252 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
desktop_sjz:339251:339251 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
desktop_sjz:339251:339251 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
desktop_sjz:339253:339253 [2] NCCL INFO cudaDriverVersion 12000
desktop_sjz:339255:339255 [3] NCCL INFO cudaDriverVersion 12000
desktop_sjz:339253:339253 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
desktop_sjz:339255:339255 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
desktop_sjz:339255:339255 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
desktop_sjz:339253:339253 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
desktop_sjz:339255:339255 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
desktop_sjz:339253:339253 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
desktop_sjz:339252:339679 [1] NCCL INFO Failed to open libibverbs.so[.1]
desktop_sjz:339252:339679 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
desktop_sjz:339252:339679 [1] NCCL INFO Using network Socket
desktop_sjz:339251:339680 [0] NCCL INFO Failed to open libibverbs.so[.1]
desktop_sjz:339251:339680 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
desktop_sjz:339251:339680 [0] NCCL INFO Using network Socket
desktop_sjz:339253:339682 [2] NCCL INFO Failed to open libibverbs.so[.1]
desktop_sjz:339253:339682 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
desktop_sjz:339253:339682 [2] NCCL INFO Using network Socket
desktop_sjz:339255:339681 [3] NCCL INFO Failed to open libibverbs.so[.1]
desktop_sjz:339255:339681 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
desktop_sjz:339255:339681 [3] NCCL INFO Using network Socket
desktop_sjz:339252:339679 [1] NCCL INFO misc/socket.cc:564 -> 2
desktop_sjz:339253:339682 [2] NCCL INFO misc/socket.cc:564 -> 2
desktop_sjz:339255:339681 [3] NCCL INFO misc/socket.cc:564 -> 2
desktop_sjz:339251:339680 [0] NCCL INFO misc/socket.cc:564 -> 2
desktop_sjz:339252:339679 [1] NCCL INFO misc/socket.cc:615 -> 2
desktop_sjz:339253:339682 [2] NCCL INFO misc/socket.cc:615 -> 2
desktop_sjz:339255:339681 [3] NCCL INFO misc/socket.cc:615 -> 2
desktop_sjz:339251:339680 [0] NCCL INFO misc/socket.cc:615 -> 2
desktop_sjz:339255:339681 [3] NCCL INFO bootstrap.cc:270 -> 2
desktop_sjz:339252:339679 [1] NCCL INFO bootstrap.cc:270 -> 2
desktop_sjz:339253:339682 [2] NCCL INFO bootstrap.cc:270 -> 2
desktop_sjz:339251:339680 [0] NCCL INFO bootstrap.cc:270 -> 2
desktop_sjz:339255:339681 [3] NCCL INFO init.cc:1303 -> 2
desktop_sjz:339252:339679 [1] NCCL INFO init.cc:1303 -> 2
desktop_sjz:339251:339680 [0] NCCL INFO init.cc:1303 -> 2
desktop_sjz:339253:339682 [2] NCCL INFO init.cc:1303 -> 2
desktop_sjz:339255:339681 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
desktop_sjz:339252:339679 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
desktop_sjz:339251:339680 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
desktop_sjz:339253:339682 [2] NCCL INFO group.cc:64 -> 2 [Async thread]
desktop_sjz:339251:339251 [0] NCCL INFO group.cc:422 -> 2
desktop_sjz:339252:339252 [1] NCCL INFO group.cc:422 -> 2
desktop_sjz:339255:339255 [3] NCCL INFO group.cc:422 -> 2
desktop_sjz:339252:339252 [1] NCCL INFO group.cc:106 -> 2
desktop_sjz:339251:339251 [0] NCCL INFO group.cc:106 -> 2
desktop_sjz:339253:339253 [2] NCCL INFO group.cc:422 -> 2
desktop_sjz:339255:339255 [3] NCCL INFO group.cc:106 -> 2
desktop_sjz:339253:339253 [2] NCCL INFO group.cc:106 -> 2
Traceback (most recent call last):
  File "train_text_to_image.py", line 1066, in <module>
    main()
  File "train_text_to_image.py", line 758, in main
    with accelerator.main_process_first():
  File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 816, in main_process_first
    with self.state.main_process_first():
  File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/state.py", line 930, in main_process_first
    with PartialState().main_process_first():
  File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/state.py", line 487, in main_process_first
    yield from self._goes_first(self.is_main_process)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/state.py", line 382, in _goes_first
    self.wait_for_everyone()
  File "/opt/conda/lib/python3.8/site-packages/accelerate/state.py", line 376, in wait_for_everyone
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:

desktop_sjz:339251:339251 [0] NCCL INFO comm 0x55c19e10d210 rank 4 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE
desktop_sjz:339255:339255 [3] NCCL INFO comm 0x55db0c8da4e0 rank 7 nranks 8 cudaDev 3 busId ca000 - Abort COMPLETE
desktop_sjz:339253:339253 [2] NCCL INFO comm 0x55637f643f40 rank 6 nranks 8 cudaDev 2 busId b1000 - Abort COMPLETE
desktop_sjz:339252:339252 [1] NCCL INFO comm 0x5567073ffd00 rank 5 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE
[2023-11-01 15:42:23,825] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 339251) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_text_to_image.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-01_15:42:23
  host      : desktop_sjz
  rank      : 5 (local_rank: 1)
  exitcode  : 1 (pid: 339252)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-11-01_15:42:23
  host      : desktop_sjz
  rank      : 6 (local_rank: 2)
  exitcode  : 1 (pid: 339253)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-11-01_15:42:23
  host      : desktop_sjz
  rank      : 7 (local_rank: 3)
  exitcode  : 1 (pid: 339255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-01_15:42:23
  host      : desktop_sjz
  rank      : 4 (local_rank: 0)
  exitcode  : 1 (pid: 339251)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Why did the training fail? Did I do something wrong? Please help.

BenjaminBossan commented 1 year ago

Not sure if this is the source of your error, but your Linux kernel is relatively old, which can lead to the process hanging, as this warning shows:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

Is it possible for you to upgrade the kernel to a more recent version?
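
Note that containers share the host kernel, so the upgrade would have to happen on the host. You can confirm what the containers actually see with:

uname -r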

jimmysue commented 1 year ago

> Not sure if this is the source of your error, but your Linux kernel is relatively old, which can lead to the process hanging, as this warning shows:
>
> Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
>
> Is it possible for you to upgrade the kernel to a more recent version?

Finally, I got it to work by running the Docker containers with --network=host. I wonder if it is possible to make this work with Docker's bridge network?
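
For anyone hitting the same problem, the working setup was roughly the following (image name and mount are placeholders); with --network=host no -p port mapping is needed, presumably because NCCL then binds to the host interface rather than the 172.17.0.x bridge address shown in the logs above:

# Run each training container directly on the host network.
docker run -d --gpus all \
  --network=host \
  -v /path/to/workspace:/workspace \
  --name trainer \
  <training-image> \
  sleep infinity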

github-actions[bot] commented 12 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.