Multi-node training reports "Stop_waiting response is expected" and "Connection reset by peer"

Describe the bug

I would like to use remote machines in the cloud for finetuning. I am using a hostfile and have configured ssh for passwordless login.

Using the command deepspeed --hostfile=myHostfile --master_addr 178.116.84.30 --master_port 10700 run_clm.py --deepspeed ds_config_stage3.json ... (further arguments)

produces the following output:

deepspeed --hostfile=myHostfile --master_addr 178.116.84.30 --master_port 10700 run_clm.py --deepspeed ds_config_stage3.json --model_name_or_path EleutherAI/gpt-j-6B --train_file train.txt --validation_file validation.txt --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned2 --num_train_epochs 4 --eval_steps 4 --gradient_accumulation_steps 32 --per_device_train_batch_size 8 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 10 --save_steps 48 --save_strategy steps --tokenizer_name gpt2 --load_best_model_at_end=True --block_size=2048
[2023-03-08 23:09:03,464] [INFO] [runner.py:549:main] cmd = /home/max/anaconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJ2YXN0YWkiOiBbMCwgMV19 --master_addr=178.116.84.30 --master_port=10700 --enable_each_rank_log=None run_clm.py --deepspeed ds_config_stage3.json --model_name_or_path EleutherAI/gpt-j-6B --train_file train.txt --validation_file validation.txt --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned2 --num_train_epochs 4 --eval_steps 4 --gradient_accumulation_steps 32 --per_device_train_batch_size 8 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 10 --save_steps 48 --save_strategy steps --tokenizer_name gpt2 --load_best_model_at_end=True --block_size=2048
[2023-03-08 23:09:04,879] [INFO] [launch.py:142:main] WORLD INFO DICT: {'vastai': [0, 1]}
[2023-03-08 23:09:04,879] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-03-08 23:09:04,879] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'vastai': [0, 1]})
[2023-03-08 23:09:04,879] [INFO] [launch.py:162:main] dist_world_size=2
[2023-03-08 23:09:04,879] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-03-08 23:09:09,759] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 625, in <module>
    main()
  File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 230, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 109, in __init__
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1224, in __post_init__
    and (self.device.type != "cuda")
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1656, in device
    return self._setup_devices
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1591, in _setup_devices
    deepspeed.init_distributed(timeout=timedelta(seconds=self.ddp_timeout))
  File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 661, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method)
  File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 30, in __init__
    self.init_process_group(backend, timeout, init_method)
  File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 34, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: Stop_waiting response is expected
Traceback (most recent call last):
  File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 625, in <module>
    main()
  File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 230, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 109, in __init__
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1224, in __post_init__
    and (self.device.type != "cuda")
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1656, in device
    return self._setup_devices
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1591, in _setup_devices
    deepspeed.init_distributed(timeout=timedelta(seconds=self.ddp_timeout))
  File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 661, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method)
  File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 30, in __init__
    self.init_process_group(backend, timeout, init_method)
  File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 34, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 786, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 318, in _store_based_barrier
    store.add(store_key, 1)
RuntimeError: Connection reset by peer
[2023-03-08 23:09:10,901] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 15884
[2023-03-08 23:09:10,901] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 15885
[2023-03-08 23:09:10,911] [ERROR] [launch.py:324:sigkill_handler] ['/home/max/anaconda3/bin/python', '-u', 'run_clm.py', '--local_rank=1', '--deepspeed', 'ds_config_stage3.json', '--model_name_or_path', 'EleutherAI/gpt-j-6B', '--train_file', 'train.txt', '--validation_file', 'validation.txt', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned2', '--num_train_epochs', '4', '--eval_steps', '4', '--gradient_accumulation_steps', '32', '--per_device_train_batch_size', '8', '--use_fast_tokenizer', 'False', '--learning_rate', '5e-06', '--warmup_steps', '10', '--save_total_limit', '10', '--save_steps', '48', '--save_strategy', 'steps', '--tokenizer_name', 'gpt2', '--load_best_model_at_end=True', '--block_size=2048'] exits with return code = 1
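The `--world_info` value in the launcher command appears to be the resolved hostfile, serialized as JSON and base64-encoded. A minimal sketch for decoding it, assuming URL-safe base64; the helper name `decode_world_info` is hypothetical and not part of DeepSpeed:

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a --world_info value into a {hostname: [local ranks]} dict."""
    # Re-add any stripped '=' padding so the decoder accepts the string.
    padded = encoded + "=" * (-len(encoded) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Value copied from the runner.py log line in this report.
print(decode_world_info("eyJ2YXN0YWkiOiBbMCwgMV19"))
```

The decoded value lists a single host, vastai, with local ranks 0 and 1, which matches the "WORLD INFO DICT" and "nnodes=1, num_local_procs=2" lines in the log.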

To Reproduce

Use the following files (the IP address and port need to be changed).

myHostfile:

vastai slots=2

ssh config file:

Host vastai
    Hostname 178.116.84.30
    User root
    Port 10700

Steps to reproduce the behavior: see above

Expected behavior

Finetuning should start.

Logging into the remote machine with "ssh vastai" works fine and does not prompt for a password.
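Both the ssh config and `--master_port` use port 10700, so it may be worth checking what already answers on that port before the rendezvous starts. A minimal sketch using only the Python standard library; `read_banner` and `describe_banner` are hypothetical helper names, and treating an `SSH-` prefix as the sign of an SSH daemon is an assumption based on how SSH servers identify themselves on connect:

```python
import socket


def read_banner(host: str, port: int, timeout: float = 3.0) -> bytes:
    """Connect to host:port and return whatever the peer sends first (may be empty)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        try:
            return sock.recv(64)
        except socket.timeout:
            # The peer accepted the connection but did not send anything.
            return b""


def describe_banner(banner: bytes) -> str:
    """Give a rough, heuristic reading of the first bytes a port sends."""
    if banner.startswith(b"SSH-"):
        return "an SSH daemon is listening here; try a different --master_port"
    if banner:
        return "some other service answered first; the port is probably taken"
    return "nothing spoke first; the port may be free or held by a silent service"
```

Run from the local machine, `describe_banner(read_banner("178.116.84.30", 10700))` would show whether the port passed as `--master_port` is the one the SSH daemon is already using. A refused connection raises `ConnectionRefusedError`, which would suggest that nothing is listening on that port at all.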

ds_report output

(base) max@max-5824:/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/$ ds_report

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/max/anaconda3/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0
deepspeed install path ........... ['/home/max/anaconda3/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.1+867da307, 867da307, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

System info (please complete the following information):

OS: Ubuntu 20.04
GPU: 2 x RTX 3090 on the remote machine
Python: 3.9
The remote machine is hosted somewhere in the cloud

Docker context

No Docker image
