Describe the bug
I would like to use remote machines in the cloud for finetuning.
I am using a hostfile and have configured SSH for a passwordless connection.
Running the command
deepspeed --hostfile=myHostfile --master_addr 178.116.84.30 --master_port 10700 run_clm.py --deepspeed ds_config_stage3.json ... (further arguments)
produces the following output:
deepspeed --hostfile=myHostfile --master_addr 178.116.84.30 --master_port 10700 run_clm.py --deepspeed ds_config_stage3.json --model_name_or_path EleutherAI/gpt-j-6B --train_file train.txt --validation_file validation.txt --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned2 --num_train_epochs 4 --eval_steps 4 --gradient_accumulation_steps 32 --per_device_train_batch_size 8 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 10 --save_steps 48 --save_strategy steps --tokenizer_name gpt2 --load_best_model_at_end=True --block_size=2048
[2023-03-08 23:09:03,464] [INFO] [runner.py:549:main] cmd = /home/max/anaconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJ2YXN0YWkiOiBbMCwgMV19 --master_addr=178.116.84.30 --master_port=10700 --enable_each_rank_log=None run_clm.py --deepspeed ds_config_stage3.json --model_name_or_path EleutherAI/gpt-j-6B --train_file train.txt --validation_file validation.txt --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned2 --num_train_epochs 4 --eval_steps 4 --gradient_accumulation_steps 32 --per_device_train_batch_size 8 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 10 --save_steps 48 --save_strategy steps --tokenizer_name gpt2 --load_best_model_at_end=True --block_size=2048
[2023-03-08 23:09:04,879] [INFO] [launch.py:142:main] WORLD INFO DICT: {'vastai': [0, 1]}
[2023-03-08 23:09:04,879] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-03-08 23:09:04,879] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'vastai': [0, 1]})
[2023-03-08 23:09:04,879] [INFO] [launch.py:162:main] dist_world_size=2
[2023-03-08 23:09:04,879] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-03-08 23:09:09,759] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 625, in
main()
File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 230, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 109, in __init__
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1224, in __post_init__
and (self.device.type != "cuda")
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1656, in device
return self._setup_devices
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in get
cached = self.fget(obj)
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1591, in _setup_devices
deepspeed.init_distributed(timeout=timedelta(seconds=self.ddp_timeout))
File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 661, in init_distributed
cdb = TorchBackend(dist_backend, timeout, init_method)
File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 30, in init__
self.init_process_group(backend, timeout, init_method)
File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 34, in init_process_group
torch.distributed.init_process_group(backend,
File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
return TCPStore(
RuntimeError: Stop_waiting response is expected
Traceback (most recent call last):
File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 625, in
main()
File "/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/finetuning_repo/run_clm.py", line 230, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 109, in __init__
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1224, in __post_init__
and (self.device.type != "cuda")
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1656, in device
return self._setup_devices
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in get
cached = self.fget(obj)
File "/home/max/anaconda3/lib/python3.9/site-packages/transformers/training_args.py", line 1591, in _setup_devices
deepspeed.init_distributed(timeout=timedelta(seconds=self.ddp_timeout))
File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 661, in init_distributed
cdb = TorchBackend(dist_backend, timeout, init_method)
File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 30, in init__
self.init_process_group(backend, timeout, init_method)
File "/home/max/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 34, in init_process_group
torch.distributed.init_process_group(backend,
File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 786, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/max/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 318, in _store_based_barrier
store.add(store_key, 1)
RuntimeError: Connection reset by peer
[2023-03-08 23:09:10,901] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 15884
[2023-03-08 23:09:10,901] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 15885
[2023-03-08 23:09:10,911] [ERROR] [launch.py:324:sigkill_handler] ['/home/max/anaconda3/bin/python', '-u', 'run_clm.py', '--local_rank=1', '--deepspeed', 'ds_config_stage3.json', '--model_name_or_path', 'EleutherAI/gpt-j-6B', '--train_file', 'train.txt', '--validation_file', 'validation.txt', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned2', '--num_train_epochs', '4', '--eval_steps', '4', '--gradient_accumulation_steps', '32', '--per_device_train_batch_size', '8', '--use_fast_tokenizer', 'False', '--learning_rate', '5e-06', '--warmup_steps', '10', '--save_total_limit', '10', '--save_steps', '48', '--save_strategy', 'steps', '--tokenizer_name', 'gpt2', '--load_best_model_at_end=True', '--block_size=2048'] exits with return code = 1
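
For context, the crash happens while torch.distributed is setting up its c10d TCPStore rendezvous, before any model code runs. The following minimal sketch is not part of run_clm.py; it is only an illustration, under the assumption that it is started once per rank with a RANK environment variable set, and it exercises just that rendezvous step with the same address, port and world size as in the command above:

# Minimal sketch: reproduce only the TCPStore rendezvous that
# torch.distributed.init_process_group performs internally.
# MASTER_ADDR / MASTER_PORT / WORLD_SIZE are taken from the deepspeed
# command above; RANK is assumed to be provided in the environment
# for each test process (an assumption of this sketch).
import os
from datetime import timedelta
from torch.distributed import TCPStore

MASTER_ADDR = "178.116.84.30"
MASTER_PORT = 10700
WORLD_SIZE = 2
RANK = int(os.environ.get("RANK", "0"))

# Rank 0 hosts the store; all other ranks connect to it.
store = TCPStore(
    MASTER_ADDR,
    MASTER_PORT,
    WORLD_SIZE,
    is_master=(RANK == 0),
    timeout=timedelta(seconds=60),
)
store.set(f"rank_{RANK}", "ok")
print(f"rank {RANK}: TCPStore rendezvous succeeded")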
To Reproduce
Use the following files (IP address and port need to be changed):
myHostfile:
vastai slots=2
SSH config file:
Host vastai
Hostname 178.116.84.30
User root
Port 10700
Steps to reproduce the behavior:
See above.
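
Since the rendezvous uses the master address and port from the command line above, a simple reachability check can be run on both the local machine and the remote node. The snippet below is just such a sketch; the host and port are copied from my command, and the 5-second timeout is an arbitrary choice:

# Sketch: check that the rendezvous endpoint is reachable from this machine.
# Host and port are the --master_addr / --master_port values from the command
# above; the timeout value is arbitrary.
import socket

MASTER_ADDR = "178.116.84.30"
MASTER_PORT = 10700

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    try:
        s.connect((MASTER_ADDR, MASTER_PORT))
        print(f"{MASTER_ADDR}:{MASTER_PORT} is reachable")
    except OSError as exc:
        print(f"{MASTER_ADDR}:{MASTER_PORT} is NOT reachable: {exc}")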
Expected behavior
Finetuning should start.
Logging into the remote machine with "ssh vastai" works fine, without any password prompt.
ds_report output

(base) max@max-5824:/media/max/0D6109060D610906/GPT/finetune/Finetune_GPTNEO_GPTJ6B/$ ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/max/anaconda3/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0
deepspeed install path ........... ['/home/max/anaconda3/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.1+867da307, 867da307, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
Docker context
No Docker image