🐛 Describe the bug
I'm running the following deepspeed command for fine-tuning in my venv:
deepspeed trainer_sft.py --configs llama2-7b-sft-RLAIF --wandb-entity tammosta --show_dataset_stats --deepspeed
However, I'm getting the following error:
Traceback (most recent call last):
  File "/mnt/efs/data/tammosta/Open-Assistant/model/model_training/trainer_sft.py", line 485, in <module>
    main()
  File "/mnt/efs/data/tammosta/Open-Assistant/model/model_training/trainer_sft.py", line 298, in main
    args = TrainingArguments(
  File "<string>", line 112, in __init__
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/transformers/training_args.py", line 1372, in __post_init__
    and (self.device.type != "cuda")
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/transformers/training_args.py", line 1795, in device
    return self._setup_devices
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/transformers/training_args.py", line 1735, in _setup_devices
    self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/accelerate/state.py", line 187, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 146, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
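errno 98 means another process is already bound to the default rendezvous port 29500, typically a stale rank left over from an earlier run on the same node. A minimal triage sketch, assuming a single-node setup (ss and pkill are standard Linux tools; --master_port is the DeepSpeed launcher's flag for the rendezvous port, and 29501 here is just an arbitrary free port):

# See which process currently holds the default rendezvous port:
ss -ltnp | grep 29500        # alternatively: lsof -i :29500

# If it is a leftover trainer from a crashed run, kill it:
pkill -f trainer_sft.py

# Or sidestep the conflict by launching on a different free port:
deepspeed --master_port 29501 trainer_sft.py --configs llama2-7b-sft-RLAIF \
    --wandb-entity tammosta --show_dataset_stats --deepspeed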
[2024-01-30 19:46:50,683] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-30 19:46:50,686] [INFO] [comm.py:637:init_distributed] cdb=None
The remaining ranks then fail with the same stack, this time ending in a connection reset (the traceback below is repeated once per rank):
Traceback (most recent call last):
  File "/mnt/efs/data/tammosta/Open-Assistant/model/model_training/trainer_sft.py", line 485, in <module>
    main()
  File "/mnt/efs/data/tammosta/Open-Assistant/model/model_training/trainer_sft.py", line 298, in main
    args = TrainingArguments(
  File "<string>", line 112, in __init__
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/transformers/training_args.py", line 1372, in __post_init__
    and (self.device.type != "cuda")
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/transformers/training_args.py", line 1795, in device
    return self._setup_devices
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/transformers/training_args.py", line 1735, in _setup_devices
    self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/accelerate/state.py", line 187, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 146, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/opt/conda/envs/ml_v3/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: Connection reset by peer
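The "Connection reset by peer" on these ranks is a downstream symptom rather than a separate bug: rank 0 failed to bind the TCPStore server, so the store never came up and the other workers' connections to it were dropped. Once the port conflict is resolved, a quick way to confirm the rendezvous itself works is a two-process smoke test. The script below is our own illustration, not part of Open-Assistant, and the port is again an assumed free one; gloo is used so no GPUs are needed:

cat > /tmp/rendezvous_smoke.py <<'EOF'
import torch.distributed as dist

# torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for us.
dist.init_process_group(backend="gloo")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} rendezvoused OK")
dist.destroy_process_group()
EOF
torchrun --nproc_per_node=2 --master_port=29501 /tmp/rendezvous_smoke.py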
[2024-01-30 19:46:51,546] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3587180
[2024-01-30 19:46:51,659] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3587181
[2024-01-30 19:46:51,678] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3587182
[2024-01-30 19:46:51,695] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3587183
[2024-01-30 19:46:51,695] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3587184
[2024-01-30 19:46:51,766] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3587185
[2024-01-30 19:46:51,783] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3587186
[2024-01-30 19:46:51,798] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3587187
[2024-01-30 19:46:51,813] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/envs/ml_v3/bin/python3.10', '-u', 'trainer_sft.py', '--local_rank=7', '--configs', 'llama2-7b-sft-RLAIF', '--wandb-entity', 'tammosta', '--show_dataset_stats', '--deepspeed'] exits with return code = 1
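Once one rank exits non-zero, the DeepSpeed launcher's sigkill_handler tears down all eight ranks, which is what the "Killing subprocess" lines show. If an earlier run died before that cleanup finished, orphaned ranks can keep holding both port 29500 and GPU memory. A quick pre-launch check, using standard tools (adjust the script name if yours differs):

# Any orphaned trainer ranks still alive?
pgrep -af trainer_sft.py

# Any leftover processes still holding GPU memory?
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv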
Versions
PyTorch version:
2.2.0+cu121
CUDA version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0