RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:520 (errno: 13 - Permission denied). The server socket has failed to bind to ?UNKNOWN? (errno: 13 - Permission denied). #140
Traceback (most recent call last):
  File "/home/fangzhijun2/ChatGLM-Finetuning-master/train.py", line 234, in <module>
    main()
  File "/home/fangzhijun2/ChatGLM-Finetuning-master/train.py", line 79, in main
    deepspeed.init_distributed()
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 121, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 149, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:520 (errno: 13 - Permission denied). The server socket has failed to bind to ?UNKNOWN? (errno: 13 - Permission denied).
[2024-04-02 16:47:05,134] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3061266
[2024-04-02 16:47:05,134] [ERROR] [launch.py:322:sigkill_handler] ['/home/fangzhijun2/anaconda3/envs/torch/bin/python', '-u', 'train.py', '--local_rank=0', '--train_path', 'data/spo_0.json', '--model_name_or_path', 'ChatGLM3-6B/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm3', '--lora_dim', '16', '--lora_alpha', '64', '--lora_dropout', '0.1', '--lora_module_name', 'query_key_value,dense_h_to_4h,dense_4h_to_h,dense', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm3'] exits with return code = 1
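
A possible reading of the error, offered as an assumption rather than a confirmed diagnosis: the message shows the rendezvous store trying to bind port 520, and on Linux, ports below 1024 normally require root (or the CAP_NET_BIND_SERVICE capability), which would match errno 13 (Permission denied) for a regular user. The traceback goes through `_env_rendezvous_handler`, which takes the port from the `MASTER_PORT` environment variable, so pointing that at an unprivileged port may avoid the failure. The sketch below uses only the standard library. `is_privileged_port` and `find_free_port` are hypothetical helper names, not part of DeepSpeed or PyTorch, and the 1024 threshold is the usual Linux default, which an administrator can change.

```python
import os
import socket

# Assumption: the common Linux default, where ports below 1024 are privileged.
PRIVILEGED_PORT_LIMIT = 1024


def is_privileged_port(port: int) -> bool:
    """Return True if binding this port would normally need elevated rights."""
    return 0 < port < PRIVILEGED_PORT_LIMIT


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused port by binding to port 0.

    The port is released when the socket closes, so another process could
    in principle grab it before the training job binds it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    failing_port = 520  # the port shown in the error message above
    if is_privileged_port(failing_port):
        # Export an unprivileged port before the process group is initialized.
        os.environ["MASTER_PORT"] = str(find_free_port())
    print("MASTER_PORT =", os.environ.get("MASTER_PORT"))
```

If the port was instead passed on the command line (the launch command is not shown here), for example through the DeepSpeed launcher's `--master_port` option, replacing 520 with an unprivileged value such as 29500 should have the same effect.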