liucongg / ChatGLM-Finetuning

Fine-tuning for specific downstream tasks based on the ChatGLM-6B, ChatGLM2-6B, and ChatGLM3-6B models, covering Freeze, LoRA, P-tuning, full-parameter fine-tuning, and more

RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:520 (errno: 13 - Permission denied). The server socket has failed to bind to ?UNKNOWN? (errno: 13 - Permission denied). #140

Open ysz2000 opened 8 months ago

ysz2000 commented 8 months ago

Traceback (most recent call last):
  File "/home/fangzhijun2/ChatGLM-Finetuning-master/train.py", line 234, in <module>
    main()
  File "/home/fangzhijun2/ChatGLM-Finetuning-master/train.py", line 79, in main
    deepspeed.init_distributed()
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 121, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 149, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:520 (errno: 13 - Permission denied). The server socket has failed to bind to ?UNKNOWN? (errno: 13 - Permission denied).
[2024-04-02 16:47:05,134] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3061266
[2024-04-02 16:47:05,134] [ERROR] [launch.py:322:sigkill_handler] ['/home/fangzhijun2/anaconda3/envs/torch/bin/python', '-u', 'train.py', '--local_rank=0', '--train_path', 'data/spo_0.json', '--model_name_or_path', 'ChatGLM3-6B/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm3', '--lora_dim', '16', '--lora_alpha', '64', '--lora_dropout', '0.1', '--lora_module_name', 'query_key_value,dense_h_to_4h,dense_4h_to_h,dense', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm3'] exits with return code = 1

zkLyons commented 14 hours ago

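Port 520 cannot be used here. On Linux, ports below 1024 are privileged, so binding to one without root fails with errno 13 (Permission denied). Switch to a port number greater than 1024 and it works:

CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 5200 train.py \
    --train_path data/spo_0.json \
    --model_name_or_path ChatGLM-6B/ \
    --per_device_train_batch_size 1 \
    ...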
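If it helps, the port can be checked before launching. The following is a minimal sketch using only Python's standard `socket` module; the helper names `can_bind` and `find_free_port` are hypothetical and are not part of DeepSpeed, PyTorch, or this repository. It tries to bind a TCP socket the same way the rendezvous store would, so a privileged or already-occupied port shows up as a failed bind rather than as a crash inside `deepspeed.init_distributed()`.

```python
import socket


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the current user can bind a TCP socket to host:port.

    Hypothetical helper for illustration. A port below 1024 normally
    fails here for non-root users with errno 13 (Permission denied),
    and a port already in use fails with errno 98 on Linux.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused, unprivileged port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # 520 is the port from the error message; 5200 is the suggested replacement.
    for candidate in (520, 5200):
        print(candidate, "bindable" if can_bind(candidate) else "not bindable")
    print("OS-assigned free port:", find_free_port())
```

A port reported as bindable can then be passed as `--master_port`. Note that another process could still take the port between the check and the launch, so this is a quick sanity check rather than a guarantee.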