A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
Notice: In order to resolve issues more efficiently, please raise the issue following the template and include details.
🐛 Bug
Command executed:

```shell
torchrun --nnodes 2 --node_rank 0 --nproc_per_node ${gpu_num} \
  --master_addr *** --master_port 1234 \
  ../../../funasr/bin/train_ds.py \
  ++model="${model_name_or_model_dir}" \
  ++train_data_set_list="${train_data}" \
  ++valid_data_set_list="${val_data}" \
  ++dataset="AudioDataset" \
  ++dataset_conf.index_ds="IndexDSJsonl" \
  ++dataset_conf.data_split_num=1 \
  ++dataset_conf.batch_sampler="BatchSampler" \
  ++dataset_conf.batch_size=6000 \
  ++dataset_conf.sort_size=1024 \
  ++dataset_conf.batch_type="token" \
  ++dataset_conf.num_workers=12 \
  ++train_conf.max_epoch=200 \
  ++train_conf.log_interval=100 \
  ++train_conf.resume=true \
  ++train_conf.validate_interval=5000 \
  ++train_conf.save_checkpoint_interval=5000 \
  ++train_conf.keep_nbest_models=50 \
  ++train_conf.avg_nbest_model=10 \
  ++train_conf.use_deepspeed=true \
  ++train_conf.deepspeed_config=${deepspeed_config} \
  ++optim_conf.lr=0.0008 \
  ++output_dir="${output_dir}" &> ${log_file}
```
Training fails after a few hundred steps with the following error:
```
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer liangxianchen-asr-2wh-pretrain1-w-0.liangxianchen-asr-2wh-pretrain1.prdsafe.svc.hbox2-zzzc2-prd.local<48836>
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO transport/net_socket.cc:493 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO include/net.h:35 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO transport/net.cc:1034 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO proxy.cc:520 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO proxy.cc:684 -> 6 [Proxy Thread]
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error: remote process exited or there was a network error, NCCL version 2.14.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connection closed by remote peer liangxianchen-asr-2wh-pretrain1-w-0.liangxianchen-asr-2wh-pretrain1.prdsafe.svc.hbox2-zzzc2-prd.local<48836>
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174397 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174398 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174399 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174400 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174401 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174402 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174403 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 174396) of binary: /mnt/liangxianchen/anaconda3/envs/python38/bin/python
Traceback (most recent call last):
  File "/mnt/liangxianchen/anaconda3/envs/python38/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
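As an aside, the `NCCL WARN`/`NCCL INFO` lines in the log come from NCCL's built-in logging. These are standard NCCL environment variables (not FunASR-specific) that can be exported before `torchrun` to make the next failure easier to localize; the interface name here is an assumption and should be replaced with this cluster's actual data-plane interface:

```shell
# Standard NCCL debugging knobs; set before launching torchrun.
export NCCL_DEBUG=INFO        # print WARN/INFO lines like those in the log above
export NCCL_DEBUG_SUBSYS=NET  # focus logging on the network transport
# Assumption: eth0 is the node's stable data-plane interface; adjust as needed.
export NCCL_SOCKET_IFNAME=eth0

echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
```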
Additionally: when I initially set batch_size=7000, the error above appeared around step 300; with batch_size=6000 it appeared around step 1000. I also increased the timeout parameter in torch.distributed.init_process_group(); with batch_size=6000 and the larger timeout, the error did not appear until around step 5000.
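For reference, the timeout change described above can be sketched as follows. This is a minimal illustration, not FunASR's actual initialization code; the helper name and the 120-minute value are made up for the example (PyTorch's default NCCL collective timeout is 30 minutes):

```python
from datetime import timedelta


def init_distributed(timeout_minutes: int = 120) -> None:
    """Hypothetical helper: initialize the NCCL process group with a
    collective timeout longer than PyTorch's 30-minute default."""
    # Imported lazily so the sketch can be inspected without a GPU setup.
    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=timeout_minutes),
    )
```

Note that a longer timeout only delays the watchdog abort; it does not address whatever is closing the remote peer's socket in the first place.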