modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Multi-node, multi-GPU training of the Paraformer model fails with an error #1930

Open · chenpaopao opened this issue 2 months ago

chenpaopao commented 2 months ago

Notice: In order to resolve issues more efficiently, please raise issues following the template and provide details.

🐛 Bug

Command executed:

```bash
torchrun --nnodes 2 --node_rank 0 --nproc_per_node ${gpu_num} --master_addr *** --master_port 1234 \
    ../../../funasr/bin/train_ds.py \
    ++model="${model_name_or_model_dir}" \
    ++train_data_set_list="${train_data}" \
    ++valid_data_set_list="${val_data}" \
    ++dataset="AudioDataset" \
    ++dataset_conf.index_ds="IndexDSJsonl" \
    ++dataset_conf.data_split_num=1 \
    ++dataset_conf.batch_sampler="BatchSampler" \
    ++dataset_conf.batch_size=6000 \
    ++dataset_conf.sort_size=1024 \
    ++dataset_conf.batch_type="token" \
    ++dataset_conf.num_workers=12 \
    ++train_conf.max_epoch=200 \
    ++train_conf.log_interval=100 \
    ++train_conf.resume=true \
    ++train_conf.validate_interval=5000 \
    ++train_conf.save_checkpoint_interval=5000 \
    ++train_conf.keep_nbest_models=50 \
    ++train_conf.avg_nbest_model=10 \
    ++train_conf.use_deepspeed=true \
    ++train_conf.deepspeed_config=${deepspeed_config} \
    ++optim_conf.lr=0.0008 \
    ++output_dir="${output_dir}" &> ${log_file}
```

The model fails with the following error after a few hundred training steps:

```
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer liangxianchen-asr-2wh-pretrain1-w-0.liangxianchen-asr-2wh-pretrain1.prdsafe.svc.hbox2-zzzc2-prd.local<48836>
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO transport/net_socket.cc:493 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO include/net.h:35 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO transport/net.cc:1034 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO proxy.cc:520 -> 6
liangxianchen-asr-2wh-pretrain1-m-0:174396:175238 [0] NCCL INFO proxy.cc:684 -> 6 [Proxy Thread]
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: remote process exited or there was a network error, NCCL version 2.14.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
Net : Connection closed by remote peer liangxianchen-asr-2wh-pretrain1-w-0.liangxianchen-asr-2wh-pretrain1.prdsafe.svc.hbox2-zzzc2-prd.local<48836>
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174397 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174398 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174399 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174400 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174401 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174402 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 174403 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 174396) of binary: /mnt/liangxianchen/anaconda3/envs/python38/bin/python
Traceback (most recent call last):
  File "/mnt/liangxianchen/anaconda3/envs/python38/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/liangxianchen/anaconda3/envs/python38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

Additionally: when I initially set batch_size=7000, the error above appeared at around step 300; with batch_size=6000 it appeared at around step 1000. After I also increased the timeout parameter in torch.distributed.init_process_group(), the run with batch_size=6000 did not fail until around step 5000.
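For reference, a minimal sketch of how the NCCL collective timeout can be raised when initializing the process group; the backend choice and the two-hour value below are illustrative assumptions, not the exact settings used in this run:

```python
# Illustrative sketch only: raising the collective timeout passed to the
# default process group. The two-hour value is an assumed example; PyTorch's
# default timeout is 30 minutes.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

A longer timeout can only postpone the failure if a remote rank is genuinely stalled or has exited, which would be consistent with the error simply shifting to a later step rather than disappearing.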

LauraGPT commented 2 months ago

Show me the full log file.