modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

NCCL timeout during multi-node, multi-GPU training #2006

Closed chenpaopao closed 2 months ago

chenpaopao commented 2 months ago

Notice: In order to resolve issues more efficiently, please follow the issue template and fill in the details.

❓ Questions and Help

While training on roughly 10T of speech data with multi-node, multi-GPU distributed training, NCCL reports a timeout error. How can this be resolved?
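For reference, the 600000 ms limit that shows up in the log below is just the default NCCL collective timeout (10 minutes). It can be raised where the process group is created. A minimal sketch, assuming a plain torch.distributed setup rather than FunASR's own trainer/DeepSpeed init path:

```python
# Minimal sketch (not FunASR's actual init code): raise the NCCL collective
# timeout above the 10-minute default (Timeout(ms)=600000 in the log below).
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=60),  # NCCL default is 10 minutes
)
```

A longer timeout only buys time for genuinely slow steps; if one rank has actually crashed (as happens later in this log), the collective will still hang until the new limit is hit.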

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

Code

[2024-08-12 23:04:07,255][root][INFO] - train, rank: 9, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.886), (loss_avg_slice: 0.669), (ppl_avg_slice: 1.953e+00), (acc_avg_slice: 0.746), (lr: 5.729e-04), [('loss_att', 0.768), ('acc', 0.678), ('loss_pre', 0.119), ('loss', 0.886)], {'data_load': '0.000', 'forward_time': '0.294', 'backward_time': '0.279', 'optim_time': '0.099', 'total_time': '0.673'}, GPU, memory: usage: 1.165 GB, peak: 64.227 GB, cache: 65.062 GB, cache_peak: 65.062 GB [2024-08-12 23:04:07,255][root][INFO] - train, rank: 15, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.791), (loss_avg_slice: 0.672), (ppl_avg_slice: 1.959e+00), (acc_avg_slice: 0.746), (lr: 5.729e-04), [('loss_att', 0.699), ('acc', 0.695), ('loss_pre', 0.092), ('loss', 0.791)], {'data_load': '0.000', 'forward_time': '0.283', 'backward_time': '0.284', 'optim_time': '0.104', 'total_time': '0.673'}, GPU, memory: usage: 1.265 GB, peak: 62.212 GB, cache: 62.998 GB, cache_peak: 62.998 GB [2024-08-12 23:04:07,255][root][INFO] - train, rank: 12, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.876), (loss_avg_slice: 0.672), (ppl_avg_slice: 1.958e+00), (acc_avg_slice: 0.746), (lr: 5.729e-04), [('loss_att', 0.776), ('acc', 0.68), ('loss_pre', 0.1), ('loss', 0.876)], {'data_load': '0.001', 'forward_time': '0.300', 'backward_time': '0.290', 'optim_time': '0.081', 'total_time': '0.673'}, GPU, memory: usage: 1.252 GB, peak: 63.860 GB, cache: 64.854 GB, cache_peak: 64.854 GB [2024-08-12 23:04:07,255][root][INFO] - train, rank: 10, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.866), (loss_avg_slice: 0.671), (ppl_avg_slice: 1.957e+00), (acc_avg_slice: 0.746), (lr: 5.729e-04), [('loss_att', 0.756), ('acc', 0.654), ('loss_pre', 0.11), ('loss', 0.866)], {'data_load': '0.001', 'forward_time': '0.300', 'backward_time': '0.278', 'optim_time': '0.094', 'total_time': '0.673'}, GPU, memory: usage: 1.143 GB, peak: 59.564 GB, cache: 60.125 GB, cache_peak: 60.125 GB [2024-08-12 23:04:07,255][root][INFO] - train, rank: 14, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.827), (loss_avg_slice: 0.671), (ppl_avg_slice: 1.957e+00), (acc_avg_slice: 0.746), (lr: 5.729e-04), [('loss_att', 0.734), ('acc', 0.687), ('loss_pre', 0.093), ('loss', 0.827)], {'data_load': '0.000', 'forward_time': '0.297', 'backward_time': '0.281', 'optim_time': '0.094', 'total_time': '0.673'}, GPU, memory: usage: 1.267 GB, peak: 60.196 GB, cache: 60.869 GB, cache_peak: 60.869 GB [rank15]:[E ProcessGroupNCCL.cpp:563] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. [rank11]:[E ProcessGroupNCCL.cpp:563] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. [rank8]:[E ProcessGroupNCCL.cpp:563] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. 
[rank10]:[E ProcessGroupNCCL.cpp:563] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. [rank13]:[E ProcessGroupNCCL.cpp:563] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. [rank12]:[E ProcessGroupNCCL.cpp:563] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. [rank9]:[E ProcessGroupNCCL.cpp:563] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600099 milliseconds before timing out. [rank14]:[E ProcessGroupNCCL.cpp:563] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600098 milliseconds before timing out. liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14895 [0] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14893 [2] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14889 [4] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-m-0:14058:14883 [7] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-m-0:14056:14887 [5] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14885 [6] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14895 [0] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14893 [2] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14889 [4] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14885 [6] NCCL INFO [Service thread] Connection closed by localRank 6 liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14895 [0] NCCL INFO [Service thread] Connection closed by localRank 6 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14893 [2] NCCL INFO [Service thread] Connection closed by localRank 6 liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14889 [4] NCCL INFO [Service thread] Connection closed by localRank 6 liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14885 [6] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-m-0:14058:14817 [7] NCCL INFO comm 0xa73b540 rank 15 nranks 16 cudaDev 7 busId e0000 - Abort COMPLETE [rank15]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 15] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank15]:[E ProcessGroupNCCL.cpp:577] [Rank 15] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank15]:[E ProcessGroupNCCL.cpp:583] [Rank 15] To avoid data inconsistency, we are taking the entire process down. 
[rank15]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f513597a897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f50e9a5a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f50e9a5efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f50e9a6031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7f51354c7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f5136cddac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f5136d6f850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14895 [0] NCCL INFO [Service thread] Connection closed by localRank 2 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14893 [2] NCCL INFO [Service thread] Connection closed by localRank 2 liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14889 [4] NCCL INFO [Service thread] Connection closed by localRank 2 liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14885 [6] NCCL INFO [Service thread] Connection closed by localRank 2 liangxianchen-funasr-15wh-train-lxc0-m-0:14056:14827 [5] NCCL INFO comm 0xb823c00 rank 13 nranks 16 cudaDev 5 busId a1000 - Abort COMPLETE [rank13]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 13] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank13]:[E ProcessGroupNCCL.cpp:577] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank13]:[E ProcessGroupNCCL.cpp:583] [Rank 13] To avoid data inconsistency, we are taking the entire process down. [rank13]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fadf3ecf897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fada7c5a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fada7c5efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fada7c6031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7fadf36c7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7fadf4f2bac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fadf4fbd850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14895 [0] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14893 [2] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14889 [4] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14885 [6] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-m-0:14054:14891 [3] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14895 [0] NCCL INFO [Service thread] Connection closed by localRank 4 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14893 [2] NCCL INFO [Service thread] Connection closed by localRank 4 liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14889 [4] NCCL INFO [Service thread] Connection closed by localRank 4 liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14885 [6] NCCL INFO [Service thread] Connection closed by localRank 4 liangxianchen-funasr-15wh-train-lxc0-m-0:14054:14825 [3] NCCL INFO comm 0x9bc5a80 rank 11 nranks 16 cudaDev 3 busId 4e000 - Abort COMPLETE [rank11]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 11] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank11]:[E ProcessGroupNCCL.cpp:577] [Rank 11] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank11]:[E ProcessGroupNCCL.cpp:583] [Rank 11] To avoid data inconsistency, we are taking the entire process down. [rank11]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 11] Process group watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1d7577a897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f1d2945a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f1d2945efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1d2946031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7f1d74ec7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f1d767f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f1d76882850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14895 [0] NCCL INFO [Service thread] Connection closed by localRank 0 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14893 [2] NCCL INFO [Service thread] Connection closed by localRank 0 liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14889 [4] NCCL INFO [Service thread] Connection closed by localRank 0 liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14885 [6] NCCL INFO [Service thread] Connection closed by localRank 0 liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14895 [0] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14893 [2] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14889 [4] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-m-0:14052:14897 [1] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14885 [6] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-m-0:14053:14839 [2] NCCL INFO comm 0x9bbaa80 rank 10 nranks 16 cudaDev 2 busId 48000 - Abort COMPLETE liangxianchen-funasr-15wh-train-lxc0-m-0:14052:14821 [1] NCCL INFO comm 0x13a5d7c0 rank 9 nranks 16 cudaDev 1 busId 10000 - Abort COMPLETE liangxianchen-funasr-15wh-train-lxc0-m-0:14055:14819 [4] NCCL INFO comm 0x13025cc0 rank 12 nranks 16 cudaDev 4 busId 9b000 - Abort COMPLETE [rank9]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 9] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank9]:[E ProcessGroupNCCL.cpp:577] [Rank 9] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank9]:[E ProcessGroupNCCL.cpp:583] [Rank 9] To avoid data inconsistency, we are taking the entire process down. [rank9]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600099 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd083ccf897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fd037a5a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd037a5efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd037a6031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7fd0834c7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7fd084d61ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fd084df3850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank12]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 12] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank12]:[E ProcessGroupNCCL.cpp:577] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank12]:[E ProcessGroupNCCL.cpp:583] [Rank 12] To avoid data inconsistency, we are taking the entire process down. [rank12]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc84bf7a897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc7ffc5a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc7ffc5efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc7ffc6031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7fc84b6c7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7fc84d004ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fc84d096850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank10]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 10] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank10]:[E ProcessGroupNCCL.cpp:577] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank10]:[E ProcessGroupNCCL.cpp:583] [Rank 10] To avoid data inconsistency, we are taking the entire process down. liangxianchen-funasr-15wh-train-lxc0-m-0:14051:14875 [0] NCCL INFO comm 0x9f2ef80 rank 8 nranks 16 cudaDev 0 busId b000 - Abort COMPLETE [rank10]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 10] Process group watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7efd97a897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f7eb165a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7eb165efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7eb166031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7f7efd0c7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f7efea8bac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f7efeb1d850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank8]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 8] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank8]:[E ProcessGroupNCCL.cpp:577] [Rank 8] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank8]:[E ProcessGroupNCCL.cpp:583] [Rank 8] To avoid data inconsistency, we are taking the entire process down. [rank8]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 8] Process group watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4f0429e897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4eb805a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4eb805efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4eb806031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7f4f03ac7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f4f05301ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f4f05393850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

liangxianchen-funasr-15wh-train-lxc0-m-0:14057:14823 [6] NCCL INFO comm 0xaf19000 rank 14 nranks 16 cudaDev 6 busId db000 - Abort COMPLETE [rank14]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 14] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank14]:[E ProcessGroupNCCL.cpp:577] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank14]:[E ProcessGroupNCCL.cpp:583] [Rank 14] To avoid data inconsistency, we are taking the entire process down. [rank14]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600098 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0bf5f7a897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0ba9c5a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0ba9c5efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0ba9c6031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7f0bf56c7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f0bf70a6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f0bf7138850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0812 23:27:08.684369 140702075602752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 14053 closing signal SIGTERM
W0812 23:27:08.686259 140702075602752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 14054 closing signal SIGTERM
W0812 23:27:08.688141 140702075602752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 14055 closing signal SIGTERM
W0812 23:27:08.690283 140702075602752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 14056 closing signal SIGTERM
W0812 23:27:08.693412 140702075602752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 14058 closing signal SIGTERM
E0812 23:27:24.620319 140702075602752 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 14051) of binary: /lxc-data/minianconda3/envs/python39/bin/python
Traceback (most recent call last):
  File "/lxc-data/minianconda3/envs/python39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../../funasr/bin/train_ds.py FAILED

Failures:
[1]:
  time      : 2024-08-12_23:27:08
  host      : liangxianchen-funasr-15wh-train-lxc0-m-0.liangxianchen-funasr-15wh-train-lxc0.hbox-aigc.svc.hbox2-zzzc2-prd.local
  rank      : 9 (local_rank: 1)
  exitcode  : -6 (pid: 14052)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 14052
[2]:
  time      : 2024-08-12_23:27:08
  host      : liangxianchen-funasr-15wh-train-lxc0-m-0.liangxianchen-funasr-15wh-train-lxc0.hbox-aigc.svc.hbox2-zzzc2-prd.local
  rank      : 14 (local_rank: 6)
  exitcode  : -6 (pid: 14057)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 14057

Root Cause (first observed failure):
[0]:
  time      : 2024-08-12_23:27:08
  host      : liangxianchen-funasr-15wh-train-lxc0-m-0.liangxianchen-funasr-15wh-train-lxc0.hbox-aigc.svc.hbox2-zzzc2-prd.local
  rank      : 8 (local_rank: 0)
  exitcode  : -6 (pid: 14051)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 14051
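The report above only records the SIGABRT signal because error_file is <N/A>. A hypothetical tweak to the entrypoint (here that would be funasr/bin/train_ds.py's main) makes torchelastic write a structured error file per failing rank, which is usually more informative than the raw signal:

```python
# Hypothetical sketch: wrap the training entrypoint with torchelastic's error
# recorder so "error_file" in the failure report above gets populated with the
# real Python exception instead of just "Signal 6 (SIGABRT)".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training entrypoint body

if __name__ == "__main__":
    main()
```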

}, GPU, memory: usage: 1.516 GB, peak: 59.564 GB, cache: 60.074 GB, cache_peak: 60.074 GB [2024-08-12 23:00:19,907][root][INFO] - train, rank: 3, epoch: 0/500, data_slice: 0/5, step_in_slice: 32700/81081, step_in_epoch: 32700, total step: 32700, (loss_avg_rank: 0.737), (loss_avg_slice: 0.670), (ppl_avg_slice: 1.955e+00), (acc_avg_slice: 0.748), (lr: 5.747e-04), [('loss_att', 0.67), ('acc', 0.75), ('loss_pre', 0.067), ('loss', 0.737)], {'data_load': '0.001', 'forward_time': '0.318', 'backward_time': '0.321', 'optim_time': '0.112', 'total_time': '0.753'}, GPU, memory: usage: 1.527 GB, peak: 62.395 GB, cache: 62.867 GB, cache_peak: 62.867 GB [2024-08-12 23:00:19,907][root][INFO] - train, rank: 1, epoch: 0/500, data_slice: 0/5, step_in_slice: 32700/81081, step_in_epoch: 32700, total step: 32700, (loss_avg_rank: 0.735), (loss_avg_slice: 0.673), (ppl_avg_slice: 1.961e+00), (acc_avg_slice: 0.747), (lr: 5.747e-04), [('loss_att', 0.666), ('acc', 0.74), ('loss_pre', 0.069), ('loss', 0.735)], {'data_load': '0.000', 'forward_time': '0.340', 'backward_time': '0.325', 'optim_time': '0.086', 'total_time': '0.753'}, GPU, memory: usage: 1.522 GB, peak: 65.143 GB, cache: 66.145 GB, cache_peak: 66.145 GB [2024-08-12 23:00:19,907][root][INFO] - train, rank: 7, epoch: 0/500, data_slice: 0/5, step_in_slice: 32700/81081, step_in_epoch: 32700, total step: 32700, (loss_avg_rank: 0.706), (loss_avg_slice: 0.670), (ppl_avg_slice: 1.955e+00), (acc_avg_slice: 0.746), (lr: 5.747e-04), [('loss_att', 0.635), ('acc', 0.755), ('loss_pre', 0.071), ('loss', 0.706)], {'data_load': '0.001', 'forward_time': '0.321', 'backward_time': '0.344', 'optim_time': '0.086', 'total_time': '0.753'}, GPU, memory: usage: 1.535 GB, peak: 60.563 GB, cache: 60.973 GB, cache_peak: 60.973 GB [2024-08-12 23:00:19,908] [INFO] [timer.py:258:stop] epoch=0/micro_step=4700/global_step=4700, RunningAvgSamplesPerSec=14.807020542856518, CurrSamplesPerSec=21.249604274437697, MemAllocated=1.52GB, MaxMemAllocated=64.96GB [2024-08-12 23:00:19,917][root][INFO] - train, rank: 0, epoch: 0/500, data_slice: 0/5, step_in_slice: 32700/81081, step_in_epoch: 32700, total step: 32700, (loss_avg_rank: 0.745), (loss_avg_slice: 0.672), (ppl_avg_slice: 1.958e+00), (acc_avg_slice: 0.747), (lr: 5.747e-04), [('loss_att', 0.681), ('acc', 0.754), ('loss_pre', 0.064), ('loss', 0.745)], {'data_load': '0.000', 'forward_time': '0.325', 'backward_time': '0.323', 'optim_time': '0.112', 'total_time': '0.762'}, GPU, memory: usage: 1.519 GB, peak: 64.960 GB, cache: 65.945 GB, cache_peak: 65.945 GB [2024-08-12 23:01:41,928] [INFO] [logging.py:96:log_dist] [Rank 0] step=32800, skipped=0, lr=[0.0005738102947155223], mom=[(0.9, 0.999)] [2024-08-12 23:01:41,940] [INFO] [timer.py:258:stop] epoch=0/micro_step=4800/global_step=4800, RunningAvgSamplesPerSec=14.88238640634599, CurrSamplesPerSec=20.680793285639226, MemAllocated=1.8GB, MaxMemAllocated=64.96GB [2024-08-12 23:01:41,941][root][INFO] - train, rank: 1, epoch: 0/500, data_slice: 0/5, step_in_slice: 32800/81081, step_in_epoch: 32800, total step: 32800, (loss_avg_rank: 0.521), (loss_avg_slice: 0.673), (ppl_avg_slice: 1.960e+00), (acc_avg_slice: 0.747), (lr: 5.738e-04), [('loss_att', 0.475), ('acc', 0.812), ('loss_pre', 0.046), ('loss', 0.521)], {'data_load': '0.000', 'forward_time': '0.304', 'backward_time': '0.323', 'optim_time': '0.148', 'total_time': '0.777'}, GPU, memory: usage: 1.804 GB, peak: 65.143 GB, cache: 66.145 GB, cache_peak: 66.145 GB [2024-08-12 23:01:41,941][root][INFO] - train, rank: 3, epoch: 0/500, data_slice: 0/5, 
step_in_slice: 32800/81081, step_in_epoch: 32800, total step: 32800, (loss_avg_rank: 0.619), (loss_avg_slice: 0.670), (ppl_avg_slice: 1.954e+00), (acc_avg_slice: 0.748), (lr: 5.738e-04), [('loss_att', 0.566), ('acc', 0.78), ('loss_pre', 0.054), ('loss', 0.619)], {'data_load': '0.001', 'forward_time': '0.300', 'backward_time': '0.313', 'optim_time': '0.162', 'total_time': '0.777'}, GPU, memory: usage: 1.802 GB, peak: 62.395 GB, cache: 62.867 GB, cache_peak: 62.867 GB [2024-08-12 23:01:41,941][root][INFO] - train, rank: 4, epoch: 0/500, data_slice: 0/5, step_in_slice: 32800/81081, step_in_epoch: 32800, total step: 32800, (loss_avg_rank: 0.536), (loss_avg_slice: 0.671), (ppl_avg_slice: 1.957e+00), (acc_avg_slice: 0.747), (lr: 5.738e-04), [('loss_att', 0.489), ('acc', 0.788), ('loss_pre', 0.047), ('loss', 0.536)], {'data_load': '0.000', 'forward_time': '0.297', 'backward_time': '0.332', 'optim_time': '0.146', 'total_time': '0.777'}, GPU, memory: usage: 1.811 GB, peak: 59.564 GB, cache: 60.064 GB, cache_peak: 60.064 GB [2024-08-12 23:01:41,941][root][INFO] - train, rank: 2, epoch: 0/500, data_slice: 0/5, step_in_slice: 32800/81081, step_in_epoch: 32800, total step: 32800, (loss_avg_rank: 0.543), (loss_avg_slice: 0.671), (ppl_avg_slice: 1.956e+00), (acc_avg_slice: 0.749), (lr: 5.738e-04), [('loss_att', 0.49), ('acc', 0.831), ('loss_pre', 0.052), ('loss', 0.543)], {'data_load': '0.000', 'forward_time': '0.285', 'backward_time': '0.327', 'optim_time': '0.163', 'total_time': '0.777'}, GPU, memory: usage: 1.794 GB, peak: 60.563 GB, cache: 60.988 GB, cache_peak: 60.988 GB [2024-08-12 23:01:41,941][root][INFO] - train, rank: 6, epoch: 0/500, data_slice: 0/5, step_in_slice: 32800/81081, step_in_epoch: 32800, total step: 32800, (loss_avg_rank: 0.467), (loss_avg_slice: 0.671), (ppl_avg_slice: 1.957e+00), (acc_avg_slice: 0.746), (lr: 5.738e-04), [('loss_att', 0.419), ('acc', 0.829), ('loss_pre', 0.048), ('loss', 0.467)], {'data_load': '0.001', 'forward_time': '0.280', 'backward_time': '0.324', 'optim_time': '0.170', 'total_time': '0.777'}, GPU, memory: usage: 1.811 GB, peak: 61.295 GB, cache: 61.842 GB, cache_peak: 61.842 GB [2024-08-12 23:01:41,941][root][INFO] - train, rank: 5, epoch: 0/500, data_slice: 0/5, step_in_slice: 32800/81081, step_in_epoch: 32800, total step: 32800, (loss_avg_rank: 0.546), (loss_avg_slice: 0.668), (ppl_avg_slice: 1.950e+00), (acc_avg_slice: 0.746), (lr: 5.738e-04), [('loss_att', 0.501), ('acc', 0.78), ('loss_pre', 0.045), ('loss', 0.546)], {'data_load': '0.001', 'forward_time': '0.288', 'backward_time': '0.316', 'optim_time': '0.172', 'total_time': '0.777'}, GPU, memory: usage: 1.789 GB, peak: 59.564 GB, cache: 60.074 GB, cache_peak: 60.074 GB [2024-08-12 23:01:41,942][root][INFO] - train, rank: 7, epoch: 0/500, data_slice: 0/5, step_in_slice: 32800/81081, step_in_epoch: 32800, total step: 32800, (loss_avg_rank: 0.520), (loss_avg_slice: 0.670), (ppl_avg_slice: 1.954e+00), (acc_avg_slice: 0.746), (lr: 5.738e-04), [('loss_att', 0.475), ('acc', 0.802), ('loss_pre', 0.045), ('loss', 0.52)], {'data_load': '0.001', 'forward_time': '0.304', 'backward_time': '0.344', 'optim_time': '0.126', 'total_time': '0.777'}, GPU, memory: usage: 1.800 GB, peak: 60.563 GB, cache: 61.027 GB, cache_peak: 61.027 GB [2024-08-12 23:01:41,943][root][INFO] - train, rank: 0, epoch: 0/500, data_slice: 0/5, step_in_slice: 32800/81081, step_in_epoch: 32800, total step: 32800, (loss_avg_rank: 0.526), (loss_avg_slice: 0.672), (ppl_avg_slice: 1.958e+00), (acc_avg_slice: 0.747), (lr: 5.738e-04), [('loss_att', 
0.482), ('acc', 0.821), ('loss_pre', 0.043), ('loss', 0.526)], {'data_load': '0.000', 'forward_time': '0.303', 'backward_time': '0.325', 'optim_time': '0.148', 'total_time': '0.777'}, GPU, memory: usage: 1.797 GB, peak: 64.960 GB, cache: 65.945 GB, cache_peak: 65.945 GB laod bad voice file!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [2024-08-12 23:04:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=32900, skipped=0, lr=[0.0005729376054790289], mom=[(0.9, 0.999)] [2024-08-12 23:04:07,257] [INFO] [timer.py:258:stop] epoch=0/micro_step=4900/global_step=4900, RunningAvgSamplesPerSec=14.776888232603792, CurrSamplesPerSec=23.81720738072651, MemAllocated=1.09GB, MaxMemAllocated=64.96GB [2024-08-12 23:04:07,257][root][INFO] - train, rank: 5, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.844), (loss_avg_slice: 0.668), (ppl_avg_slice: 1.951e+00), (acc_avg_slice: 0.747), (lr: 5.729e-04), [('loss_att', 0.728), ('acc', 0.626), ('loss_pre', 0.116), ('loss', 0.844)], {'data_load': '0.000', 'forward_time': '0.281', 'backward_time': '0.286', 'optim_time': '0.104', 'total_time': '0.673'}, GPU, memory: usage: 1.123 GB, peak: 59.564 GB, cache: 60.074 GB, cache_peak: 60.074 GB [2024-08-12 23:04:07,257][root][INFO] - train, rank: 2, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.998), (loss_avg_slice: 0.671), (ppl_avg_slice: 1.957e+00), (acc_avg_slice: 0.749), (lr: 5.729e-04), [('loss_att', 0.858), ('acc', 0.657), ('loss_pre', 0.14), ('loss', 0.998)], {'data_load': '0.000', 'forward_time': '0.288', 'backward_time': '0.276', 'optim_time': '0.108', 'total_time': '0.673'}, GPU, memory: usage: 1.105 GB, peak: 60.563 GB, cache: 60.988 GB, cache_peak: 60.988 GB [2024-08-12 23:04:07,257][root][INFO] - train, rank: 4, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.906), (loss_avg_slice: 0.672), (ppl_avg_slice: 1.958e+00), (acc_avg_slice: 0.747), (lr: 5.729e-04), [('loss_att', 0.8), ('acc', 0.679), ('loss_pre', 0.105), ('loss', 0.906)], {'data_load': '0.001', 'forward_time': '0.281', 'backward_time': '0.279', 'optim_time': '0.111', 'total_time': '0.673'}, GPU, memory: usage: 1.113 GB, peak: 59.564 GB, cache: 60.064 GB, cache_peak: 60.064 GB [2024-08-12 23:04:07,257][root][INFO] - train, rank: 6, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.927), (loss_avg_slice: 0.671), (ppl_avg_slice: 1.957e+00), (acc_avg_slice: 0.746), (lr: 5.729e-04), [('loss_att', 0.815), ('acc', 0.648), ('loss_pre', 0.112), ('loss', 0.927)], {'data_load': '0.001', 'forward_time': '0.279', 'backward_time': '0.269', 'optim_time': '0.123', 'total_time': '0.673'}, GPU, memory: usage: 1.124 GB, peak: 61.295 GB, cache: 61.842 GB, cache_peak: 61.842 GB [2024-08-12 23:04:07,257][root][INFO] - train, rank: 3, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.939), (loss_avg_slice: 0.670), (ppl_avg_slice: 1.955e+00), (acc_avg_slice: 0.748), (lr: 5.729e-04), [('loss_att', 0.795), ('acc', 0.688), ('loss_pre', 0.144), ('loss', 0.939)], {'data_load': '0.000', 'forward_time': '0.273', 'backward_time': '0.258', 'optim_time': '0.141', 'total_time': '0.673'}, GPU, memory: usage: 1.108 GB, peak: 62.395 GB, cache: 62.867 GB, cache_peak: 62.867 GB [2024-08-12 23:04:07,258][root][INFO] - train, rank: 1, epoch: 0/500, 
data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.888), (loss_avg_slice: 0.673), (ppl_avg_slice: 1.961e+00), (acc_avg_slice: 0.747), (lr: 5.729e-04), [('loss_att', 0.765), ('acc', 0.677), ('loss_pre', 0.123), ('loss', 0.888)], {'data_load': '0.001', 'forward_time': '0.291', 'backward_time': '0.271', 'optim_time': '0.109', 'total_time': '0.673'}, GPU, memory: usage: 1.097 GB, peak: 65.143 GB, cache: 66.145 GB, cache_peak: 66.145 GB [2024-08-12 23:04:07,258][root][INFO] - train, rank: 7, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.969), (loss_avg_slice: 0.670), (ppl_avg_slice: 1.954e+00), (acc_avg_slice: 0.746), (lr: 5.729e-04), [('loss_att', 0.854), ('acc', 0.674), ('loss_pre', 0.115), ('loss', 0.969)], {'data_load': '0.001', 'forward_time': '0.287', 'backward_time': '0.280', 'optim_time': '0.104', 'total_time': '0.673'}, GPU, memory: usage: 1.139 GB, peak: 60.563 GB, cache: 61.027 GB, cache_peak: 61.027 GB [2024-08-12 23:04:07,259][root][INFO] - train, rank: 0, epoch: 0/500, data_slice: 0/5, step_in_slice: 32900/81081, step_in_epoch: 32900, total step: 32900, (loss_avg_rank: 0.932), (loss_avg_slice: 0.672), (ppl_avg_slice: 1.958e+00), (acc_avg_slice: 0.747), (lr: 5.729e-04), [('loss_att', 0.791), ('acc', 0.703), ('loss_pre', 0.141), ('loss', 0.932)], {'data_load': '0.000', 'forward_time': '0.275', 'backward_time': '0.258', 'optim_time': '0.140', 'total_time': '0.675'}, GPU, memory: usage: 1.088 GB, peak: 64.960 GB, cache: 65.945 GB, cache_peak: 65.945 GB Error executing job with overrides: '++model=/lxc-data/FunASR/examples/industrial_data_pretraining/paraformer_streaming/modelscope_models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online', '++train_data_set_list=../../../data/10w_list/15Wdata_nochat_5s_train.jsonl', '++valid_data_set_list=../../../data/10w_list/2Wdata_nochat_test.jsonl', '++dataset=AudioDataset', '++dataset_conf.index_ds=IndexDSJsonl', '++dataset_conf.data_split_num=5', '++dataset_conf.batch_sampler=BatchSampler', '++dataset_conf.batch_size=30000', '++dataset_conf.sort_size=1024', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=8', '++train_conf.max_epoch=500', '++train_conf.log_interval=100', '++train_conf.resume=true', '++train_conf.validate_interval=4000', '++train_conf.save_checkpoint_interval=4000', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++train_conf.use_deepspeed=true', '++train_conf.deepspeed_config=/lxc-data/FunASR/examples/deepspeed_conf/ds_stage2.json', '++optim_conf.lr=0.0006', '++output_dir=./15W_train_ds': Traceback (most recent call last): rank2: File "/lxc-data/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 229, in

rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main

rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra

rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app

rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report rank2: raise ex rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report rank2: return func() rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in rank2: lambda: hydra.run( rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/internal/hydra.py", line 132, in run rank2: = ret.return_value rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value rank2: raise self._return_value rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job rank2: ret.return_value = task_function(task_cfg) rank2: File "/lxc-data/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 56, in main_hydra

rank2: File "/lxc-data/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 177, in main

rank2: File "/lxc-data/FunASR/funasr/train_utils/trainer_ds.py", line 602, in train_epoch rank2: self.forward_step(model, batch, loss_dict=loss_dict) rank2: File "/lxc-data/FunASR/funasr/train_utils/trainer_ds.py", line 671, in forward_step rank2: retval = model(batch) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank2: return self._call_impl(*args, *kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank2: return forward_call(args, kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn rank2: ret_val = func(*args, kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1846, in forward rank2: loss = self.module(*inputs, *kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank2: return self._call_impl(args, kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank2: return forward_call(*args, kwargs) rank2: File "/lxc-data/FunASR/funasr/models/paraformer_streaming/model.py", line 121, in forward rank2: loss_att, acc_att, cer_att, wer_att, loss_pre, pre_loss_att = self._calc_att_predictor_loss( rank2: File "/lxc-data/FunASR/funasr/models/paraformer_streaming/model.py", line 279, in _calc_att_predictor_loss rank2: decoder_outs = self.decoder( rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank2: return self._call_impl(*args, *kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank2: return forward_call(args, kwargs) rank2: File "/lxc-data/FunASR/funasr/models/paraformer/decoder.py", line 397, in forward rank2: x, tgt_mask, memory, memorymask, = self.decoders(x, tgt_mask, memory, memory_mask) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank2: return self._call_impl(*args, kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank2: return forward_call(*args, kwargs) rank2: File "/lxc-data/FunASR/funasr/models/transformer/utils/repeat.py", line 32, in forward rank2: args = m(args) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank2: return self._call_impl(args, kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank2: return forward_call(*args, kwargs) rank2: File "/lxc-data/FunASR/funasr/models/paraformer/decoder.py", line 117, in forward rank2: x_src_attn = self.src_attn(x, memory, memory_mask, ret_attn=False) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank2: return self._call_impl(*args, *kwargs) rank2: File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank2: return 
forward_call(args, *kwargs) rank2: File "/lxc-data/FunASR/funasr/models/sanm/attention.py", line 717, in forward rank2: return self.forward_attention(v_h, scores, memory_mask, ret_attn=ret_attn) rank2: File "/lxc-data/FunASR/funasr/models/sanm/attention.py", line 694, in forward_attention rank2: x.transpose(1, 2).contiguous().view(n_batch, -1, self.h self.d_k) rank2: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 336.00 MiB. GPU  has a total capacity of 79.15 GiB of which 141.25 MiB is free. Process 2109115 has 78.99 GiB memory in use. Of the allocated memory 77.21 GiB is allocated by PyTorch, and 283.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) liangxianchen-funasr-15wh-train-lxc0-w-0:15325:16199 [2] NCCL INFO [Service thread] Connection closed by localRank 2 liangxianchen-funasr-15wh-train-lxc0-w-0:15323:16203 [0] NCCL INFO [Service thread] Connection closed by localRank 2 liangxianchen-funasr-15wh-train-lxc0-w-0:15327:16195 [4] NCCL INFO [Service thread] Connection closed by localRank 2 liangxianchen-funasr-15wh-train-lxc0-w-0:15329:16189 [6] NCCL INFO [Service thread] Connection closed by localRank 2 [rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. [rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600036 milliseconds before timing out. [rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600047 milliseconds before timing out. [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600038 milliseconds before timing out. [rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. [rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. 
liangxianchen-funasr-15wh-train-lxc0-w-0:15323:16171 [0] NCCL INFO [Service thread] Connection closed by localRank 4 liangxianchen-funasr-15wh-train-lxc0-w-0:15325:16160 [2] NCCL INFO [Service thread] Connection closed by localRank 4 liangxianchen-funasr-15wh-train-lxc0-w-0:15329:16158 [6] NCCL INFO [Service thread] Connection closed by localRank 4 liangxianchen-funasr-15wh-train-lxc0-w-0:15327:16162 [4] NCCL INFO [Service thread] Connection closed by localRank 4 liangxianchen-funasr-15wh-train-lxc0-w-0:15330:16168 [7] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-w-0:15328:16164 [5] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-w-0:15323:16171 [0] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-w-0:15325:16160 [2] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-w-0:15329:16158 [6] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-w-0:15327:16162 [4] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-w-0:15323:16171 [0] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-w-0:15325:16160 [2] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-w-0:15329:16158 [6] NCCL INFO [Service thread] Connection closed by localRank 6 liangxianchen-funasr-15wh-train-lxc0-w-0:15327:16162 [4] NCCL INFO [Service thread] Connection closed by localRank 7 liangxianchen-funasr-15wh-train-lxc0-w-0:15323:16171 [0] NCCL INFO [Service thread] Connection closed by localRank 6 liangxianchen-funasr-15wh-train-lxc0-w-0:15325:16160 [2] NCCL INFO [Service thread] Connection closed by localRank 6 liangxianchen-funasr-15wh-train-lxc0-w-0:15329:16158 [6] NCCL INFO [Service thread] Connection closed by localRank 5 liangxianchen-funasr-15wh-train-lxc0-w-0:15327:16162 [4] NCCL INFO [Service thread] Connection closed by localRank 6 liangxianchen-funasr-15wh-train-lxc0-w-0:15323:16171 [0] NCCL INFO [Service thread] Connection closed by localRank 0 liangxianchen-funasr-15wh-train-lxc0-w-0:15325:16160 [2] NCCL INFO [Service thread] Connection closed by localRank 0 liangxianchen-funasr-15wh-train-lxc0-w-0:15327:16162 [4] NCCL INFO [Service thread] Connection closed by localRank 0 liangxianchen-funasr-15wh-train-lxc0-w-0:15329:16158 [6] NCCL INFO [Service thread] Connection closed by localRank 0 liangxianchen-funasr-15wh-train-lxc0-w-0:15323:16171 [0] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-w-0:15325:16160 [2] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-w-0:15327:16162 [4] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-w-0:15326:16166 [3] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-w-0:15329:16158 [6] NCCL INFO [Service thread] Connection closed by localRank 3 liangxianchen-funasr-15wh-train-lxc0-w-0:15323:16171 [0] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-w-0:15325:16160 [2] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-w-0:15324:16156 [1] NCCL INFO [Service thread] Connection closed by localRank 1 
liangxianchen-funasr-15wh-train-lxc0-w-0:15327:16162 [4] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-w-0:15329:16158 [6] NCCL INFO [Service thread] Connection closed by localRank 1 liangxianchen-funasr-15wh-train-lxc0-w-0:15330:16097 [7] NCCL INFO comm 0xb669080 rank 7 nranks 16 cudaDev 7 busId e0000 - Abort COMPLETE liangxianchen-funasr-15wh-train-lxc0-w-0:15328:16099 [5] NCCL INFO comm 0xb774600 rank 5 nranks 16 cudaDev 5 busId a1000 - Abort COMPLETE [rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 7] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down. [rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 5] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600019 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f612177a897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f60d545a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f60d545efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f60d546031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7f6120ec7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f61227f5ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f6122887850 in /lib/x86_64-linux-gnu/libc.so.6)

liangxianchen-funasr-15wh-train-lxc0-w-0:15326:16093 [3] NCCL INFO comm 0xa7b60c0 rank 3 nranks 16 cudaDev 3 busId 4e000 - Abort COMPLETE [rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down. [rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 3] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8b700cf897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8b23e5a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8b23e5efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8b23e6031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7f8b6f8c7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f8b711bbac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f8b7124d850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600036 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74c517a897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f7478e5a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7478e5efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7478e6031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7f74c48c7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f74c627fac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f74c6311850 in /lib/x86_64-linux-gnu/libc.so.6)

liangxianchen-funasr-15wh-train-lxc0-w-0:15324:16101 [1] NCCL INFO comm 0xa851c00 rank 1 nranks 16 cudaDev 1 busId 10000 - Abort COMPLETE [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 1] Timeout at NCCL work: 15734, last enqueued NCCL work: 15735, last completed NCCL work: 15733. [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15734, OpType=ALLREDUCE, NumelIn=220084533, NumelOut=220084533, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fab8c77a897 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fab4045a1b2 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fab4045efd0 in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fab4046031c in /lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdbbf4 (0x7fab8bec7bf4 in /lxc-data/minianconda3/envs/python39/bin/../lib/libstdc++.so.6) frame #5: + 0x94ac3 (0x7fab8d849ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fab8d8db850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 2] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank2]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 2] ProcessGroupNCCL preparing to dump debug info.
W0812 23:21:02.886410 139807001610048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 15323 closing signal SIGTERM
W0812 23:21:02.893118 139807001610048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 15325 closing signal SIGTERM
W0812 23:21:02.897074 139807001610048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 15326 closing signal SIGTERM
W0812 23:21:02.898950 139807001610048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 15327 closing signal SIGTERM
W0812 23:21:02.902364 139807001610048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 15328 closing signal SIGTERM
W0812 23:21:02.906480 139807001610048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 15329 closing signal SIGTERM
W0812 23:21:02.910654 139807001610048 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 15330 closing signal SIGTERM
E0812 23:21:23.374342 139807001610048 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 15324) of binary: /lxc-data/minianconda3/envs/python39/bin/python
Traceback (most recent call last):
  File "/lxc-data/minianconda3/envs/python39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/lxc-data/minianconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../../funasr/bin/train_ds.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-12_23:21:02
  host      : liangxianchen-funasr-15wh-train-lxc0-w-0.liangxianchen-funasr-15wh-train-lxc0.hbox-aigc.svc.hbox2-zzzc2-prd.local
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 15324)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 15324
============================================================

#### What have you tried?

I tried increasing the timeout and lowering the batch size. In most cases there is no CUDA out-of-memory error and only about half of the GPU memory is used, but occasionally a single step overflows GPU memory, which crashes that node and eventually makes the whole job hit the NCCL timeout. The dataloader uses the CustomDistributedBufferDynamicBatchSampler, so why does GPU memory usage change so noticeably from step to step? (A per-step peak-memory check is sketched at the end of this post.)

    export TORCH_NCCL_BLOCKING_WAIT=1
    # export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
    export NCCL_ASYNC_ERROR_HANDLING=1

    if use_deepspeed:
        logging.info(f"use_deepspeed: {use_deepspeed}")
        os.environ['NCCL_BLOCKING_WAIT'] = '1'
        os.environ['TORCH_NCCL_BLOCKING_WAIT'] = '1'
        deepspeed.init_distributed(dist_backend=kwargs.get("backend", "nccl"), timeout=timedelta(seconds=7200000))
    elif use_ddp or use_fsdp:
        logging.info(f"use_ddp: {use_ddp}, use_fsdp: {use_fsdp}")
        os.environ['NCCL_BLOCKING_WAIT'] = '1'
        os.environ['TORCH_NCCL_BLOCKING_WAIT'] = '1'
        dist.init_process_group(backend=kwargs.get("backend", "nccl"), init_method="env://", timeout=timedelta(seconds=7200000))
        torch.cuda.set_device(local_rank)

#### What's your environment?

CUDA 12.4, torch 2.1, python 3.9

- OS (e.g., Linux):
- FunASR Version (e.g., 1.0.0):
- ModelScope Version (e.g., 1.11.0):
- PyTorch Version (e.g., 2.0.0):
- How you installed funasr (`pip`, source):
- Python version:
- GPU (e.g., V100M32)
- CUDA/cuDNN version (e.g., cuda11.7):
- Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
- Any other relevant information:
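To narrow down which step actually spikes, a per-step peak-memory log along the following lines could be added to the training loop (a rough sketch only; `model`, `batch`, `optimizer`, and the 40 GB threshold are placeholders, not FunASR's actual training code):

    import torch
    import torch.distributed as dist

    def train_step_with_memory_log(model, batch, optimizer, step):
        # Reset the peak counter so max_memory_allocated() reflects this step only.
        torch.cuda.reset_peak_memory_stats()

        loss = model(**batch)["loss"]  # placeholder forward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Peak memory allocated on this rank during this step, in GB.
        peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
        if peak_gb > 40:  # arbitrary threshold; tune to the GPU size
            print(f"rank {dist.get_rank()}, step {step}: peak memory {peak_gb:.1f} GB")
        return loss
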
LauraGPT commented 2 months ago

The root cause is wav files with very long durations. You could solve it by:

  1. Filtering out long-duration wavs by setting the corresponding values in config.yaml (see https://github.com/modelscope/FunASR/blob/main/funasr/datasets/audio_datasets/index_ds.py); a rough sketch of such a filter is given after the code below.
  2. Changing the batch_size computation to add a batch_size_scale_threshold (https://github.com/modelscope/FunASR/blob/main/funasr/datasets/audio_datasets/samplers.py#L401), for example:
        # When the longest sample in the candidate batch exceeds
        # batch_size_scale_threshold, shrink the batch size proportionally
        # so a single very long wav cannot blow up per-step GPU memory.
        batch_size = (
            self.batch_size * self.batch_size_scale_threshold / potential_max_len_in_batch
            if potential_max_len_in_batch > self.batch_size_scale_threshold
            else self.batch_size
        )
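
For the first point, the exact config keys depend on the dataset setup; as a rough sketch of the idea (the `source_len` field and the 3000 cap below are assumptions for illustration, not FunASR's actual names), the filter amounts to dropping over-long items before batching:

    # Illustrative only: drop samples whose length exceeds a cap before they
    # reach the batch sampler, so one very long wav cannot dominate a batch.
    MAX_SOURCE_LEN = 3000  # assumed unit (frames/tokens); align with the real config

    index = [item for item in index if item.get("source_len", 0) <= MAX_SOURCE_LEN]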
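
A quick sanity check of the scaling above, with made-up numbers: if `self.batch_size = 64`, `batch_size_scale_threshold = 2000`, and the longest sample in the candidate batch is 8000 frames, the expression gives 64 * 2000 / 8000 = 16, i.e. the batch shrinks in proportion to how far the longest sample overshoots the threshold. Since the division returns a float, it likely needs an `int(...)` (or `max(1, int(...))`) guard before being used as a sample count.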