hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary: #3556

Open Haoran1234567 opened 1 year ago

Haoran1234567 commented 1 year ago

🐛 Describe the bug

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary:

[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807582 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809346 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805522 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804789 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807703 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805474 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804809 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 514949 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary: /home/qihaoran/.conda/envs/coati_test/bin/python
Traceback (most recent call last):
  File "/home/qihaoran/.conda/envs/coati_test/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_sft.py FAILED

Failures:
  [1]:
    time       : 2023-04-13_10:28:15
    host       : gpu8
    rank       : 1 (local_rank: 1)
    exitcode   : -6 (pid: 514947)
    error_file : <N/A>
    traceback  : Signal 6 (SIGABRT) received by PID 514947
  [2]:
    time       : 2023-04-13_10:28:15
    host       : gpu8
    rank       : 4 (local_rank: 4)
    exitcode   : -6 (pid: 514950)
    error_file : <N/A>
    traceback  : Signal 6 (SIGABRT) received by PID 514950

Root Cause (first observed failure):
  [0]:
    time       : 2023-04-13_10:28:15
    host       : gpu8
    rank       : 0 (local_rank: 0)
    exitcode   : -6 (pid: 514946)
    error_file : <N/A>
    traceback  : Signal 6 (SIGABRT) received by PID 514946

Environment

Colossal-AI version: 0.2.8
PyTorch version: 1.13.0
System CUDA version: 11.7
CUDA version required by PyTorch: 11.7

Haoran1234567 commented 1 year ago

When I train train_sft.py with a very small amount of training data (instruction_wild), it runs normally. After I add some more data this error occurs, even though the amount of added data is not large.

I am using 8x RTX 3090 GPUs (24 GB each).

Haoran1234567 commented 1 year ago

Here is my run script:

torchrun --standalone --nproc_per_node=8 train_sft.py \
    --pretrain $PRETRAIN \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --batch_size 2 \
    --accimulation_steps 16 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 3

I also changed the strategy setup to: elif args.strategy == 'colossalai_zero2': strategy = ColossalAIStrategy(stage=2, placement_policy='cpu'), because with placement_policy='cuda' I get an OOM error. A rough sketch of this change is shown below.
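For context, a minimal sketch of what that branch looks like after the change (not verbatim from train_sft.py; the import path follows the Coati examples and may differ in other versions):

```python
# Sketch only: ZeRO-2 with CPU placement offloads sharded gradients/optimizer
# states into main memory, avoiding the GPU OOM seen with placement_policy='cuda'
# at the cost of higher host-RAM usage and slower steps.
from coati.trainer.strategies import ColossalAIStrategy


def build_strategy(name: str):
    if name == 'colossalai_zero2':
        # originally placement_policy='cuda'; changed to 'cpu' to avoid GPU OOM
        return ColossalAIStrategy(stage=2, placement_policy='cpu')
    raise ValueError(f"unsupported strategy: {name}")
```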

xienan0326 commented 1 year ago

I am running into the same problem. Have you resolved it? [screenshot]

Camille7777 commented 1 year ago

> Here is my run script: torchrun --standalone --nproc_per_node=8 train_sft.py […] because with placement_policy='cuda' I get an OOM error.

Hi @Haoran1234567, can you check the memory usage during your SFT training, and how much data did you actually add? BTW, you can directly pass --strategy colossalai_zero2_cpu to use a different placement policy.
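If it helps, here is a minimal sketch of how to log CPU and GPU memory every few steps during training (assumes psutil is installed; the helper below is illustrative, not part of Coati):

```python
import psutil
import torch


def log_memory(step: int, interval: int = 10) -> None:
    """Print host RSS and GPU allocation every `interval` steps."""
    if step % interval != 0:
        return
    rss_gib = psutil.Process().memory_info().rss / 1024**3
    gpu_gib = torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0.0
    print(f"step {step}: CPU RSS {rss_gib:.2f} GiB, GPU allocated {gpu_gib:.2f} GiB")
```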

Camille7777 commented 1 year ago

> I am running into the same problem. Have you resolved it? [screenshot]

@xienan0326 Can you provide more information about this error? Your exit code is actually different from theirs. :)

xienan0326 commented 1 year ago

> @xienan0326 Can you provide more information about this error? Your exit code is actually different from theirs. :)

Sure, here are the details:

Map:  96%|██████████████████████████████████████████████████
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5620 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5622 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 5621) of binary: /usr/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:
  [1]:
    time       : 2023-04-18_08:32:53
    host       : e27e4dca5000
    rank       : 3 (local_rank: 3)
    exitcode   : -7 (pid: 5623)
    error_file : <N/A>
    traceback  : Signal 7 (SIGBUS) received by PID 5623

Root Cause (first observed failure):
  [0]:
    time       : 2023-04-18_08:32:53
    host       : e27e4dca5000
    rank       : 1 (local_rank: 1)
    exitcode   : -7 (pid: 5621)
    error_file : <N/A>
    traceback  : Signal 7 (SIGBUS) received by PID 5621

Camille7777 commented 1 year ago

> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 5621) of binary: /usr/bin/python3.10 […] traceback : Signal 7 (SIGBUS) received by PID 5621

Hi, have you checked the memory usage during training? It is likely that the CPU runs out of memory. You can try allocating more main memory and running it again.
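As a quick check, you can print the available main memory on the node right before launching training. A minimal sketch, assuming Linux and reading /proc/meminfo:

```python
# Sketch: report available host memory. If this is small relative to the model
# plus ZeRO-2 CPU offload, processes being killed mid-run is a likely outcome.
def mem_available_gib() -> float:
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f)
    kib = int(meminfo["MemAvailable"].split()[0])
    return kib / 1024**2


if __name__ == "__main__":
    print(f"MemAvailable: {mem_available_gib():.1f} GiB")
```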

xienan0326 commented 1 year ago

  0%|          | 0/21 [00:00<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1318 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 1315) of binary: /usr/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

fastchat/train/train_mem.py FAILED

Failures:
  [1]:
    time       : 2023-04-20_11:37:39
    host       : ab00e35df170
    rank       : 1 (local_rank: 1)
    exitcode   : -7 (pid: 1316)
    error_file : <N/A>
    traceback  : Signal 7 (SIGBUS) received by PID 1316
  [2]:
    time       : 2023-04-20_11:37:39
    host       : ab00e35df170
    rank       : 2 (local_rank: 2)
    exitcode   : -7 (pid: 1317)
    error_file : <N/A>
    traceback  : Signal 7 (SIGBUS) received by PID 1317

Root Cause (first observed failure):
  [0]:
    time       : 2023-04-20_11:37:39
    host       : ab00e35df170
    rank       : 0 (local_rank: 0)
    exitcode   : -7 (pid: 1315)
    error_file : <N/A>
    traceback  : Signal 7 (SIGBUS) received by PID 1315

xienan0326 commented 1 year ago

> Have you checked the memory usage during training? It is likely that the CPU runs out of memory. You can try allocating more main memory and running it again.

It doesn't seem like a memory issue; the peak is 90 GB. [screenshot]

Camille7777 commented 1 year ago

> It doesn't seem like a memory issue; the peak is 90 GB. […]

Hi, @xienan0326 can you provide your running command for this?

xienan0326 commented 1 year ago

> Hi, @xienan0326 can you provide your running command for this?

Thank you. That bug has been fixed by setting --shm-size="1g". But now I have hit another bug; very sad, please help me:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f28eac4df57 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f28eac12abb in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f28eacf2158 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::SetDevice(int) + 0x3d (0x7f28eacf278d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: std::_Sp_counted_ptr_inplace<std::vector<at::cuda::CUDAEvent, std::allocator<at::cuda::CUDAEvent> >, std::allocator<std::vector<at::cuda::CUDAEvent, std::allocator<at::cuda::CUDAEvent> > >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x9a (0x7f28ebefb0ea in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x3b0 (0x7f28ebec8b30 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x20a (0x7f28ebedc82a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x75 (0x7f28ebedcaa5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0xd6de4 (0x7f2942e3fde4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x8609 (0x7f2962754609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #10: clone + 0x43 (0x7f296288e133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 113) of binary: /usr/bin/python3.10
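For anyone else hitting the SIGBUS (exitcode -7) inside a container: the undersized shared-memory mount can be confirmed with a couple of lines of Python. A minimal sketch, assuming Linux with /dev/shm mounted:

```python
# DataLoader workers exchange tensors through /dev/shm; Docker's default of
# 64 MiB is easily exhausted, which shows up as Signal 7 (SIGBUS) / exitcode -7.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total {total / 1024**2:.0f} MiB, free {free / 1024**2:.0f} MiB")
```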

quantumiracle commented 7 months ago

Hi,

Same issue here. Has this problem been resolved?

chenchen333-dev commented 7 months ago

> 🐛 Describe the bug
>
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary: […]
>
> Colossal-AI version: 0.2.8, PyTorch version: 1.13.0, System CUDA version: 11.7, CUDA version required by PyTorch: 11.7

Has this been resolved? I am getting the error below as well; my torch version is 2.2 and batch_size is set to 1. [screenshot]
