Haoran1234567 opened this issue 1 year ago
When I run train_sft.py with a very small instruction_wild training set, it runs normally. When I add a bit more data (not much), this error happens.
I use 8 x RTX 3090 (24 GB each).
Here is my run script:
torchrun --standalone --nproc_per_node=8 train_sft.py \
    --pretrain $PRETRAIN \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --batch_size 2 \
    --accimulation_steps 16 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 3
I also changed the strategy setup to elif args.strategy == 'colossalai_zero2': strategy = ColossalAIStrategy(stage=2, placement_policy='cpu'), because with placement_policy='cuda' I get an OOM error.
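For reference, the strategy selection in my copy of train_sft.py now looks roughly like the sketch below (the import path and the NaiveStrategy branch follow the Coati example scripts and may differ in your version):

# Sketch of the strategy selection in train_sft.py after the change.
# Import path and class names follow the Coati examples at the time; adjust to your version.
from coati.trainer.strategies import ColossalAIStrategy, NaiveStrategy

def build_strategy(name: str):
    if name == 'naive':
        return NaiveStrategy()
    if name == 'colossalai_zero2':
        # was placement_policy='cuda', which OOMs on 24 GB cards;
        # 'cpu' keeps ZeRO stage 2 but offloads optimizer states and gradients to host memory
        return ColossalAIStrategy(stage=2, placement_policy='cpu')
    raise ValueError(f'Unsupported strategy "{name}"')

strategy = build_strategy(args.strategy)  # args comes from the script's argparse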
I meet the same problem. Have you resolved it?
Hi @Haoran1234567, can you check the memory usage during your SFT training, and how much data you actually added?
BTW, you can directly set --strategy colossalai_zero2_cpu to use a different placement policy.
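For example, the same launch command as above with only the strategy flag changed:

torchrun --standalone --nproc_per_node=8 train_sft.py \
    --pretrain $PRETRAIN \
    --model 'llama' \
    --strategy colossalai_zero2_cpu \
    --log_interval 10 \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --batch_size 2 \
    --accimulation_steps 16 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 3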
@xienan0326, can you provide more information about this error? Actually, your exit code is different from his/hers. :) Details, please.
Map: 96%|██████████████████████████████████████████████████
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5620 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5622 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 5621) of binary: /usr/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
  [1]:
    time       : 2023-04-18_08:32:53
    host       : e27e4dca5000
    rank       : 3 (local_rank: 3)
    exitcode   : -7 (pid: 5623)
    error_file : <N/A>
    traceback  : Signal 7 (SIGBUS) received by PID 5623
Root Cause (first observed failure):
  [0]:
    time       : 2023-04-18_08:32:53
    host       : e27e4dca5000
    rank       : 1 (local_rank: 1)
    exitcode   : -7 (pid: 5621)
    error_file : <N/A>
    traceback  : Signal 7 (SIGBUS) received by PID 5621
Hi, have you checked the memory usage during training? It is likely that the CPU runs out of memory. You can try allocating more main memory and running it again.
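If it helps, a rough way to watch host memory while the job runs is a small side script like the sketch below (it uses the third-party psutil package and is not part of the Coati code):

# watch_mem.py - print host RAM and swap usage every few seconds (rough sketch, stop with Ctrl-C).
import time

import psutil  # third-party: pip install psutil

while True:
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"host RAM used: {vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB ({vm.percent}%), "
          f"swap used: {swap.used / 1e9:.1f} GB")
    time.sleep(5)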
It doesn't seem like a memory issue; the peak is 90 GB.
Hi, @xienan0326 can you provide your running command for this?
Thank you. That bug has been fixed by adding --shm-size=1g. But I got another bug; very sad, please help me:
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f28eac4df57 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f28eac12abb in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f28eacf2158 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::SetDevice(int) + 0x3d (0x7f28eacf278d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: std::_Sp_counted_ptr_inplace<std::vector<at::cuda::CUDAEvent, std::allocator
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 113) of binary: /usr/bin/python3.10
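For anyone else hitting the SIGBUS above: the --shm-size fix just means launching the container with a larger shared-memory segment. A hypothetical docker invocation (image name, mounts, and the training command are placeholders for your actual setup):

docker run --gpus all --shm-size=1g -it \
    -v /path/to/data:/data \
    your-training-image:latest \
    torchrun --standalone --nproc_per_node=8 train_sft.py ...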
Hi,
Same issue here. Is this problem solved?
Has this been resolved? I am getting the error below as well; my torch is version 2.2 and batch_size is set to 1.
🐛 Describe the bug
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary:
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807582 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809346 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805522 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804789 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807703 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805474 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804809 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 514949 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary: /home/qihaoran/.conda/envs/coati_test/bin/python
Traceback (most recent call last):
File "/home/qihaoran/.conda/envs/coati_test/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/qihaoran/.conda/envs/coati_test/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_sft.py FAILED
Failures:
  [1]:
    time       : 2023-04-13_10:28:15
    host       : gpu8
    rank       : 1 (local_rank: 1)
    exitcode   : -6 (pid: 514947)
    error_file : <N/A>
    traceback  : Signal 6 (SIGABRT) received by PID 514947
  [2]:
    time       : 2023-04-13_10:28:15
    host       : gpu8
    rank       : 4 (local_rank: 4)
    exitcode   : -6 (pid: 514950)
    error_file : <N/A>
    traceback  : Signal 6 (SIGABRT) received by PID 514950
Root Cause (first observed failure):
  [0]:
    time       : 2023-04-13_10:28:15
    host       : gpu8
    rank       : 0 (local_rank: 0)
    exitcode   : -6 (pid: 514946)
    error_file : <N/A>
    traceback  : Signal 6 (SIGABRT) received by PID 514946
Environment
Colossal-AI version: 0.2.8
PyTorch version: 1.13.0
System CUDA version: 11.7
CUDA version required by PyTorch: 11.7
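If the broadcast really does need more than the default 30-minute window (the Timeout(ms)=1800000 in the log above), one thing to try is enlarging the process-group timeout where torch.distributed is initialized; a minimal sketch, assuming you can reach that call in your setup (ColossalAI normally wraps it in its launch helpers):

# Sketch: widen the NCCL watchdog window from the default 30 minutes.
# Note: if one rank actually crashed, the timeout is only a symptom and a longer window will not help.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))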