hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0
38.74k stars 4.34k forks

[BUG]: RuntimeError: CUDA error: unknown error #3584

Open LYMDLUT opened 1 year ago

LYMDLUT commented 1 year ago

🐛 Describe the bug

[04/17/23 20:35:20] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[04/17/23 20:35:22] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/initialize.py:115 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Files already downloaded and verified
Files already downloaded and verified
[04/17/23 20:35:27] INFO colossalai - ProcessGroup - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/tensor/process_group.py:22 log_pg_init
INFO colossalai - ProcessGroup - INFO: Pytorch ProcessGroup Init:
backend: nccl
ranks: [0]
[extension] OP colossalai._C.cpu_adam has been compileed ahead of time, skip building.
[extension] OP colossalai._C.fused_optim has been compileed ahead of time, skip building.
[04/17/23 20:35:28] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py:251 _partition_param_list
INFO colossalai - colossalai - INFO: Number of elements on ranks: [23712932]
##################
Traceback (most recent call last):
  File "/mnt/d/CIFAR100_timm/train_fgsm_colossai.py", line 236, in <module>
    main()
  File "/mnt/d/CIFAR100_timm/train_fgsm_colossai.py", line 165, in main
    output = model(X + delta[:X.size(0)])
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/timm/models/resnet.py", line 730, in forward
    x = self.forward_features(x)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/timm/models/resnet.py", line 718, in forward_features
    x = self.layer2(x)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/timm/models/resnet.py", line 463, in forward
    shortcut = self.downsample(shortcut)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 91, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 193, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: CUDA error: unknown error

Environment

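1060 6G GPU, i7-8750H CPU, WSL2, Ubuntu 20.04. This is not a bug in the code: the code ran fine right after it was written, but after this environment has been running for a while it fails with the unknown error.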

JThh commented 1 year ago

You may try to decrease the batch size (if it is not already one). This might be due to CUDA OOM.
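
One way to try the batch-size suggestion systematically is to halve the batch size each time a step fails with a CUDA-looking `RuntimeError`. The following is a minimal sketch under stated assumptions, not code from the issue or from ColossalAI: `run_with_batch_backoff` and `fake_step` are hypothetical names, and `fake_step` only stands in for a real training step. After a genuine "CUDA error: unknown error" the CUDA context may be unusable, so in a real run it may be necessary to restart the process with the smaller batch size rather than retry in place.

```python
def run_with_batch_backoff(step, batch_size, min_batch_size=1):
    """Call step(batch_size); halve batch_size after each CUDA-looking RuntimeError.

    Returns (batch_size_that_worked, result_of_step).
    Hypothetical helper for illustration only.
    """
    while True:
        try:
            return batch_size, step(batch_size)
        except RuntimeError as err:
            message = str(err)
            looks_like_cuda = "out of memory" in message or "CUDA error" in message
            # Re-raise unrelated errors, and give up once the floor is reached.
            if not looks_like_cuda or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_step(batch_size):
    # Stand-in for one training step (assumption): pretend that anything
    # above 32 samples per batch fails the way the report describes.
    if batch_size > 32:
        raise RuntimeError("CUDA error: unknown error")
    return f"ok at batch size {batch_size}"


if __name__ == "__main__":
    print(run_with_batch_backoff(fake_step, 128))
```

If the error still appears at a batch size of 1, memory pressure is probably not the cause, and the WSL2 environment or GPU driver would be the next thing to check.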