[04/17/23 20:35:20] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[04/17/23 20:35:22] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/initialize.py:115 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Files already downloaded and verified
Files already downloaded and verified
[04/17/23 20:35:27] INFO colossalai - ProcessGroup - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/tensor/process_group.py:22 log_pg_init
INFO colossalai - ProcessGroup - INFO: Pytorch ProcessGroup Init:
backend: nccl
ranks: [0]
[extension] OP colossalai._C.cpu_adam has been compileed ahead of time, skip building.
[extension] OP colossalai._C.fused_optim has been compileed ahead of time, skip building.
[04/17/23 20:35:28] INFO colossalai - colossalai - INFO: /home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py:251 _partition_param_list
INFO colossalai - colossalai - INFO: Number of elements on ranks: [23712932]
##################
Traceback (most recent call last):
  File "/mnt/d/CIFAR100_timm/train_fgsm_colossai.py", line 236, in <module>
    main()
  File "/mnt/d/CIFAR100_timm/train_fgsm_colossai.py", line 165, in main
    output = model(X + delta[:X.size(0)])
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/timm/models/resnet.py", line 730, in forward
    x = self.forward_features(x)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/timm/models/resnet.py", line 718, in forward_features
    x = self.layer2(x)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/timm/models/resnet.py", line 463, in forward
    shortcut = self.downsample(shortcut)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 91, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
  File "/home/lym/miniconda3/envs/lab3/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 193, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: CUDA error: unknown error
🐛 Describe the bug
Environment
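GPU: 1060 6 GB; CPU: i7-8750H; OS: Ubuntu 20.04 on WSL2. This is not a bug in the code: the script ran correctly right after it was written, but in this environment it fails with "CUDA error: unknown error" after running for a while.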