Hi, thanks for your great work in combining MAE with hierarchical vision transformers!
However, when I tried to reproduce the results using your code, I encountered a CUDA error when training MAE with Swin. Here is the relevant part of the log.
[09:52:59.848646] base lr: 1.50e-04
[09:52:59.848669] actual lr: 1.87e-05
[09:52:59.848696] accumulate grad iterations: 1
[09:52:59.848714] effective batch size: 32
[09:52:59.905504] AdamW (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.95)
eps: 1e-08
lr: 1.875e-05
weight_decay: 0.0
Parameter Group 1
amsgrad: False
betas: (0.9, 0.95)
eps: 1e-08
lr: 1.875e-05
weight_decay: 0.05
)
[09:52:59.906380] Checkpoint not founded in /data/code/pretrain/checkpoints/pretrain/simmim_swin_tiny_256_um_simmim_bs2048_ep200_temp.pth, train from random initialization
[09:52:59.906428] Start training for 200 epochs
[09:52:59.907566] log_dir: /data/code/pretrain/tb/simmim/simmim_swin_tiny_256_um_simmim_bs2048_ep200
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
File "main_pretrain.py", line 380, in <module>
main(args)
File "main_pretrain.py", line 266, in main
train_stats = train_one_epoch(
File "/home/wanghaochen/project/UM-MAE/engine_pretrain.py", line 58, in train_one_epoch
loss_scaler(loss, optimizer, parameters=model.parameters(),
File "/home/wanghaochen/project/UM-MAE/util/misc.py", line 256, in __call__
self._scaler.scale(loss).backward(create_graph=create_graph)
File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 32, 64, 64], dtype=torch.half, device='cuda', requires_grad=True).to(memory_format=torch.channels_last)
net = torch.nn.Conv2d(32, 24, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half().to(memory_format=torch.channels_last)
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f4c4403d690
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 1, 32, 64, 64,
strideA = 131072, 1, 2048, 32,
output: TensorDescriptor 0x7f4c4403d2b0
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 1, 24, 64, 64,
strideA = 98304, 1, 1536, 24,
weight: FilterDescriptor 0x7f4c440688f0
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NHWC
nbDims = 4
dimA = 24, 32, 1, 1,
Pointer addresses:
input: 0x7f4c6d848000
output: 0x7f4c7b918000
weight: 0x7f4ce2dff600
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f4e686dd2f2 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f4e686da67b in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f4e689351f9 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f4e686c53a4 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7f4ebb5be8d9 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7f4ebb5b389a in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f4ebb5dab32 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f4ebaf17a86 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa20e2f (0x7f4ebb5dde2f in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369b90 (0x7f4ebaf26b90 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36adfe (0x7f4ebaf27dfe in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x5d28f4]
frame #12: /usr/bin/python3() [0x5a729d]
frame #13: /usr/bin/python3() [0x5ec780]
frame #14: /usr/bin/python3() [0x5441f8]
frame #15: /usr/bin/python3() [0x54424a]
frame #16: PyDict_SetItemString + 0x536 (0x5d1686 in /usr/bin/python3)
frame #17: PyImport_Cleanup + 0x79 (0x684619 in /usr/bin/python3)
frame #18: Py_FinalizeEx + 0x7f (0x67f8af in /usr/bin/python3)
frame #19: Py_RunMain + 0x32d (0x6b70fd in /usr/bin/python3)
frame #20: Py_BytesMain + 0x2d (0x6b736d in /usr/bin/python3)
frame #21: __libc_start_main + 0xf3 (0x7f4ec081f0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: _start + 0x2e (0x5fa5ce in /usr/bin/python3)
Killing subprocess 133
Killing subprocess 134
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main_pretrain.py', '--local_rank=1', '--batch_size', '16', '--accum_iter', '1', '--model', 'simmim_swin_tiny_256', '--input_size', '256', '--token_size', '16', '--mask_ratio', '0.75', '--epochs', '200', '--warmup_epochs', '10', '--blr', '1.5e-4', '--weight_decay', '0.05', '--data_path', 's3://sky/datasets/imagenet/imagenet', '--dataloader_type', 'nori', '--output_dir', '/data/code/pretrain/checkpoints/pretrain/', '--log_dir', '/data/code/pretrain/tb/simmim/', '--experiment', 'um_simmim_bs2048_ep200']' died with <Signals.SIGABRT: 6>.
MAE with ViTs or PVTs trains successfully; however, when I tried to train SimMIM with Swin, this issue came up again.
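In case it helps narrow this down, here is a small standalone check I put together based on the repro snippet printed in the log (the flag values are my own guesses for debugging, not settings taken from this repo): it re-runs the same fp16 NHWC 1x1 convolution with cuDNN autotuning and TF32 turned off, to see whether the benchmark-selected kernel is what triggers the error.

import torch

# NOTE: these flag values are assumptions for debugging, not repo defaults.
# Disabling autotuning/TF32 makes cuDNN fall back to heuristic kernel selection.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.allow_tf32 = False

# Same shapes, dtype, and memory format as the failing op reported in the log above.
data = torch.randn([1, 32, 64, 64], dtype=torch.half, device='cuda',
                   requires_grad=True).to(memory_format=torch.channels_last)
net = torch.nn.Conv2d(32, 24, kernel_size=1).cuda().half().to(memory_format=torch.channels_last)
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
print('standalone fp16 channels_last conv forward/backward completed without a cuDNN error')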