CUDA error for Swin models

Hi, thanks for your great work in combining MAE with hierarchical vision transformers!
However, when I tried to reproduce the results using your code, I encountered a CUDA error when training MAE with Swin. Here is a part of the log.
[09:52:59.848646] base lr: 1.50e-04
[09:52:59.848669] actual lr: 1.87e-05
[09:52:59.848696] accumulate grad iterations: 1
[09:52:59.848714] effective batch size: 32
[09:52:59.905504] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    eps: 1e-08
    lr: 1.875e-05
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    eps: 1e-08
    lr: 1.875e-05
    weight_decay: 0.05
)
[09:52:59.906380] Checkpoint not founded in /data/code/pretrain/checkpoints/pretrain/simmim_swin_tiny_256_um_simmim_bs2048_ep200_temp.pth, train from random initialization
[09:52:59.906428] Start training for 200 epochs
[09:52:59.907566] log_dir: /data/code/pretrain/tb/simmim/simmim_swin_tiny_256_um_simmim_bs2048_ep200
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can a
dversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have
 unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can a
dversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have
 unused parameters. (function operator())
Traceback (most recent call last):
  File "main_pretrain.py", line 380, in <module>
    main(args)
  File "main_pretrain.py", line 266, in main
    train_stats = train_one_epoch(
  File "/home/wanghaochen/project/UM-MAE/engine_pretrain.py", line 58, in train_one_epoch
    loss_scaler(loss, optimizer, parameters=model.parameters(),
  File "/home/wanghaochen/project/UM-MAE/util/misc.py", line 256, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 32, 64, 64], dtype=torch.half, device='cuda', requires_grad=True).to(memory_format=torch.channels_last)
net = torch.nn.Conv2d(32, 24, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half().to(memory_format=torch.channels_last)
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7f4c4403d690
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 32, 64, 64,
    strideA = 131072, 1, 2048, 32,
output: TensorDescriptor 0x7f4c4403d2b0
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 24, 64, 64,
    strideA = 98304, 1, 1536, 24,
weight: FilterDescriptor 0x7f4c440688f0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NHWC
    nbDims = 4
    dimA = 24, 32, 1, 1,
Pointer addresses:
    input: 0x7f4c6d848000
    output: 0x7f4c7b918000
    weight: 0x7f4ce2dff600

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f4e686dd2f2 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f4e686da67b in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f4e689351f9 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f4e686c53a4 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7f4ebb5be8d9 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7f4ebb5b389a in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f4ebb5dab32 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f4ebaf17a86 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa20e2f (0x7f4ebb5dde2f in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369b90 (0x7f4ebaf26b90 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36adfe (0x7f4ebaf27dfe in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x5d28f4]
frame #12: /usr/bin/python3() [0x5a729d]
frame #13: /usr/bin/python3() [0x5ec780]
frame #14: /usr/bin/python3() [0x5441f8]
frame #15: /usr/bin/python3() [0x54424a]
frame #16: PyDict_SetItemString + 0x536 (0x5d1686 in /usr/bin/python3)
frame #17: PyImport_Cleanup + 0x79 (0x684619 in /usr/bin/python3)
frame #18: Py_FinalizeEx + 0x7f (0x67f8af in /usr/bin/python3)
frame #19: Py_RunMain + 0x32d (0x6b70fd in /usr/bin/python3)
frame #20: Py_BytesMain + 0x2d (0x6b736d in /usr/bin/python3)
frame #21: __libc_start_main + 0xf3 (0x7f4ec081f0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: _start + 0x2e (0x5fa5ce in /usr/bin/python3)
Killing subprocess 133
Killing subprocess 134
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main_pretrain.py', '--local_rank=1', '--batch_size', '16', '--accum_iter', '1', '--model', 'simmim_swin_tiny_256', '--input_size', '256', '--token_size',
'16', '--mask_ratio', '0.75', '--epochs', '200', '--warmup_epochs', '10', '--blr', '1.5e-4', '--weight_decay', '0.05', '--data_path', 's3://sky/datasets/imagenet/imagenet', '--dataloader_type', 'nori', '--output_dir', '/da
ta/code/pretrain/checkpoints/pretrain/', '--log_dir', '/data/code/pretrain/tb/simmim/', '--experiment', 'um_simmim_bs2048_ep200']' died with <Signals.SIGABRT: 6>.
While MAE with ViTs or PVTs are successfully trained, and when I tried to train SimMIM with Swin, this issue came up again.
implus / UM-MAE

CUDA error for Swin models #9