Closed: baibaidj closed this issue 2 years ago.
related issue: https://github.com/pytorch/pytorch/issues/48945
RuntimeError: CUDA error: an illegal memory access was encountered
In my experience, this error when converting Conv2d to Conv3d is another form of OOM: it appears when the model tries to allocate a tensor larger than the GPU's total memory. To test this, reduce the batch size or the number of channels in the model.
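For a quick check of that hypothesis, the sketch below (plain PyTorch; the device index 0 is an assumption) prints how much of the GPU's memory the process is actually using. Running with CUDA_LAUNCH_BLOCKING=1 also makes the traceback point at the kernel that really failed:

```python
import os
import torch

# Set before the first CUDA call so errors surface at the failing kernel
# instead of at a later, unrelated op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def report_gpu_memory(device: int = 0) -> None:
    """Print total vs. currently used memory for one GPU."""
    props = torch.cuda.get_device_properties(device)
    gib = 1024 ** 3
    print(f"{props.name}: total {props.total_memory / gib:.1f} GiB, "
          f"allocated {torch.cuda.memory_allocated(device) / gib:.1f} GiB, "
          f"reserved {torch.cuda.memory_reserved(device) / gib:.1f} GiB")

if __name__ == "__main__":
    report_gpu_memory()
```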
Also make sure torch.backends.cudnn.enabled is set to True (a quick check is sketched below).
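A minimal check, assuming a standard PyTorch build with cuDNN:

```python
import torch

# Confirm this build ships cuDNN and that it is turned on before training starts.
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN enabled:", torch.backends.cudnn.enabled)
torch.backends.cudnn.enabled = True
```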
conv3d memory issue
Describe the Issue
I'm training a 3D segmentation network for organs using UPerNet 3D with Conv3d. After training runs for a few iterations, a CUDA error ("an illegal memory access was encountered") is raised. I expect it to keep training without this error.
Reproduction
model settings
conv_cfg = dict(type='Conv3d')
norm_cfg = dict(type='BN3d', requires_grad=True)  # Sync
base_channels = 24  # bs2, chn24: 16G; bs2, chn48: failed due to illegal memory access
fpn_chn = int(512 * base_channels / 96)
model = dict(
    type='EncoderDecoderMonai',
    pretrained=None,
    backbone=dict(
        type='SwinTransformer3d',
        in_chans=1,
        embed_dim=base_channels,
        depths=[2, 2, 6, 2],  # 2, 4, 12, 4: bs2, 21G
        num_heads=[3, 6, 12, 24],
        window_size=7,
        mlp_ratio=4.,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.3,
        ape=False,
        patch_norm=True,
        out_indices=(0, 1, 2, 3),
        use_checkpoint=False),
    neck=dict(
        type='UPerNeck3D',
        in_channels=[base_channels * (2 ** i) for i in range(4)],
        in_index=[0, 1, 2, 3],
        pool_scales=(1, 2, 3, 6),
        channels=fpn_chn,
        conv_cfg=conv_cfg,
        norm_cfg=norm_cfg,
        align_corners=False),
    decode_head=dict(
        type='FCNHead3D',
        in_channels=[fpn_chn] * 4,
        in_index=(0, 1, 2, 3),
        channels=fpn_chn,
        input_transform='resize_concat',
        kernel_size=3,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=num_classes,
        conv_cfg=conv_cfg,
        norm_cfg=norm_cfg,
        align_corners=False,
        verbose=False,
        loss_decode=dict(
            type='ComboLossMed',
            loss_weight=(1.0, 0.6),
            num_classes=num_classes,
            class_weight=(0.8, 1.1, 1.0, 1.0),
            verbose=False)),
    auxiliary_head=dict(
        type='FCNHead3D',
        in_channels=fpn_chn,
        in_index=0,
        channels=fpn_chn // 2,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=num_classes,
        conv_cfg=conv_cfg,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='ComboLossMed',
            loss_weight=(1.0 * 0.4, 0.6 * 0.4),
            num_classes=num_classes,
            class_weight=(0.8, 1.1, 1.0, 1.0),
            verbose=False)))
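To separate the config from the error, a standalone Conv3d forward/backward at the failing channel width can show whether the problem is plain memory pressure. This is only a sketch: batch size 2 and the 96x96x96 patch are assumptions, not the exact training crop.

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

# One 3D conv at the channel width that failed (base_channels=48 -> 96 out).
conv = nn.Conv3d(48, 96, kernel_size=3, padding=1).to(device)

# Batch size 2 and a 96x96x96 patch are assumed, not the real crop size.
x = torch.randn(2, 48, 96, 96, 96, device=device, requires_grad=True)

out = conv(x)
out.mean().backward()
torch.cuda.synchronize()  # force pending kernels to finish so any error surfaces here
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1024 ** 3:.1f} GiB")
```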
model training and testing settings
optimizer
AdamW optimizer, no weight decay for position embedding & layer norm in backbone
optimizer = dict(
    type='AdamW',
    lr=0.0001,
    betas=(0.9, 0.999),
    weight_decay=0.001,
    paramwise_cfg=dict(
        custom_keys={
            'absolute_pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)}))
lr_config = dict(
    policy='poly',
    warmup='linear',
    warmup_iters=1500,
    warmup_ratio=1e-6,
    power=1.0,
    min_lr=0.0,
    by_epoch=False)
optimizer_config = dict(
    type='Fp16OptimizerHook',
    loss_scale=512.,
    distributed=False,
    grad_clip=dict(max_norm=8, norm_type=2))
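The paramwise_cfg and Fp16OptimizerHook above are mmcv constructs. For anyone reproducing this outside mmcv, a rough plain-PyTorch sketch of the same idea (no weight decay on the position-embedding tables and norm layers, plus fp16 loss scaling) could look like this; the substring matching on parameter names is an assumption, not the exact mmcv behaviour:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with weight decay disabled for pos-embed and norm parameters."""
    no_decay, decay = [], []
    skip = ("absolute_pos_embed", "relative_position_bias_table", "norm")
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (no_decay if any(k in name for k in skip) else decay).append(param)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": 1e-3},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=1e-4, betas=(0.9, 0.999))

# GradScaler uses dynamic loss scaling starting at 512; the mmcv hook above
# uses a static scale, so this is only an approximation.
scaler = torch.cuda.amp.GradScaler(init_scale=512.0)
```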
Bug fix: not yet.