himashi92 / VT-UNet

[MICCAI2022] This is an official PyTorch implementation for A Robust Volumetric Transformer for Accurate 3D Tumor Segmentation
MIT License

CUDA out of memory #38

Closed atisman89 closed 1 year ago

atisman89 commented 1 year ago

Hi, I'm getting a CUDA out-of-memory error while training with the small configuration. I saw other similar issues, but I'm not sure how to resolve this in my situation. I'm using an AWS EC2 p3.2xlarge instance (61 GB RAM, 16 GB GPU memory). Is the "tiny" configuration still available? If so, how can I use it? Thanks.

  File "/home/VTUNet/vtunet/network_architecture/vtunet_tumor.py", line 359, in forward
    x, x2, v, k, q = self.forward_part1(x, mask_matrix, prev_v, prev_k, prev_q, is_decoder)
  File "/home/VTUNet/vtunet/network_architecture/vtunet_tumor.py", line 307, in forward_part1
    attn_windows, cross_attn_windows, v, k, q = self.attn(x_windows, mask=attn_mask, prev_v=prev_v, prev_k=prev_k,
  File "/opt/conda/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/VTUNet/vtunet/network_architecture/vtunet_tumor.py", line 197, in forward
    x2 = (attn2 @ prev_v).transpose(1, 2).reshape(B_, N, C)
RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 0; 15.77 GiB total capacity; 13.91 GiB already allocated; 166.12 MiB free; 14.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception in thread Thread-5:
...
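As a side note, the error message itself suggests setting max_split_size_mb to reduce allocator fragmentation. A minimal sketch of how that could be tried is below; the 128 MiB value is only an example, and since most of the 16 GB was genuinely allocated rather than fragmented here, this alone may not be enough:

    import os

    # PYTORCH_CUDA_ALLOC_CONF must be set before the CUDA allocator is
    # initialised (i.e. before the first CUDA allocation), so do it at the
    # very top of the training entry point.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

    import torch  # import torch only after the variable is set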
atisman89 commented 1 year ago

Resolved by changing the default batch size to 1 in /home/VTUNet/vtunet/run/default_configuration.py:

    elif task == 'Task003_tumor':
        print("Task Tumor here we go !!!")
        plans['plans_per_stage'][0]['batch_size'] = 1

The original value was 4; changing it to 2 still ran out of memory on a 16 GB GPU, so 1 was the only value that worked for me...
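For anyone hitting the same limit, a quick way to see how much headroom batch_size = 1 leaves on a 16 GB card is to log the standard PyTorch CUDA memory counters from inside the training loop. This is a generic sketch, not part of the VT-UNet code base; the helper name and where it is called are placeholders:

    import torch

    def log_gpu_memory(tag: str) -> None:
        """Print allocated/reserved/total CUDA memory (GiB) for device 0."""
        gib = 1024 ** 3
        total = torch.cuda.get_device_properties(0).total_memory / gib
        allocated = torch.cuda.memory_allocated(0) / gib
        reserved = torch.cuda.memory_reserved(0) / gib
        print(f"[{tag}] allocated={allocated:.2f} GiB "
              f"reserved={reserved:.2f} GiB total={total:.2f} GiB")

    # e.g. call log_gpu_memory("after forward") right after the forward pass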