turandai / gaussian_surfels

Implementation of the SIGGRAPH 2024 conference paper "High-quality Surface Reconstruction using Gaussian Surfels".
540 stars 26 forks

CUDA out of memory during training #41

Closed ahegd closed 2 months ago

ahegd commented 3 months ago

Hi, I am training on an AWS EC2 g4dn.12xlarge instance, which has 4 GPUs with 16 GB of memory each. I get a CUDA out-of-memory error while training on 100 images of resolution 3071 × 1919. Please advise.

Estimating normal:   0%|                                                                                                                       | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "estimate_normal.py", line 147, in <module>
    save_outputs(f, os.path.splitext(os.path.basename(f))[0])
  File "/home/ubuntu/miniconda3/envs/gaussian_surfels/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "estimate_normal.py", line 111, in save_outputs
    output = model(img_tensor).clamp(min=0, max=1)
  File "/home/ubuntu/miniconda3/envs/gaussian_surfels/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/codebases/gaussian_surfels/submodules/omnidata/modules/midas/dpt_depth.py", line 107, in forward
    return super().forward(x).squeeze(dim=1)
  File "/home/ubuntu/codebases/gaussian_surfels/submodules/omnidata/modules/midas/dpt_depth.py", line 71, in forward
    layer_1, layer_2, layer_3, layer_4 = forward_vit(self.pretrained, x)
  File "/home/ubuntu/codebases/gaussian_surfels/submodules/omnidata/modules/midas/vit.py", line 64, in forward_vit
    glob = pretrained.model.forward_flex(x)
  File "/home/ubuntu/codebases/gaussian_surfels/submodules/omnidata/modules/midas/vit.py", line 151, in forward_flex
    x = blk(x)
  File "/home/ubuntu/miniconda3/envs/gaussian_surfels/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/gaussian_surfels/lib/python3.7/site-packages/timm/models/vision_transformer.py", line 164, in forward
    x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
  File "/home/ubuntu/miniconda3/envs/gaussian_surfels/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/gaussian_surfels/lib/python3.7/site-packages/timm/models/vision_transformer.py", line 97, in forward
    attn = q @ k.transpose(-2, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.73 GiB (GPU 0; 14.57 GiB total capacity; 3.33 GiB already allocated; 10.63 GiB free; 3.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
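The 23.73 GiB allocation is consistent with the `q @ k.transpose(-2, -1)` line in the traceback: the ViT backbone in the omnidata normal estimator materializes a full token-by-token attention matrix, so memory grows quadratically with image area. A rough back-of-the-envelope sketch (assuming patch size 16, 12 attention heads, float32, and padding each side up to a patch multiple — plausible defaults for this DPT/ViT model, not verified against the checkpoint):

```python
# Rough estimate of the q @ k^T attention-matrix allocation in a ViT.
# Assumptions (not confirmed for this exact model): patch size 16,
# 12 heads, float32, sides padded up to a multiple of the patch size.
def attn_matrix_gib(width, height, patch=16, heads=12, bytes_per_el=4):
    # number of patch tokens after padding each side up to the patch size
    tokens = -(-width // patch) * -(-height // patch)  # ceil division
    # q @ k^T materializes a [heads, tokens, tokens] tensor
    return heads * tokens * tokens * bytes_per_el / 2**30

print(f"{attn_matrix_gib(3071, 1919):.2f} GiB")  # → 23.73 GiB
```

Under these assumptions the estimate matches the 23.73 GiB in the error message, which suggests the normal-estimation step is running at full input resolution.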
turandai commented 3 months ago

Hi, hundreds of high-res images can lead to OOM; try downsampling the images or using fewer of them. Our method usually uses more runtime memory than the original 3DGS due to normal estimation and the additional gradient graph.
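One way to follow the downsampling suggestion is to resize the images on disk before running `estimate_normal.py`. Since the attention matrix scales with the square of the token count, halving each side cuts that allocation by roughly 16×. A minimal sketch (the `data/images` paths and the 2× factor are assumptions; adjust to your dataset layout; requires Pillow):

```python
import glob
import os

def downsampled_size(w, h, factor=2):
    # Halving each side shrinks the ViT attention matrix ~16x,
    # since it grows with (width * height) squared.
    return max(1, w // factor), max(1, h // factor)

try:
    from PIL import Image  # Pillow assumed installed in the env

    src, dst = "data/images", "data/images_half"  # hypothetical paths
    os.makedirs(dst, exist_ok=True)
    for f in glob.glob(os.path.join(src, "*.jpg")):
        img = Image.open(f)
        img.resize(downsampled_size(*img.size), Image.LANCZOS).save(
            os.path.join(dst, os.path.basename(f)))
except ImportError:
    pass  # Pillow not available; only the size helper is usable
```

Then point the preprocessing/training scripts at the downsampled directory. At 1535 × 959 the attention matrix from the sketch above stays well under the 16 GB per-GPU budget of a g4dn instance.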