tangjiapeng / SA-ConvONet

ICCV 2021 Oral. SA-ConvONet: Sign-Agnostic Optimization of Convolutional Occupancy Networks
MIT License

GPU memory issue when running the demo #3

Open xingruiyang opened 2 years ago

xingruiyang commented 2 years ago

Hi,

I am running into a memory issue where CUDA reports insufficient VRAM. The problem appears to stem from line 245 of src/conv_onet/training.py, where the model extracts 3D features from the input point cloud.

My GPU has 8 GiB of VRAM, but the demo seems to need more than that. Could you tell me the minimum required VRAM, and whether there is a way to reduce the memory requirement? Thanks

Regards

tangjiapeng commented 2 years ago

Hi,

The GPUs I ran these experiments on had at least 11 GB of RAM.

If you are using an 8 GB GPU, you can reduce the batch size from 16 to 12 or 8 to resolve the memory issue.

You can also reduce the number of input point-cloud points used for point feature learning.
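For example, the relevant entries in the training config would look roughly like this (a sketch only; batch_size and pointcloud_n are ConvONet-style keys, but the exact section nesting in this repo's .yaml files may differ):

```yaml
# Sketch: lower these values to reduce peak GPU memory.
# Section names are illustrative; check the actual config files in the repo.
training:
  batch_size: 8        # reduced from 16 to fit an 8 GB GPU
data:
  pointcloud_n: 4096   # fewer input points for point feature learning (value illustrative)
```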

Hope this helps! Please let me know whether it resolves the problem.

Best, Jiapeng

xingruiyang commented 2 years ago

Thanks @tangjiapeng,

I can see the batch size in generate_optim_largescene.py is set to 1, so I don't think a large batch size is the issue. I did try to downsample the point cloud as you suggested: I set both pointcloud_n and pointcloud_subsample in demo_matterport.yaml to 4096 (sketched after the error log below), but I am still getting OOM errors. Is this the correct way to downsample the input points? To help you diagnose the issue, I have included the error message below:

Warning: generator does not support pointcloud generation.
  0%|          | 0/2 [00:00<?, ?it/s]
Process scenes in a sliding-window manner
ft only encoder True
only optimize encoder
100%|██████████| 693/693 [01:44<00:00,  6.02it/s]
Traceback (most recent call last):
  File "generate_optim_largescene.py", line 235, in <module>
    loss = trainer.sign_agnostic_optim_cropscene_step(crop_data, state_dict)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/training.py", line 216, in sign_agnostic_optim_cropscene_step
    loss = self.compute_sign_agnostic_cropscene_loss(data)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/training.py", line 244, in compute_sign_agnostic_cropscene_loss
    c = self.model.encode_inputs(inputs)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/models/__init__.py", line 60, in encode_inputs
    c = self.encoder(inputs)
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/pointnet.py", line 307, in forward
    fea['grid'] = self.generate_grid_features(index['grid'], c)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/pointnet.py", line 262, in generate_grid_features
    fea_grid = self.unet3d(fea_grid)
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 465, in forward
    x = decoder(encoder_features, x)
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 284, in forward
    x = self.joining(encoder_features, x)
  File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 291, in _joining
    return torch.cat((encoder_features, x), dim=1)
RuntimeError: CUDA out of memory. Tried to allocate 750.00 MiB (GPU 0; 7.79 GiB total capacity; 5.98 GiB already allocated; 446.81 MiB free; 6.03 GiB reserved in total by PyTorch)
Exception ignored in: <bound method tqdm.__del__ of   0%|          | 0/2 [01:50<?, ?it/s]>
Traceback (most recent call last):
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 931, in __del__
    self.close()
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1133, in close
    self._decr_instances(self)
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 496, in _decr_instances
    cls.monitor.exit()
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/threading.py", line 1053, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
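For reference, the downsampling edit described above corresponds roughly to the following in demo_matterport.yaml (a sketch; the actual section nesting in the config may differ):

```yaml
# Sketch of the attempted edit in demo_matterport.yaml; the section name is illustrative,
# only the two keys and the value 4096 come from the description above.
data:
  pointcloud_n: 4096
  pointcloud_subsample: 4096
```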
tangjiapeng commented 2 years ago

The batch_size in demo_matterport.yaml is 2; you can set it to 1.
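In the config that change would look roughly like this (the exact section that holds batch_size in demo_matterport.yaml may differ):

```yaml
# Sketch only; the section name is illustrative.
training:
  batch_size: 1   # reduced from 2 to lower peak GPU memory
```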

A better option is to use a GPU with more RAM.

tangjiapeng commented 2 years ago

Hi Xingrui, have you resolved the memory issue?