mit-han-lab / spvnas

[ECCV 2020] Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution
http://spvnas.mit.edu/
MIT License

Cuda out of memory #52

Closed FrancescoMandru closed 3 years ago

FrancescoMandru commented 3 years ago

I'm trying to run example.py on Colab, and when I load a pre-trained model with:

model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs').to(device)

I get the following out-of-memory error, which is strange because I don't think the network has that many parameters:

RuntimeError: CUDA out of memory. Tried to allocate 127.56 GiB (GPU 0; 14.76 GiB total capacity; 1.25 GiB already allocated; 12.37 GiB free; 1.36 GiB reserved in total by PyTorch)

I also want to let you know that I opened an issue in the torchsparse repository, because there are problems running the tests on CPU in a Docker environment configured as you request.

zhijian-liu commented 3 years ago

Could you remove .to(device) and profile the number of parameters of the model on the CPU?
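A minimal sketch of the kind of parameter profiling asked for here, assuming a standard PyTorch nn.Module. The helper name count_parameters and the small toy_model are hypothetical stand-ins, not part of the SPVNAS code; the pre-trained model from the model zoo would be passed in the same way once it can be constructed.

```python
import torch
from torch import nn


def count_parameters(model: nn.Module):
    """Return (number of parameters, total parameter size in bytes)."""
    n_params = sum(p.numel() for p in model.parameters())
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return n_params, n_bytes


# Hypothetical stand-in for the real network; it stays on the CPU because
# .to(device) is never called.
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

n_params, n_bytes = count_parameters(toy_model)
print(f"{n_params} parameters, {n_bytes / 1024 ** 2:.6f} MiB")
```

Parameter size is only a lower bound on GPU usage, since activations and workspace buffers are allocated on top of it.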

FrancescoMandru commented 3 years ago

Actually I can't, because the model seems to be loaded onto the GPU regardless of the device type, which is somewhat related to the issue I opened on the torchsparse library. Here is my attempt at what you asked:

Downloading: "https://hanlab.mit.edu/files/SPVNAS/spvnas_specialized/SemanticKITTI_val_SPVNAS@20GMACs/net.config" to .torch/spvnas_specialized/SemanticKITTI_val_SPVNAS@20GMACs/net.config
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-14-d842a8f56b01> in <module>()
      1 # import SPVNAS model from model zoo
      2 from model_zoo import spvnas_specialized, spvcnn
----> 3 model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs')
      4 print(model.parameters)
      5 model.eval()

18 frames
/usr/local/lib/python3.7/dist-packages/torchsparse/nn/functional/conv.py in forward(ctx, features, kernel, neighbor_map, neighbor_offset, sizes, transpose)
     38             torchsparse_backend.sparseconv_forward(features, out, kernel,
     39                                                    neighbor_map,
---> 40                                                    neighbor_offset, transpose)
     41         else:
     42             # use the native pytorch XLA APIs for the TPU.

RuntimeError: CUDA out of memory. Tried to allocate 66.83 GiB (GPU 0; 14.76 GiB total capacity; 187.02 MiB already allocated; 13.53 GiB free; 202.00 MiB reserved in total by PyTorch)

Moreover, if I run the same cell several times, I get different errors:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-17-efa215b6e51b> in <module>()
      1 # import SPVNAS model from model zoo
      2 from model_zoo import spvnas_specialized, spvcnn
----> 3 model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs')
      4 print(model.parameters())
      5 model.eval()

18 frames
/usr/local/lib/python3.7/dist-packages/torchsparse/nn/functional/conv.py in forward(ctx, features, kernel, neighbor_map, neighbor_offset, sizes, transpose)
     38             torchsparse_backend.sparseconv_forward(features, out, kernel,
     39                                                    neighbor_map,
---> 40                                                    neighbor_offset, transpose)
     41         else:
     42             # use the native pytorch XLA APIs for the TPU.

RuntimeError: at::cuda::blas::gemm<float> argument n must be non-negative and less than 2147483647 but got -1332474640

or

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-18-efa215b6e51b> in <module>()
      1 # import SPVNAS model from model zoo
      2 from model_zoo import spvnas_specialized, spvcnn
----> 3 model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs')
      4 print(model.parameters())
      5 model.eval()

5 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in convert(t)
    669                 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
    670                             non_blocking, memory_format=convert_to_format)
--> 671             return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
    672 
    673         return self._apply(convert)

RuntimeError: CUDA error: an illegal memory access was encountered

I think they are all related to an overflow in CUDA memory.
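As a rough sanity check on the numbers in the errors above, the arithmetic can be sketched in plain Python. The helper names float32_elements and wrap_int32 are hypothetical, and the 32-bit wraparound reading of the gemm error is an assumption for illustration, not a confirmed diagnosis.

```python
GIB = 1024 ** 3


def float32_elements(gib: float) -> int:
    """Number of 4-byte float32 values that fit in the given number of GiB."""
    return int(gib * GIB / 4)


def wrap_int32(value: int) -> int:
    """Interpret an integer as a signed 32-bit value (two's complement)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


# Allocation sizes reported in the out-of-memory errors in this thread.
for gib in (127.56, 66.83):
    print(f"{gib} GiB is about {float32_elements(gib):,} float32 values")

# The negative n in the gemm error is what a count of 2,962,492,656 looks
# like after being truncated to a signed 32-bit integer (assumption: the
# value was produced by such a truncation).
reported_n = -1332474640
unsigned_n = reported_n + (1 << 32)
print(unsigned_n, wrap_int32(unsigned_n))
```

Requests of tens of billions of elements, changing from run to run for the same model, suggest that the sizes passed to the kernel are invalid rather than that the network itself is too large.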

luiscastro1995 commented 3 years ago

I am having the exact same problem when running the SPVNAS tutorial in Colab with a GPU backend. When I run the commands:

18 frames
/usr/local/lib/python3.7/dist-packages/torchsparse/nn/functional/conv.py in forward(ctx, features, kernel, neighbor_map, neighbor_offset, sizes, transpose)
     38             torchsparse_backend.sparseconv_forward(features, out, kernel,
     39                                                    neighbor_map,
---> 40                                                    neighbor_offset, transpose)
     41         else:
     42             # use the native pytorch XLA APIs for the TPU.

RuntimeError: CUDA out of memory. Tried to allocate 115.36 GiB (GPU 0; 14.76 GiB total capacity; 754.88 MiB already allocated; 12.97 GiB free; 770.00 MiB reserved in total by PyTorch)


Could this error be related to the version of torchsparse library installed?

Thank you!

luiscastro1995 commented 3 years ago

I have tried to follow the same tutorial on my desktop PC, using a conda env with PyTorch 1.8.1 and CUDA 11.1. The same error still pops up at the line model = spvnas_specialized('SemanticKITTI_val_SPVNAS@65GMACs').to(device). I am using an NVIDIA RTX 2060 Super with 8 GB of memory, so I would expect the model to fit.

Any suggestions, @zhijian-liu?

zhijian-liu commented 3 years ago

That's really strange. I cannot reproduce the error on my local machine (I'm using Torch 1.8.1 and the latest TorchSparse). It allocates around 1.4 GB of GPU memory. I will try to validate the issue on Google Colab.
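One way to obtain a comparable GPU memory figure is PyTorch's built-in allocator statistics. This is a sketch under assumptions: peak_gpu_memory_mib and the small stand-in model are hypothetical, and how the 1.4 GB figure above was measured is not stated in the thread.

```python
import torch
from torch import nn


def peak_gpu_memory_mib(build_model):
    """Peak memory (MiB) allocated by PyTorch while moving a model to the GPU.

    Returns None when no CUDA device is available.
    """
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = build_model().to("cuda")
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated() / 1024 ** 2
    del model
    return peak


# Hypothetical stand-in; the model zoo constructor would be passed instead.
peak_mib = peak_gpu_memory_mib(lambda: nn.Linear(1024, 1024))
print("no CUDA device" if peak_mib is None else f"peak: {peak_mib:.1f} MiB")
```

These counters cover only tensors managed by PyTorch's caching allocator, so tools such as nvidia-smi will typically report a larger number that also includes the CUDA context.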

zhijian-liu commented 3 years ago

I've updated the tutorial to fix the issue: https://colab.research.google.com/github/mit-han-lab/spvnas/blob/master/tutorial.ipynb. Please let me know if the updated version works for you.

luiscastro1995 commented 3 years ago

Thank you @zhijian-liu. With your changes I can now run your tutorial successfully, both in Colab and on my local machine!