Closed FrancescoMandru closed 3 years ago
Could you remove .to(device) and profile the number of parameters of the model on the CPU?
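For example, something along these lines should work entirely on the CPU (using a toy module here as a stand-in; in the tutorial you would pass the actual spvnas_specialized model instead):

```python
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the element counts of all parameter tensors; no GPU involved.
    return sum(p.numel() for p in model.parameters())

# Toy stand-in model; in the tutorial this would be
# spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs') without .to(device).
model = nn.Linear(4, 2)
print(count_parameters(model))  # 4*2 weights + 2 biases = 10
```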
Unfortunately that doesn't help, as the model seems to be loaded on the GPU regardless of the device, which may be related to the issue opened on the torchsparse library. I attach the output of that attempt:
Downloading: "https://hanlab.mit.edu/files/SPVNAS/spvnas_specialized/SemanticKITTI_val_SPVNAS@20GMACs/net.config" to .torch/spvnas_specialized/SemanticKITTI_val_SPVNAS@20GMACs/net.config
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-14-d842a8f56b01> in <module>()
1 # import SPVNAS model from model zoo
2 from model_zoo import spvnas_specialized, spvcnn
----> 3 model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs')
4 print(model.parameters)
5 model.eval()
18 frames
/usr/local/lib/python3.7/dist-packages/torchsparse/nn/functional/conv.py in forward(ctx, features, kernel, neighbor_map, neighbor_offset, sizes, transpose)
38 torchsparse_backend.sparseconv_forward(features, out, kernel,
39 neighbor_map,
---> 40 neighbor_offset, transpose)
41 else:
42 # use the native pytorch XLA APIs for the TPU.
RuntimeError: CUDA out of memory. Tried to allocate 66.83 GiB (GPU 0; 14.76 GiB total capacity; 187.02 MiB already allocated; 13.53 GiB free; 202.00 MiB reserved in total by PyTorch)
Moreover, if I run the same cell multiple times, I get different errors:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-17-efa215b6e51b> in <module>()
1 # import SPVNAS model from model zoo
2 from model_zoo import spvnas_specialized, spvcnn
----> 3 model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs')
4 print(model.parameters())
5 model.eval()
18 frames
/usr/local/lib/python3.7/dist-packages/torchsparse/nn/functional/conv.py in forward(ctx, features, kernel, neighbor_map, neighbor_offset, sizes, transpose)
38 torchsparse_backend.sparseconv_forward(features, out, kernel,
39 neighbor_map,
---> 40 neighbor_offset, transpose)
41 else:
42 # use the native pytorch XLA APIs for the TPU.
RuntimeError: at::cuda::blas::gemm<float> argument n must be non-negative and less than 2147483647 but got -1332474640
or
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-18-efa215b6e51b> in <module>()
1 # import SPVNAS model from model zoo
2 from model_zoo import spvnas_specialized, spvcnn
----> 3 model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs')
4 print(model.parameters())
5 model.eval()
5 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in convert(t)
669 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
670 non_blocking, memory_format=convert_to_format)
--> 671 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
672
673 return self._apply(convert)
RuntimeError: CUDA error: an illegal memory access was encountered
I think they are all related to a memory overflow in CUDA.
I am having the exact same problem when running the SPVNAS tutorial in Colab with the GPU backend. When I run the commands:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-18-f5ea08b4d2a2> in <module>()
1 # import SPVNAS model from model zoo
2 from model_zoo import spvnas_specialized
----> 3 model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs')
4 # model = spvnas_specialized('SemanticKITTI_val_SPVNAS@65GMACs').to(device)
5
18 frames
/usr/local/lib/python3.7/dist-packages/torchsparse/nn/functional/conv.py in forward(ctx, features, kernel, neighbor_map, neighbor_offset, sizes, transpose)
38 torchsparse_backend.sparseconv_forward(features, out, kernel,
39 neighbor_map,
---> 40 neighbor_offset, transpose)
41 else:
42 # use the native pytorch XLA APIs for the TPU.
RuntimeError: CUDA out of memory. Tried to allocate 115.36 GiB (GPU 0; 14.76 GiB total capacity; 754.88 MiB already allocated; 12.97 GiB free; 770.00 MiB reserved in total by PyTorch)
Could this error be related to the version of torchsparse library installed?
Thank you!
I have tried to follow the same tutorial on my desktop PC, using a conda env with PyTorch 1.8.1 and CUDA 11.1. The same error still pops up at the line model = spvnas_specialized('SemanticKITTI_val_SPVNAS@65GMACs').to(device). I am using an NVIDIA RTX 2060 Super with 8 GB of VRAM, so I would expect the model to fit.
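In case it helps with reproducing, a quick way to confirm the exact versions in the environment (these are standard torch attributes, nothing project-specific):

```python
import torch

# Report the installed PyTorch build and the CUDA toolkit it was compiled against.
print(torch.__version__)
print(torch.version.cuda)          # None for CPU-only builds
print(torch.cuda.is_available())   # whether a usable GPU is visible
```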
Any suggestion @zhijian-liu ?
That's really strange. I cannot reproduce the error on my local machine (I'm using Torch 1.8.1 and the latest TorchSparse); it allocates around 1.4 GB of GPU memory. I will try to validate the issue on Google Colab.
I've updated the tutorial to fix the issue: https://colab.research.google.com/github/mit-han-lab/spvnas/blob/master/tutorial.ipynb. Please let me know if the updated version works for you.
Thank you @zhijian-liu. With your changes I can now run the tutorial successfully both in Colab and on my local machine!
I'm trying to run example.py in Colab, and when I load a pre-trained model as model = spvnas_specialized('SemanticKITTI_val_SPVNAS@20GMACs').to(device), I get the following out-of-memory error, which is strange, since the network does not actually have that many parameters:
RuntimeError: CUDA out of memory. Tried to allocate 127.56 GiB (GPU 0; 14.76 GiB total capacity; 1.25 GiB already allocated; 12.37 GiB free; 1.36 GiB reserved in total by PyTorch)
I also want to let you know that I opened an issue in the torchsparse repository, since there are problems running the tests on CPU with the Docker environment configured as you request.