mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
https://torchsparse.mit.edu
MIT License

[BUG] "ValueError: The capacity of hashtable is not sufficient." in examples/backbone.py with version 2.1.0 #214

Closed: 96lives closed this issue 1 year ago

96lives commented 1 year ago

Is there an existing issue for this?

Current Behavior

Hi, I'm trying to get familiar with torchsparse 2.1.0, so I ran examples/backbone.py. I changed the code so that the batch dimension comes first, as mentioned in the docs.

import numpy as np
import torch
from torch import nn

from torchsparse import SparseTensor
from torchsparse.backbones import SparseResNet21D, SparseResUNet42
from torchsparse.utils.quantize import sparse_quantize

@torch.no_grad()
def main() -> None:
    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

    for backbone in [SparseResNet21D, SparseResUNet42]:
        print(f'{backbone.__name__}:')
        model: nn.Module = backbone(in_channels=4, width_multiplier=1.0)
        model = model.to(device).eval()

        # generate data
        input_size, voxel_size = 1000, 0.2
        inputs = np.random.uniform(-100, 100, size=(input_size, 4))
        pcs, feats = inputs[:, :3], inputs
        pcs -= np.min(pcs, axis=0, keepdims=True)
        pcs, indices = sparse_quantize(pcs, voxel_size, return_index=True)
        # batch index goes in the first column (batch-first layout, as
        # mentioned in the 2.1.0 docs); the last three columns hold the voxel coordinates
        coords = np.zeros((pcs.shape[0], 4))
        coords[:, -3:] = pcs[:, :3]
        coords[:, 0] = 0
        coords = torch.as_tensor(coords, dtype=torch.int)
        feats = torch.as_tensor(feats[indices], dtype=torch.float)
        input = SparseTensor(coords=coords, feats=feats).to(device)

        # forward
        outputs = model(input)

        # print feature shapes
        for k, output in enumerate(outputs):
            print(f'output[{k}].F.shape = {output.feats.shape}')

if __name__ == '__main__':
    main()

But when I run the code, I get an error that the hashtable capacity is not sufficient (shown below). It does not occur if I reduce the number of points from 1000 to 100 (by changing input_size to 100). Could you help me with this? I figure this must be a bug, since 1000-point inputs are commonly used in the literature. Thanks in advance :)

SparseResNet21D:
/home/ds/anaconda3/envs/gca-lightning/lib/python3.10/site-packages/torch/nn/modules/module.py:1130: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor
division, use torch.div(a, b, rounding_mode='floor').
  return forward_call(*input, **kwargs)
Traceback (most recent call last):
  File "/home/ds/gca-outdoor-deploy/torch_sparse_example.py", line 41, in <module>
    main()
  File "/home/ds/anaconda3/envs/gca-lightning/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ds/gca-outdoor-deploy/torch_sparse_example.py", line 33, in main
    outputs = model(input)
  File "/home/ds/anaconda3/envs/gca-lightning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "torchsparse/backbones/resnet.pyx", line 53, in torchsparse.backbones.resnet.SparseResNet.forward
  File "/home/ds/anaconda3/envs/gca-lightning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ds/anaconda3/envs/gca-lightning/lib/python3.10/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ds/anaconda3/envs/gca-lightning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ds/anaconda3/envs/gca-lightning/lib/python3.10/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ds/anaconda3/envs/gca-lightning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "torchsparse/nn/modules/conv.pyx", line 99, in torchsparse.nn.modules.conv.Conv3d.forward
  File "torchsparse/nn/functional/conv/conv.pyx", line 89, in torchsparse.nn.functional.conv.conv.conv3d
  File "torchsparse/nn/functional/conv/kmap/build_kmap.pyx", line 83, in torchsparse.nn.functional.conv.kmap.build_kmap.build_kernel_map
  File "torchsparse/nn/functional/conv/kmap/func/hashmap_on_the_fly.pyx", line 63, in torchsparse.nn.functional.conv.kmap.func.hashmap_on_the_fly.build_kmap_implicit_GEMM_hashmap_on_the_fly
ValueError: The capacity of hashtable is not sufficient.

Expected Behavior

No response

Environment

- GCC: 8.4.0
- NVCC: 11.2
- PyTorch: 1.12.1+cu113
- PyTorch CUDA: 11.3
- TorchSparse: 2.1.0+torch112cu113

Anything else?

No response

96lives commented 1 year ago

One odd thing is that SparseResUNet42 raises no errors while SparseResNet21D does...

kentang-mit commented 1 year ago

Hi @96lives,

Thanks for pointing out the issue. You may set the kmap_mode to hashmap using the following code snippet:

import torchsparse.nn.functional as F
F.set_kmap_mode("hashmap")

After this modification, the error should go away. Our default hashmap construction method (hashmap_on_the_fly) is designed for large-scale inputs. The hashmap mode is slightly slower but is also expected to work well for small inputs. In fact, if the problem only has 1000 input points, sparse convolution does not offer a big advantage over point-based primitives.
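For reference, a minimal sketch of where this call could go in the reproduction script above (assuming the mode only needs to be set once, before the first forward pass):

import torchsparse.nn.functional as F

@torch.no_grad()
def main() -> None:
    # switch the kernel-map builder before any sparse convolution runs
    F.set_kmap_mode("hashmap")
    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    # ... rest of the original script unchanged ...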

Besides, SparseResNet21D is a detection backbone that uses kernel_size=3, stride=2 downsampling layers. These layers dilate the activated regions of the input (we follow SpConv's definition). SparseResUNet42, in contrast, is a segmentation backbone that only uses kernel_size=2, stride=2 downsampling layers, so the activated regions do not dilate after each downsampling stage. This is likely why you see "insufficient hashtable capacity" with SparseResNet21D: hashmap_on_the_fly currently sizes the hash tables based on the initial input size.
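To build intuition for this dilation effect, here is a minimal 1D sketch in plain Python (not torchsparse code; it assumes SpConv-style downsampling, where an output site becomes active as soon as its kernel window covers any active input site):

import math

def downsample_active(active, kernel_size, stride):
    # An output site o is active iff its window
    # [o*stride, o*stride + kernel_size - 1] contains an active input site.
    out = set()
    for x in active:
        lo = math.ceil((x - kernel_size + 1) / stride)
        hi = x // stride
        out.update(range(lo, hi + 1))
    return out

active = set(range(0, 200, 5))  # 40 isolated points on a 1D grid
for k in (2, 3):
    sites, counts = set(active), [len(active)]
    for _ in range(3):  # three stride-2 downsampling stages
        sites = downsample_active(sites, kernel_size=k, stride=2)
        counts.append(len(sites))
    print(f"kernel_size={k}, stride=2: active sites per stage = {counts}")

With kernel_size=2 each input site falls into exactly one stride-2 window, so the active count never grows; with kernel_size=3 a single input site can activate two output sites, so the count can exceed the initial input size, which is what overflows a hash table sized from that initial input.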

Best, Haotian

96lives commented 1 year ago

Thanks! It does work