mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
https://torchsparse.mit.edu
MIT License

[BUG] When calling model twice on same input, dimensions of output features do not match output coords #196

Closed: fishbotics closed this issue 1 year ago

fishbotics commented 1 year ago


Current Behavior

Hi,

I am running into a strange issue that I don't know how to reproduce reliably. I've made a very simple model, and when I call it once on the input, the results make sense. When I call it a second time (on the original input), the length of the coords is wrong, while the length of the feats is not. This means I cannot, for example, use GlobalAveragePool, because the coordinates do not match the features.

In [33]: class WeirdBehavior(nn.Module):
    ...:     def __init__(self):
    ...:         super().__init__()
    ...:         self.conv = spnn.Conv3d(
    ...:             4, 32, kernel_size=(3, 3, 3), stride=(2, 2, 2), bias=False
    ...:         )
    ...:
    ...:     def forward(self, pc):
    ...:         print("Input C shape:", pc.C.shape, ", Input F shape:", pc.F.shape)
    ...:         x = self.conv(pc)
    ...:         print("Output C shape:", x.C.shape, ", Output F shape:", x.F.shape)
    ...:         return x
    ...:

In [34]: mdl = WeirdBehavior().cuda()

In [35]: for b in dm.train_dataloader():
    ...:     pc = b['sparse_pc'].cuda(); break
    ...:

In [36]: mdl(pc)
Input C shape: torch.Size([123359, 4]) , Input F shape: torch.Size([123359, 4])
Output C shape: torch.Size([288006, 4]) , Output F shape: torch.Size([288006, 32])
Out[36]: <torchsparse.tensor.SparseTensor at 0x7fcd69851d00>

In [37]: mdl(pc)
Input C shape: torch.Size([123359, 4]) , Input F shape: torch.Size([123359, 4])
Output C shape: torch.Size([123359, 4]) , Output F shape: torch.Size([288006, 32])
Out[37]: <torchsparse.tensor.SparseTensor at 0x7fcd5208b190>
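
The practical impact: the second (and, as far as I can tell, any later) output violates the invariant that downstream ops like GlobalAveragePool rely on. A minimal check, reusing mdl and pc from the transcript above:

x = mdl(pc)  # any call after the first on the same input
# Expected to fail here, since the input's caches persist across calls.
assert x.C.shape[0] == x.F.shape[0], (
    f"coords/feats mismatch: {x.C.shape[0]} vs {x.F.shape[0]}"
)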

One thing I noticed is that the cmaps and kmaps of the input variable pc are being modified (I checked by doing a deep copy and comparing before and after the forward pass). My guess is that in the __add__ function, these are assigned by reference to another tensor and then modified on that other tensor (which in turn modifies them on the input tensor). Just to show that:

In [65]: pc.kmaps
Out[65]: {}

In [66]: pc.cmaps
Out[66]: {}

In [67]: for b in dm.train_dataloader():
    ...:     pc = b['sparse_pc'].cuda(); break
    ...: 

In [68]: print(pc.kmaps, pc.cmaps)
{} {}

In [69]: mdl(pc)
Input C shape: torch.Size([123259, 4]) , Input F shape: torch.Size([123259, 4])
Output C shape: torch.Size([288215, 4]) , Output F shape: torch.Size([288215, 32])
Out[69]: <torchsparse.tensor.SparseTensor at 0x7fcd692bfc40>

In [70]: print(pc.kmaps, pc.cmaps)
{((1, 1, 1), (3, 3, 3), (2, 2, 2), (1, 1, 1)): [tensor([[    96,    223],
        [   102,    277],
        [   129,    361],
        ...,
        [122891, 287398],
        [122999, 287659],
        [123194, 288067]], device='cuda:0'), tensor([ 6520,  6675,  6520,  6342,  6721,  6342,  6520,  6675,  6520, 24047,
        24658, 24047, 24011, 24285, 24011, 24047, 24658, 24047,  6520,  6675,
         6520,  6342,  6721,  6342,  6520,  6675,  6520], device='cuda:0'), (123259, 288215)]} {(2, 2, 2): tensor([[  0,  64,   4,   0],
        [  0,  66,   4,   0],
        [  0,  76,   0,   0],
        ...,
        [318, 424,   0,  19],
        [318, 426,   0,  19],
        [318, 426,   2,  19]], device='cuda:0', dtype=torch.int32)}
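
To make the check explicit, here is a minimal sketch (reusing mdl and a freshly loaded pc from above; cmaps and kmaps are the plain dicts shown in the transcript):

# Snapshot the input's cache keys before the forward pass, then compare
# afterwards. New keys mean the convolution wrote its coordinate/kernel
# maps back into the *input* tensor, not just into its output.
cmap_keys_before = set(pc.cmaps)
kmap_keys_before = set(pc.kmaps)

mdl(pc)

print("new cmap keys:", set(pc.cmaps) - cmap_keys_before)  # e.g. {(2, 2, 2)}
print("new kmap keys:", set(pc.kmaps) - kmap_keys_before)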

Expected Behavior

I expect the length of the coords to always match the length of the feats.

Environment

- GCC: 9.4.0
- NVCC: V11.3.109
- PyTorch: 1.12.0+cu113
- PyTorch CUDA: 11.3
- TorchSparse: 1.4.0

Anything else?

No response

fishbotics commented 1 year ago

I'm realizing that these tensors are stateful (unlike traditional PyTorch tensors). I'll leave this to the developers to close, but this is a non-issue. Perhaps this could go in the README?

zhijian-liu commented 1 year ago

Thanks for reporting this issue. Yes, sparse tensors are stateful, which means you need to reset the cache before running the model a second time.

fishbotics commented 1 year ago

@zhijian-liu : what's the right way to reset the cache?

zhijian-liu commented 1 year ago
pc.cmaps.clear()
pc.kmaps.clear()
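
Putting it together, a minimal sketch of the fix (run_fresh is a hypothetical helper name, reusing mdl and pc from the transcripts above):

def run_fresh(model, x):
    # Clear the cached coordinate/kernel maps so each forward pass
    # rebuilds them from the current coordinates instead of reusing
    # state left over from a previous call.
    x.cmaps.clear()
    x.kmaps.clear()
    return model(x)

out1 = run_fresh(mdl, pc)
out2 = run_fresh(mdl, pc)
assert out2.C.shape[0] == out2.F.shape[0]  # coords and feats agree again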