tianweiy / CenterPoint

MIT License

Error when training with dcn_head #163

Closed zwbai closed 3 years ago

zwbai commented 3 years ago

Hi @tianweiy,

Thanks for your great work! I am trying to run train.py but get an error similar to #120. The traceback is as follows:

    Traceback (most recent call last):
      File "tools/train.py", line 137, in <module>
        main()
      File "tools/train.py", line 132, in main
        logger=logger,
      File "/home/zwbai/Documents/CMM_Tracking/CenterPoint/det3d/torchie/apis/train.py", line 327, in train_detector
        trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
      File "/home/zwbai/Documents/CMM_Tracking/CenterPoint/det3d/torchie/trainer/trainer.py", line 543, in run
        epoch_runner(data_loaders[i], self.epoch, **kwargs)
      File "/home/zwbai/Documents/CMM_Tracking/CenterPoint/det3d/torchie/trainer/trainer.py", line 418, in train
        self.call_hook("after_train_iter")
      File "/home/zwbai/Documents/CMM_Tracking/CenterPoint/det3d/torchie/trainer/trainer.py", line 331, in call_hook
        getattr(hook, fn_name)(self)
      File "/home/zwbai/Documents/CMM_Tracking/CenterPoint/det3d/torchie/trainer/hooks/optimizer.py", line 18, in after_train_iter
        trainer.outputs["loss"].backward()
      File "/home/zwbai/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/zwbai/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
        allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
      File "/home/zwbai/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/autograd/function.py", line 89, in apply
        return self._forward_cls.backward(self, *args) # type: ignore
      File "/home/zwbai/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/autograd/function.py", line 210, in wrapper
        outputs = fn(ctx, *args)
      File "/home/zwbai/Documents/CMM_Tracking/CenterPoint/det3d/ops/dcn/deform_conv.py", line 93, in backward
        cur_im2col_step)
    RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

My environment is PyTorch 1.8.1 + CUDA 11.1 + cuDNN 8.0.5.

BTW, I also changed all the .view() calls to .reshape() but still get the error. If I set dcn_head = False or use a batch size of 1, the error disappears, and I don't know why. I guess this may be due to the DCN code, but I don't know exactly why or how to fix it. Could you please give some suggestions?

tianweiy commented 3 years ago

It is probably related to the DCN code, which I am also not sure how to debug or fix. You can try the version without DCN; the new config is actually already much better than the original DCN numbers reported in the paper: https://github.com/tianweiy/CenterPoint/blob/master/configs/nusc/README.md#voxelnet
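
For anyone debugging this message, here is a minimal sketch of what the RuntimeError means in plain PyTorch. It does not use the det3d DCN op. `ScaleWithContiguousGrad` is a hypothetical stand-in for a custom op, and whether the same pattern applies inside `det3d/ops/dcn/deform_conv.py` is an assumption, not something confirmed in this thread. The traceback ends inside a call that takes `cur_im2col_step`, so the failing view may live in the compiled extension rather than in Python code. That would be consistent with the `.view()` to `.reshape()` replacement not helping.

```python
import torch

# A transposed tensor shares storage with the original but has permuted
# strides, so it is no longer contiguous in memory.
x = torch.arange(24.0).reshape(2, 3, 4)
y = x.transpose(1, 2)  # shape (2, 4, 3), strides (12, 1, 4)
assert not y.is_contiguous()

# .view() cannot merge the last two dimensions without copying, so it raises
# the same kind of RuntimeError as in the traceback above.
try:
    y.view(2, 12)
    view_failed = False
except RuntimeError as err:
    view_failed = True
    print("view failed:", err)

# .reshape() copies when it has to; .contiguous() makes the copy explicit.
a = y.reshape(2, 12)
b = y.contiguous().view(2, 12)


class ScaleWithContiguousGrad(torch.autograd.Function):
    """Hypothetical custom op (not the DCN op): multiplies its input by 2."""

    @staticmethod
    def forward(ctx, inp):
        return inp * 2.0

    @staticmethod
    def backward(ctx, grad_output):
        # The incoming gradient can be non-contiguous, e.g. when the output
        # was transposed downstream. Copy it into contiguous memory first.
        grad_output = grad_output.contiguous()
        # After .contiguous(), flattening with .view() is always valid.
        flat = grad_output.view(-1)
        return (flat * 2.0).view(grad_output.shape)


inp = torch.randn(2, 3, 4, requires_grad=True)
weights = torch.randn(2, 4, 3)
out = ScaleWithContiguousGrad.apply(inp)
loss = (out.transpose(1, 2) * weights).sum()
loss.backward()
```

If the DCN backward passes `grad_output` to the extension as-is, calling `.contiguous()` on it at the top of `backward` might be worth trying, but that is untested here.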