[Bug] Mysterious Dimension Swapping in BEVFusion's TransfusionHead

barrydoooit commented 3 months ago

Prerequisite

[X] I have searched Issues and Discussions but cannot get the expected help.
[X] I have read the FAQ documentation but cannot get the expected help.
[X] The bug has not been fixed in the latest version (dev-1.x) or latest version (dev-1.0).

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.4, V11.4.152
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 1.10.1+cu113
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.3
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.2
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.2+cu113
OpenCV: 4.10.0
MMEngine: 0.10.4
MMDetection: 3.2.0+d509b75

Reproduces the problem - code sample

bash tools/dist_train.sh [configs] 1

Reproduces the problem - command or script

bash tools/dist_train.sh [configs] 1

Reproduces the problem - error message

  File "/usr/local/lib/python3.8/dist-packages/mmdet3d/models/detectors/base.py", line 75, in forward
    return self.loss(inputs, data_samples, **kwargs)
  File "/workspace/bevfusion/bevfusion.py", line 301, in loss
    bbox_loss = self.bbox_head.loss(feats, batch_data_samples)
  File "/workspace/bevfusion/transfusion_head.py", line 761, in loss
    loss = self.loss_by_feat(preds_dicts, batch_gt_instances_3d)
  File "/workspace/bevfusion/transfusion_head.py", line 786, in loss_by_feat
    loss_heatmap = self.loss_heatmap(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/gaussian_focal_loss.py", line 176, in forward
    loss_reg = self.loss_weight * gaussian_focal_loss(
  File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/utils.py", line 121, in wrapper
    loss = loss_func(pred, target, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/gaussian_focal_loss.py", line 35, in gaussian_focal_loss
    pos_loss = -(pred + eps).log() * (1 - pred).pow(alpha) * pos_weights
RuntimeError: The size of tensor a (136) must match the size of tensor b (120) at non-singleton dimension 3

Additional information

I am using a custom dataset with a non-square BEV feature map: the sparse shape is [960, 1088, z], so the BEV feature map has spatial shape [120, 136]. The sparse shape is passed as "grid_size", which is used to compute feature_map_size here: https://github.com/open-mmlab/mmdetection3d/blob/fe25f7a51d36e3702f961e198894580d83c4387b/projects/BEVFusion/bevfusion/transfusion_head.py#L701-L702 feature_map_size is then used to create the heatmap, but the X and Y dimensions are swapped at that point, so the heatmap ends up with spatial shape [136, 120]: https://github.com/open-mmlab/mmdetection3d/blob/fe25f7a51d36e3702f961e198894580d83c4387b/projects/BEVFusion/bevfusion/transfusion_head.py#L703-L704

The error is finally triggered at https://github.com/open-mmlab/mmdetection3d/blob/fe25f7a51d36e3702f961e198894580d83c4387b/projects/BEVFusion/bevfusion/transfusion_head.py#L785-L790 (specifically in https://github.com/open-mmlab/mmdetection/blob/cfd5d3a985b0249de009b67d04f37263e11cdf3d/mmdet/models/losses/gaussian_focal_loss.py#L35), where the heatmap and the dense heatmap (which has the same spatial shape as the BEV feature map) must have the same spatial shape. The problem does not show up with nuScenes because its BEV feature map is square, so the swap has no visible effect. In short, this swap when initializing the heatmap is ambiguous and makes it impossible for the heatmap to match the shape of a non-square BEV feature map.
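The shape arithmetic behind the mismatch can be sketched in plain Python. This is a minimal illustration, not the code from transfusion_head.py: the helper name heatmap_shapes is hypothetical, and out_size_factor=8 is an assumption inferred from 960 -> 120 and 1088 -> 136.

```python
def heatmap_shapes(grid_size, out_size_factor):
    """Return the (swapped, unswapped) spatial shapes of the target heatmap.

    Hypothetical helper: it only mirrors the index arithmetic described
    in the issue, not the actual TransFusionHead implementation.
    """
    # feature_map_size keeps the axis order of grid_size, here [X, Y].
    feature_map_size = [dim // out_size_factor for dim in grid_size[:2]]
    # Heatmap built from indices [1], [0]: the behaviour reported above.
    swapped = (feature_map_size[1], feature_map_size[0])
    # Heatmap built from indices [0], [1]: same order as the BEV feature.
    unswapped = (feature_map_size[0], feature_map_size[1])
    return swapped, unswapped


# Spatial shape of the BEV feature map reported in the issue.
bev_feature_shape = (120, 136)

swapped, unswapped = heatmap_shapes([960, 1088, 41], out_size_factor=8)
print(swapped, unswapped)              # (136, 120) (120, 136)
print(swapped == bev_feature_shape)    # False: the last dims are 120 vs 136
print(unswapped == bev_feature_shape)  # True
```

With a square grid both tuples are identical, which is consistent with the mismatch never appearing on nuScenes.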

barrydoooit commented 3 months ago

In addition, reverting this swap, in other words initializing the heatmap with shape feature_map_size[0], feature_map_size[1], lets the model train successfully.

cxnaive commented 3 months ago

Are you training a camera-only model?

barrydoooit commented 3 months ago

> Are you training a camera-only model?

No, I'm working with LiDAR-only and LC-fusion models. But the bug would still be there with camera-only BEVFusion, since the vtransform outputs a BEV feature map with the same spatial shape as the SCN output.

cxnaive commented 3 months ago

In fact, this operation is present in the original code of the TransFusion head in BEVFusion. However, mit-bevfusion and the BEVFusion from NeurIPS 2022 differ in the final step of vtransform: the X and Y outputs of their vtransform are reversed.

barrydoooit commented 3 months ago

> In fact, this operation is present in the original code of transfusion head in BEVFusion. However, the mit-bevfusion and the BEVFusion from NeurIPS 2022 differ in the final step of vtransform. The outputs X and Y from their vtransform are reversed

Thanks for pointing that out. But what do you mean by "reversed"? Do you mean that the vtransform output has spatial shape [y, x] rather than [x, y] like the LiDAR feature from the SCN? I'm a little confused, because if that were the case, the two feature maps could not be stacked and processed by the 2D pts_backbone.

cxnaive commented 3 months ago

I'm confused about this too, but in the vtransform of the NeurIPS 2022 BEVFusion the output is [y, x], and in its (and BEVDet's) TransFusion head the position you mention is also [y, x].

cxnaive commented 3 months ago

Have you tried a non-square bev LCFusion? Does the 2d backbone of the bev model accept input properly?

barrydoooit commented 3 months ago

> Have you tried a non-square bev LCFusion? Does the 2d backbone of the bev model accept input properly?

Yes, that's exactly the case I'm encountering. I have sparse_shape=[960, 1088, 41], which corresponds to x, y, z in LiDAR coordinates. The x, y, z bounds of LSS are adjusted accordingly. In this case the 2D backbone (pts_backbone) does accept the feature maps properly.

barrydoooit commented 3 months ago

@cxnaive Actually there is another confusing snippet, which might nevertheless be a hint for understanding these ambiguous spatial shapes: https://github.com/open-mmlab/mmdetection3d/blob/fe25f7a51d36e3702f961e198894580d83c4387b/projects/BEVFusion/bevfusion/transfusion_head.py#L727-L731 Say sparse_shape (xyz) is [960, 1088, 41]. The predicted center (which should be [x, y], since it is used to form a LiDARInstance3DBoxes) is reversed when it is used to index the hotspot in the heatmap. That means the heatmap should also be reversed (as it is now), i.e. shaped as [1088/8, 960/8]. But the SCN outputs a feature map of [960/8, 1088/8] when using xyz voxelization, as seen in: https://github.com/open-mmlab/mmdetection3d/blob/fe25f7a51d36e3702f961e198894580d83c4387b/projects/BEVFusion/bevfusion/sparse_encoder.py#L144-L146 This is a conflict, yet the operations in TransfusionHead are confusingly correct (except for the heatmap coordinate swap): using 'center' instead of 'center[[1,0]]' to index a heatmap of shape [960/8, 1088/8] ruins the training, and the model never converges.

cxnaive commented 3 months ago

draw_heatmap_gaussian(heatmap[gt_labels_3d[idx]], center_int, radius) is the call in the original version of the TransFusion head. That original version corresponds to BEV features in [Y, X] format, while center_int[[1,0]] corresponds to [X, Y].

cxnaive commented 3 months ago

So, the grid_size should also be reversed in the same way, but this was forgotten in the transfusion head of this version. Alternatively, consider using the original transfusion head but reversing the BEV features.
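As a rough illustration of what "reversing the BEV features" means, the sketch below swaps the two spatial axes of a small nested-list grid. The helper swap_bev_axes and the tiny 3 x 4 grid are hypothetical and only show the index relationship; with real tensors one would presumably swap the last two dimensions instead.

```python
def swap_bev_axes(feature_map):
    """Swap the two spatial axes of a 2D grid stored as nested lists.

    Hypothetical helper for illustration: input indexed [x][y] becomes
    output indexed [y][x].
    """
    return [list(column) for column in zip(*feature_map)]


# A tiny [X, Y] grid with 3 cells along x and 4 along y; each cell stores
# its own (x, y) index so the effect of the swap is easy to inspect.
bev_xy = [[(x, y) for y in range(4)] for x in range(3)]

bev_yx = swap_bev_axes(bev_xy)

print(len(bev_xy), len(bev_xy[0]))  # 3 4
print(len(bev_yx), len(bev_yx[0]))  # 4 3
print(bev_yx[2][1])                 # (1, 2): row is y, column is x
```

After the swap, a cell addressed as [y][x] holds what was stored at [x][y], which is the layout a [Y, X] heatmap expects.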

cxnaive commented 3 months ago

https://github.com/open-mmlab/mmdetection3d/blob/fe25f7a51d36e3702f961e198894580d83c4387b/mmdet3d/models/utils/gaussian.py#L46C5-L53C69 From this, it can be seen that the default heatmap shape in mmdet3d is [Y, X]
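The effect of this [Y, X] convention on the center index can be sketched as follows. The helper mark_center is hypothetical: it only mimics the row = y, column = x indexing and marks a single cell, whereas the real draw_heatmap_gaussian draws a clipped Gaussian. The center value (100, 130) is an arbitrary example chosen to fit a [120, 136] map.

```python
def make_heatmap(rows, cols):
    """Create a rows x cols grid of zeros as nested lists."""
    return [[0.0] * cols for _ in range(rows)]


def mark_center(heatmap, center):
    """Mark one cell using the [Y, X] convention (hypothetical helper).

    center is read as (x, y); rows are indexed by y and columns by x.
    Returns False when the center falls outside the heatmap.
    """
    x, y = int(center[0]), int(center[1])
    height, width = len(heatmap), len(heatmap[0])
    if 0 <= x < width and 0 <= y < height:
        heatmap[y][x] = 1.0
        return True
    return False


center_xy = (100, 130)  # x index within 120 cells, y index within 136 cells

# Heatmap stored like the BEV feature from the sparse encoder: [X, Y].
heatmap_xy = make_heatmap(120, 136)

# Passing (x, y) directly: y=130 is used as a row index and exceeds 120 rows.
hit_plain = mark_center(heatmap_xy, center_xy)

# Passing the reversed center, as center[[1, 0]] does: row 100, column 130.
hit_swapped = mark_center(heatmap_xy, (center_xy[1], center_xy[0]))

print(hit_plain, hit_swapped)  # False True
print(heatmap_xy[100][130])    # 1.0, i.e. the cell at X=100, Y=130
```

Under these assumptions the reversed center is what lands on the intended cell of an [X, Y] heatmap, which agrees with the observation that indexing with the unreversed center breaks training.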

barrydoooit commented 3 months ago

> https://github.com/open-mmlab/mmdetection3d/blob/fe25f7a51d36e3702f961e198894580d83c4387b/mmdet3d/models/utils/gaussian.py#L46C5-L53C69 From this, it can be seen that the default heatmap shape in mmdet3d is [Y, X]

This clears up most of the confusion. That's why initializing the heatmap with shape feature_map_size[0], feature_map_size[1] (i.e., the same as the BEV feature) is consistent with everything else, right?

cxnaive commented 3 months ago

Yes, you can check the implementation of CenterHead in CenterPoint within mmdet3d, which also uses [Y, X] for BEV features. However, the BEV features obtained from the sparse encoder in BEVFusion are [X, Y].

chenwen60 commented 3 months ago

[Image: WeCom screenshot; the upload did not complete]
