Closed: miaozhang97bosch closed this issue 1 month ago
Hi Miao, this is a valid point. We experimented with various encoding schemes and ultimately converged on VoxelNet due to its superior performance. PointPillars might offer better results, but we currently have no evidence either way; as far as I recall, we did not experiment with it, so you are free to test it out.
Recent papers on BEV segmentation from multimodal data tend to prefer VoxelNet over PointPillars; one such work is BEVFusion. While it is hard to argue conclusively for either choice, my guess is that the preference comes from VoxelNet's use of 3D convolutions, rather than generating pseudo-images from point clouds as PointPillars does. The use of 3D convolutions for data that effectively lives in the BEV plane (as radar does) might not seem very intuitive, but be aware that VoxelNet has additional components, such as the convolutional middle layers and a region proposal network, that may offer an additional advantage.
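To make the architectural contrast concrete, here is a minimal PyTorch sketch (layer sizes and class names are hypothetical, not taken from this repository or from BEVFusion): VoxelNet-style middle layers run 3D convolutions over a dense voxel volume and then fold the height axis into the channels to form a BEV map, whereas PointPillars scatters per-pillar features into a 2D pseudo-image once and then uses only 2D convolutions.

```python
import torch
import torch.nn as nn

# VoxelNet-style middle layers: 3D convolutions over a dense voxel volume
# (B, C, D, H, W), then fold the remaining height axis D' into the channel
# axis to get a BEV feature map (B, C*D', H, W). Sizes are illustrative only.
class VoxelNetStyleMiddle(nn.Module):
    def __init__(self, in_ch=16, mid_ch=32):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=3, stride=(2, 1, 1), padding=1),
            nn.ReLU(),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=3, stride=(2, 1, 1), padding=1),
            nn.ReLU(),
        )

    def forward(self, vox):                    # vox: (B, C, D, H, W)
        x = self.conv3d(vox)                   # (B, mid_ch, D', H, W)
        b, c, d, h, w = x.shape
        return x.reshape(b, c * d, h, w)       # BEV map: (B, c*d, H, W)

# PointPillars-style encoding: per-pillar features are scattered into a 2D
# pseudo-image once, after which everything is plain 2D convolutions.
class PillarsStyleBEV(nn.Module):
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, pseudo_image):           # (B, C, H, W), no height axis
        return self.conv2d(pseudo_image)

vox = torch.randn(1, 16, 8, 200, 200)          # small height axis D=8
print(VoxelNetStyleMiddle()(vox).shape)        # torch.Size([1, 64, 200, 200])
print(PillarsStyleBEV()(torch.randn(1, 64, 200, 200)).shape)
```

The extra 3D stage is where the middle layers can still extract structure along the height axis, which a pseudo-image never sees again after scattering.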
By the way, we do not use a predefined voxel grid that may contain empty voxels. Instead, we voxelize the radar data and use a dense representation to avoid sparsity, so there are no issues from empty cells after voxelization. In other words, if you input a representation that has barely any height extent, the output will simply reflect this. The radar features are expected to have shape (B, C, H, W), where typically C=128 and H=W=200 (the BEV image size).
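As a rough illustration of the dense voxelization and the expected feature shape (the helper name, grid extents, and bin counts are assumptions for the sketch, not this repository's actual configuration):

```python
import torch

def voxelize_radar_dense(points, grid=(200, 200), z_bins=8, extent=50.0):
    """Scatter radar points into a dense occupancy volume.

    points: (N, 3) tensor of (x, y, z) in meters; the grid covers
    [-extent, extent] in x/y. All parameters here are illustrative.
    """
    h, w = grid
    vol = torch.zeros(1, z_bins, h, w)          # dense: every cell exists
    ix = ((points[:, 0] + extent) / (2 * extent) * w).long().clamp(0, w - 1)
    iy = ((points[:, 1] + extent) / (2 * extent) * h).long().clamp(0, h - 1)
    iz = ((points[:, 2] + 5.0) / 10.0 * z_bins).long().clamp(0, z_bins - 1)
    vol[0, iz, iy, ix] = 1.0
    return vol

points = torch.randn(300, 3) * 20.0
vol = voxelize_radar_dense(points)              # (1, 8, 200, 200)

# After the radar encoder, features are expected as (B, C, H, W):
radar_feats = torch.nn.Conv2d(8, 128, kernel_size=3, padding=1)(vol)
assert radar_feats.shape == (1, 128, 200, 200)  # C=128, H=W=200 BEV map
```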
Thanks for your reply. I think the recent papers you referred to mostly work with lidar point clouds or with point clouds from 4D radar sensors that include elevation information. But you are right, a dense representation can reflect that during training. Regarding your query initialization, I assume it would benefit more from a real 4D radar dataset such as View of Delft or TJ4RADAR.
Hello, thanks for sharing this interesting work. I wonder about the rationale for using voxels rather than pillars in the radar encoding part. As far as I know, the radar point cloud in NuScenes does not contain any real height information about objects: the z value of each point is derived from the relative pose of the radar sensor, which means you only get one point at any given (x, y) location. In that case, if voxels are used, most of them will be empty, since there are no points there, and only very limited 3D information can be extracted. I would appreciate any hints on this, or an ablation study of different encoding structures. Looking forward to your reply!
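To quantify the sparsity concern raised here, a small numpy sketch with synthetic single-elevation points (hypothetical numbers, not actual NuScenes data):

```python
import numpy as np

# Hypothetical NuScenes-style radar sweep: ~200 points whose z is fixed by
# the sensor pose, so every point shares (approximately) the same height.
rng = np.random.default_rng(0)
points = np.stack([
    rng.uniform(-50.0, 50.0, 200),   # x in meters
    rng.uniform(-50.0, 50.0, 200),   # y in meters
    np.full(200, 0.5),               # z derived from sensor mounting height
], axis=1)

# Voxelize into a 3D grid: 0.5 m cells in x/y, 10 vertical bins over [-5, 5] m.
xy_res, z_edges = 0.5, np.linspace(-5.0, 5.0, 11)
ix = ((points[:, 0] + 50.0) / xy_res).astype(int)
iy = ((points[:, 1] + 50.0) / xy_res).astype(int)
iz = np.digitize(points[:, 2], z_edges) - 1

occupied = len({(a, b, c) for a, b, c in zip(ix, iy, iz)})
total = 200 * 200 * 10
print(f"occupied voxels: {occupied} / {total} ({100 * occupied / total:.4f}%)")
# With a single z value, only one of the 10 vertical bins is ever used.
```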