tusen-ai / SST

Code for a series of works in LiDAR perception, including SST (CVPR 22), FSD (NeurIPS 22), FSD++ (TPAMI 23), FSDv2, and CTRL (ICCV 23, oral).
Apache License 2.0

Asking about the D2 notation in Section 3 of your paper #23

Closed seonhoon1002 closed 2 years ago

seonhoon1002 commented 2 years ago

Hello, I'm raising this issue to ask about some notation in your paper. You write: "From D3 to D0, the set of strides of their four stages for each model are {1,2,4,8}, {1,2,4,4}, {1,2,2,2} and {1,1,1,1}, respectively." So I understood the D notation as:

- D3: [1, 2, 4, 8]
- D2: [1, 2, 4, 4]
- D1: [1, 2, 2, 2]
- D0: [1, 1, 1, 1]

But you also write "PointPillars model D2", and I understood PointPillars to have strides [1, 2, 2, 2] (following your notation, that would be D1). So is this a notation mistake, or did I misunderstand PointPillars?
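To make my reading concrete, here is a small sketch of the mapping as I understand it (the dictionary and helper names are mine, just for illustration):

```python
# The D-notation mapping as quoted from the paper; names are mine.
D_LEVELS = {
    "D3": [1, 2, 4, 8],
    "D2": [1, 2, 4, 4],
    "D1": [1, 2, 2, 2],
    "D0": [1, 1, 1, 1],
}

def level_of(strides):
    """Return the D level whose per-stage strides match `strides`, if any."""
    for name, s in D_LEVELS.items():
        if s == strides:
            return name
    return None

# This is why I expected a [1, 2, 2, 2] PointPillars to be called D1:
assert level_of([1, 2, 2, 2]) == "D1"
```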

thanks

Abyssaledge commented 2 years ago

Thanks for your interest. Before further discussion, may I ask where you found a PointPillars with strides [1, 2, 2, 2]?

seonhoon1002 commented 2 years ago

(screenshot from the PointPillars paper attached)

I found it in the PointPillars paper and rechecked it in the OpenPCDet `pointpillar.yaml` config, like below:

```yaml
BACKBONE_2D:
    NAME: BaseBEVBackbone
    LAYER_NUMS: [3, 5, 5]
    LAYER_STRIDES: [2, 2, 2]
    NUM_FILTERS: [64, 128, 256]
    UPSAMPLE_STRIDES: [1, 2, 4]
    NUM_UPSAMPLE_FILTERS: [128, 128, 128]
```

and in the MMDetection3D configuration (`hv_pointpillars_secfpn_kitti.py`) too:

```python
model = dict(
    type='VoxelNet',
    voxel_layer=dict(
        max_num_points=32,  # max_points_per_voxel
        point_cloud_range=[0, -39.68, -3, 69.12, 39.68, 1],
        voxel_size=voxel_size,
        max_voxels=(16000, 40000)),  # (training, testing) max_voxels
    voxel_encoder=dict(
        type='PillarFeatureNet',
        in_channels=4,
        feat_channels=[64],
        with_distance=False,
        voxel_size=voxel_size,
        point_cloud_range=[0, -39.68, -3, 69.12, 39.68, 1]),
    middle_encoder=dict(
        type='PointPillarsScatter',
        in_channels=64,
        output_shape=[496, 432]),
    backbone=dict(
        type='SECOND',
        in_channels=64,
        layer_nums=[3, 5, 5],
        layer_strides=[2, 2, 2],
        out_channels=[64, 128, 256]),
    # (remaining fields omitted in the quoted excerpt)
```

Abyssaledge commented 2 years ago

The name `layer_strides` is a little bit misleading. `layer_strides = [2, 2, 2]` means the feature maps are downsampled 3 times, each time with stride 2. So in terms of feature-map sizes, the real (cumulative) network stride is [2, 4, 8].

On the Waymo Open Dataset, the PointPillars in MMDetection3D uses `layer_strides = [1, 2, 2]`, where the feature maps are only downsampled 2 times. So in terms of feature-map sizes, the real network stride is [1, 2, 4]. We split the last stage into two stages, so in our notation the strides are [1, 2, 4, 4], which is D2.

I hope I made it clear. Let me know if you have further questions.
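To make the distinction concrete, here is a minimal sketch (the function name is mine) converting per-stage downsampling factors into the cumulative feature-map strides discussed above:

```python
from itertools import accumulate
from operator import mul

def cumulative_strides(layer_strides):
    """Convert per-stage downsampling factors (like MMDetection3D's
    `layer_strides`) into each stage's cumulative stride relative to
    the input BEV feature map."""
    return list(accumulate(layer_strides, mul))

# KITTI-style PointPillars: every stage downsamples by 2.
assert cumulative_strides([2, 2, 2]) == [2, 4, 8]
# Waymo-style PointPillars in MMDetection3D: the first stage keeps resolution.
assert cumulative_strides([1, 2, 2]) == [1, 2, 4]
```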

seonhoon1002 commented 2 years ago

We may have had different views of the term "stride".

Thanks for the nice answer!