junjie18 / CMT

[ICCV 2023] Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

Performance on Waymo dataset #27

Open lcc815 opened 1 year ago

lcc815 commented 1 year ago

Hi authors,

I am curious about the model's performance on the Waymo dataset, which is not mentioned in the paper. Have you conducted any relevant experiments, and if so, what were the results?

Thanks

junjie18 commented 1 year ago

Sorry, I have made some simple attempts on the Waymo dataset but haven't obtained outstanding results yet. The experiments were done in a LiDAR-only setting, since I haven't found a suitable hyperparameter setting for the DETR-like head. One interesting observation: we have conducted experiments on our private dataset, and as the dataset becomes larger, the CMT head achieves better results than the CenterPoint head, even in the LiDAR-only setting.

lcc815 commented 1 year ago

Quite interesting. Any idea what causes this phenomenon?

junjie18 commented 1 year ago

I do not know why, but I think this may give us a chance to scale up 3D perception models with larger data and larger backbones. The model architecture consists entirely of transformer layers, so there are many mature parallelism techniques to use, such as pipeline parallelism (PP), tensor parallelism (TP), and FSDP. In camera-only 3D detection, models using ViT now obtain SoTA performance. In LiDAR-only 3D detection, I have made some attempts at a point cloud transformer architecture before: with each voxel as a token, a naive transformer backbone + DETR head obtains 52% mAP, which is lower than the 58% of CenterPoint with VoxelNet + SECOND. Maybe there are still some problems to be solved.
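For readers unfamiliar with the "each voxel as a token" idea, here is a minimal sketch of how a point cloud could be turned into a token sequence. It is not the code used in the experiments above: the function name `voxel_tokens`, the 0.5 m voxel size, and the choice of a centroid plus point count as the token content are all assumptions made for illustration. A real pipeline would typically use learned per-voxel features instead.

```python
from collections import defaultdict
from math import floor


def voxel_tokens(points, voxel_size=(0.5, 0.5, 0.5)):
    """Group (x, y, z) points into voxels; return one token per non-empty voxel.

    Each token is (voxel_index, centroid, point_count). The token layout is a
    hypothetical choice for illustration, not the CMT authors' design.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        # Integer voxel coordinates; floor keeps negative coordinates consistent.
        key = (
            floor(x / voxel_size[0]),
            floor(y / voxel_size[1]),
            floor(z / voxel_size[2]),
        )
        buckets[key].append((x, y, z))

    tokens = []
    for key in sorted(buckets):  # deterministic token order
        pts = buckets[key]
        n = len(pts)
        # Mean-pool the points in the voxel into a single centroid.
        centroid = tuple(sum(axis) / n for axis in zip(*pts))
        tokens.append((key, centroid, n))
    return tokens
```

Only non-empty voxels produce tokens, so the sequence length follows the sparsity of the scan rather than the size of the full grid. The resulting tokens, with the voxel index serving as a positional signal, would then be fed to the transformer backbone.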