mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

Why does LiDAR-only segmentation perform better than fusion segmentation? #264

Closed Rivendell2898 closed 1 year ago

Rivendell2898 commented 1 year ago

As a quick exploration, I tried camera-only, LiDAR-only, and fusion segmentation on both the mini dataset and a 1/10 subset of the full training set, and I found something strange in the map/mean/iou@max results: camera-only segmentation performed worst, and LiDAR-only segmentation performed best, even better than fusion segmentation. Here are the results when training on the 1/10 subset:

- camera-only segmentation: map/mean/iou@max = 0.2376
- LiDAR-only segmentation: map/mean/iou@max = 0.3083
- fusion segmentation: map/mean/iou@max = 0.3051

This result is confusing. Could the reason be that the dataset I used is too small? Thanks a lot!!!
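For concreteness, the single-modality runs were launched with commands along these lines. This is only a sketch: the exact config file names under `configs/nuscenes/seg/` and the `-np` process count are assumptions on my side, so please check the repository for the exact paths.

```bash
# LiDAR-only BEV segmentation (config name assumed; check configs/nuscenes/seg/)
torchpack dist-run -np 1 python tools/train.py \
    configs/nuscenes/seg/lidar-centerpoint-bev128.yaml

# Camera-only BEV segmentation, initializing the backbone from the Swin-T
# nuImages checkpoint (config name assumed; check configs/nuscenes/seg/)
torchpack dist-run -np 1 python tools/train.py \
    configs/nuscenes/seg/camera-bev256d2.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
```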

Rivendell2898 commented 1 year ago

I found `--load_from pretrained/lidar-only-det.pth` in the BEVFusion detection model training command. Should I add the same flag when training the BEVFusion segmentation model, like this: `torchpack dist-run -np 1 python tools/train.py configs/nuscenes/seg/fusion-bev256d2-lss.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/lidar-only-det.pth`? Thank you!

kentang-mit commented 1 year ago

It's not a good idea to load a pretrained detection model as the initialization for segmentation models. Please refer to our training commands in README.md to reproduce our results. By the way, I do not recommend using a downsampled dataset to compare results from different methods. The main reason is that our fusion model uses quite heavy data augmentation; downsampling the dataset means you will have fewer iterations, and the model may not converge very well.
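For reference, the fusion segmentation command in README.md only initializes the camera backbone and does not use `--load_from`. It should look roughly like the sketch below; the `-np` process count and pretrained paths are my recollection of the README, so please verify them there.

```bash
# Fusion BEV segmentation: only the camera backbone is initialized from a
# pretrained checkpoint; no detection checkpoint is loaded via --load_from.
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/seg/fusion-bev256d2-lss.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
```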

Rivendell2898 commented 1 year ago

Thanks a lot for your reply! I had loaded the LiDAR-only segmentation model I trained myself (not a detection model) to further train the BEVFusion segmentation model. I found there is no need to load the LiDAR model when training the BEVFusion segmentation model, but I do need to add `--load_from pretrained/lidar-only-det.pth` when training the BEVFusion detection model. Why is there a difference between training the BEVFusion segmentation model and the BEVFusion detection model?
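For contrast, the fusion detection training command in the README does start from the LiDAR-only detection checkpoint via `--load_from`, in addition to the camera backbone initialization. A rough sketch is below; the exact detection config path is my best recollection and may differ in your checkout.

```bash
# Fusion detection: the whole model is initialized from the LiDAR-only
# detection checkpoint (--load_from), plus the camera backbone checkpoint.
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
    --load_from pretrained/lidar-only-det.pth
```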

kentang-mit commented 1 year ago

Good point. According to our experiments, the detection models on nuScenes are more prone to overfitting than the segmentation models. Therefore, if you start directly from a multi-modal model (which contains far more parameters than the LiDAR-only model), there is a big risk that you will overfit in the middle of the training process. You could use multi-modal data augmentation to alleviate such overfitting (e.g. PointAugmenting and MoCa are very good papers in this direction), but at the end of the day this will complicate the training schedule.

Rivendell2898 commented 1 year ago

Yes! If I load the LiDAR-only model and train the fusion model, it only needs 3 or 4 epochs before it starts overfitting. Thanks for your kind reply.