Rivendell2898 closed 1 year ago
I found --load_from pretrained/lidar-only-det.pth
in the BEVFusion detection model training command. Should I add the same flag when training the BEVFusion segmentation model, like this:
torchpack dist-run -np 1 python tools/train.py configs/nuscenes/seg/fusion-bev256d2-lss.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/lidar-only-det.pth
Thank you!
It's not a good idea to load a pretrained detection model as the initialization for segmentation models. Please refer to our training commands in README.md to reproduce our results. By the way, I do not recommend using a downsampled dataset to compare results across methods. The main reason is that our fusion model uses quite heavy data augmentation; downsampling the dataset means you will have fewer training iterations, and the model may not converge well.
Thanks a lot for your reply! I loaded the LiDAR-only segmentation model I trained myself (not the detection model) to further train the BEVFusion segmentation model.
I found that there is no need to load the LiDAR model when training the BEVFusion segmentation model, but I do need to add --load_from pretrained/lidar-only-det.pth
to train the BEVFusion detection model. Why is there a difference between training the BEVFusion segmentation model and the BEVFusion detection model?
Good point. According to our experiments, the detection models on nuScenes are more prone to overfitting than the segmentation models. Therefore, if you start from a multi-modal model (which contains far more parameters than the LiDAR-only model), there is a big risk of overfitting in the middle of the training process. You could use multi-modal data augmentation to alleviate such overfitting (e.g., PointAugmenting and MoCa are very good papers in this direction), but at the end of the day this complicates the training schedule.
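Conceptually, initializing the fusion model from the LiDAR-only checkpoint via --load_from copies only the parameter keys that exist in both, while the new camera-branch parameters keep their fresh initialization. A minimal sketch of that idea using plain dicts as stand-ins for state_dicts (this is an illustration, not the actual mmcv/mmdet checkpoint-loading code):

```python
def load_from(model_state, checkpoint_state):
    """Copy checkpoint weights into the model for every matching key.

    Keys present only in the model (e.g. the camera branch of a fusion
    model initialized from a LiDAR-only checkpoint) keep their random init.
    """
    loaded, missing = [], []
    for key in model_state:
        if key in checkpoint_state:
            model_state[key] = checkpoint_state[key]
            loaded.append(key)
        else:
            missing.append(key)
    return loaded, missing

# Toy fusion model: LiDAR and head keys overlap the checkpoint, camera keys do not.
fusion = {"encoders.lidar.w": 0.0, "encoders.camera.w": 0.0, "head.w": 0.0}
lidar_ckpt = {"encoders.lidar.w": 1.5, "head.w": 2.5}

loaded, missing = load_from(fusion, lidar_ckpt)
print(loaded)   # keys initialized from the LiDAR-only checkpoint
print(missing)  # camera-branch keys left at their fresh initialization
```

This mirrors the effect of PyTorch's load_state_dict with strict=False: overlapping weights are restored, everything else trains from scratch.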
Yes! If I load the LiDAR-only model and train the fusion model, it only takes 3 or 4 epochs before it starts overfitting. Thanks for your kind reply.
I have tried camera-only, LiDAR-only, and fusion segmentation on both the mini dataset and 1/10 of the full training set as a quick exploration, and found something strange: for map/mean/iou@max, camera-only segmentation performed worst and LiDAR-only segmentation best, even better than fusion segmentation. Here are the results when training on 1/10 of the dataset:
camera-only segmentation: map/mean/iou@max = 0.2376
LiDAR-only segmentation: map/mean/iou@max = 0.3083
fusion segmentation: map/mean/iou@max = 0.3051
The results are confusing. Could the reason be that the dataset I used is too small? Thanks a lot!