mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

Reproduction problem. #166

Closed · yinjunbo closed this 2 years ago

yinjunbo commented 2 years ago

Thanks for your nice work. Which config or command should be used to train the fusion model (`configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml`)?

yinjunbo commented 2 years ago

As far as I understand, the final fusion model can be trained end-to-end with:

```bash
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
```

Looking forward to your reply, thanks!

kentang-mit commented 2 years ago

Hi @yinjunbo,

As I have mentioned in other issues, we adopt a two-stage training pipeline: the LiDAR-only model is trained first, and then we load its weights and finetune the camera+LiDAR BEVFusion model. I believe the current code release provides enough details for researchers in this field to easily reproduce our camera+LiDAR BEVFusion results.
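
For reference, the two stages look roughly like this; the config and checkpoint paths below are illustrative and may differ from your local setup:

```bash
# Stage 1: train the LiDAR-only detector (config path is illustrative).
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/lidar/voxelnet_0p075.yaml

# Stage 2: finetune the camera+LiDAR fusion model, initializing the camera
# backbone from the nuImages-pretrained Swin-T and loading the stage-1
# LiDAR-only weights (checkpoint path is illustrative).
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
    --load_from pretrained/lidar-only-det.pth
```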

Best, Haotian

jiapeng789 commented 2 years ago

> As I have mentioned in other issues, we adopt a two-stage training pipeline: the LiDAR-only model is trained first, and then we load its weights and finetune the camera+LiDAR BEVFusion model. I believe the current code release provides enough details for researchers in this field to easily reproduce our camera+LiDAR BEVFusion results.

My understanding is: for the camera+LiDAR fusion detection model, the LiDAR-only detection model is trained first, and its weights are then finetuned in the camera+LiDAR BEVFusion model; for the camera+LiDAR fusion segmentation model, the camera-only segmentation model is trained first, and its weights are then finetuned in the camera+LiDAR BEVFusion model. Is that correct? Thank you.

yinjunbo commented 2 years ago

> As I have mentioned in other issues, we adopt a two-stage training pipeline: the LiDAR-only model is trained first, and then we load its weights and finetune the camera+LiDAR BEVFusion model. I believe the current code release provides enough details for researchers in this field to easily reproduce our camera+LiDAR BEVFusion results.

Thanks for your quick reply. When finetuning the camera+LiDAR BEVFusion model, do we need to pretrain a camera-only detection model first, or is directly using the nuImages-pretrained model sufficient for the fusion model?

kentang-mit commented 2 years ago

Hi @yinjunbo,

No, you do not need to pretrain on the camera-only task. At least in my previous experiments, using the nuImages-pretrained model that we released gives better performance.

Best, Haotian

yinjunbo commented 2 years ago

> No, you do not need to pretrain on the camera-only task. At least in my previous experiments, using the nuImages-pretrained model that we released gives better performance.

Following your advice, I trained the fusion model initialized from your pretrained LiDAR-only and camera-only models and obtained the following results, which are about 2 points lower than the README (68.85 mAP and 71.38 NDS). Is there something I missed? Looking forward to your reply. @kentang-mit

```
mAP:  0.6628
mATE: 0.2782
mASE: 0.2567
mAOE: 0.2984
mAVE: 0.2367
mAAE: 0.1877
NDS:  0.7056
```

kentang-mit commented 2 years ago

Hi @yinjunbo,

There are several things you can try.

First, use tools/test.py to evaluate the model after training instead of reading the results directly from the training log. This will give you better (and actually correct) numbers. I'm currently not sure why the numbers logged during training differ from the results given by tools/test.py after my recent refactoring of the code.
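
The evaluation command looks roughly like this (the checkpoint path is illustrative):

```bash
# Evaluate the trained fusion detector offline with tools/test.py
# (replace the checkpoint path with your own training output).
torchpack dist-run -np 8 python tools/test.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    runs/convfuser/latest.pth --eval bbox
```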

Second, you can tune your learning rate schedule and the data augmentations; I think we have released enough details about them. Please make sure you do not turn on GT-paste augmentation during BEVFusion (C+L) training, as it will hurt performance: the GT-paste augmentation is not synchronized between the LiDAR and camera inputs.

Third, I'm not sure whether I understood your comment correctly, but I suggest not initializing from the pretrained camera-only 3D detection model; instead, initialize from the 2D detection model pretrained on nuImages.

Best, Haotian

RG2806 commented 2 years ago

Hey @kentang-mit, can you clarify which augmentation you are referring to that should be turned off during fusion training?

yinjunbo commented 2 years ago

> There are several things you can try.
>
> First, use tools/test.py to evaluate the model after training instead of reading the results directly from the training log. This will give you better (and actually correct) numbers. I'm currently not sure why the numbers logged during training differ from the results given by tools/test.py after my recent refactoring of the code.
>
> Second, you can tune your learning rate schedule and the data augmentations; I think we have released enough details about them. Please make sure you do not turn on GT-paste augmentation during BEVFusion (C+L) training, as it will hurt performance: the GT-paste augmentation is not synchronized between the LiDAR and camera inputs.
>
> Third, I'm not sure whether I understood your comment correctly, but I suggest not initializing from the pretrained camera-only 3D detection model; instead, initialize from the 2D detection model pretrained on nuImages.

Hi @kentang-mit, thanks for your kind reply.

  1. I've tried tools/test.py, and it does improve the results (by less than 0.5 points).
  2. I didn't turn on GT augmentation, since the default gt_paste_stop_epoch is set to -1. By the way, I notice the default learning rate of 1e-4 is set for 8 GPUs with a batch size of 4 per GPU. Do we need to tune it further? And is there any other augmentation strategy that needs to be turned off besides GT-AUG?
  3. I have tried both the camera-only 3D detection model and the camera-only 2D detection model pretrained on nuImages in my experiments, and they give similar results, e.g., 66.28 mAP vs. 66.30 mAP.

Could you please share your training logs (for both the LiDAR-only and the fusion model) so that I can find the reason for my lower performance?

kentang-mit commented 2 years ago

Hi @yinjunbo,

The learning rate and the finetuning schedule should be tuned, and GT augmentation is not used during the training of the camera+LiDAR fusion models, as I stated in other issues. As for the training log, I'm afraid I cannot share it publicly at the moment; I would appreciate it if you could understand that I have some concerns about it right now. However, I will make the training configurations public at the end of November.
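
A rough starting point for the learning-rate tuning is the linear scaling rule relative to the default schedule (8 GPUs x batch 4 = total batch 32 at lr 1e-4). The `--optimizer.lr` override in the sketch below is an assumption that the dotted CLI overrides also cover the optimizer section; editing the YAML config directly achieves the same thing:

```bash
# Example: 4 GPUs x batch 4 = total batch 16, so halve the default lr
# (linear scaling rule). All paths and the --optimizer.lr flag are
# illustrative assumptions, not verified commands.
torchpack dist-run -np 4 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
    --load_from pretrained/lidar-only-det.pth \
    --optimizer.lr 5.0e-5
```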

Best, Haotian

yinjunbo commented 2 years ago

> The learning rate and the finetuning schedule should be tuned, and GT augmentation is not used during the training of the camera+LiDAR fusion models, as I stated in other issues. As for the training log, I'm afraid I cannot share it publicly at the moment; I would appreciate it if you could understand that I have some concerns about it right now. However, I will make the training configurations public at the end of November.

Got it. Thanks for your time!

kentang-mit commented 2 years ago

No problem. I'm closing it temporarily. Feel free to reopen if you have further questions.