mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

About nuscenes mini dataset v1.0-mini #339

Closed. South-River closed this issue 1 year ago.

South-River commented 1 year ago

After 6 epochs of training under the default configs (camera + lidar, det, batch size changed to 1), the test result is:

mAP: 0.4010
mATE: 0.4349
mASE: 0.4707
mAOE: 0.5300
mAVE: 0.5276
mAAE: 0.2990
NDS: 0.4743
Eval time: 2.1s

I want to know whether this result is correct, since it seems far worse than the model in readme.md. I also tried training for longer, but 6 epochs seems to be enough for it to converge; the results haven't changed after 6 epochs.
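(Setup note for anyone reproducing this on v1.0-mini: the mini split first has to be converted into the info .pkl files the configs expect. A rough sketch, assuming tools/create_data.py in this repo keeps the upstream mmdetection3d interface; check its arguments before running:

python tools/create_data.py nuscenes \
    --root-path ./data/nuscenes \
    --out-dir ./data/nuscenes \
    --extra-tag nuscenes \
    --version v1.0-mini

This should produce nuscenes_infos_train.pkl / nuscenes_infos_val.pkl built from the 10 mini scenes, which the default det configs then pick up.)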

antragoudaras commented 1 year ago

Hey @South-River, can you contact me at this email: antonistragoudaras@gmail.com? I am trying to experiment with the v1.0-mini nuScenes dataset as well, and I wanted to ask you a few questions about how you set up training/testing for v1.0-mini, and whether you tried to just evaluate the pretrained model with its respective config. I cannot even run test.py successfully (torchpack dist-run -np 1 -v python tools/test.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox). Maybe the problem is that I only have an RTX 3050 Ti with 4 GB of VRAM. I browsed the issues and the creators/authors have not tested the v1.0-mini dataset, so you may have to experiment on your own to find the optimal training configs in the respective yaml files.

South-River commented 1 year ago

I only managed to run visualize.py and train.py. I tried to run test.py but ran into a few problems as well. Since you can get these test results from train.py too, I didn't try to fix those problems. The authors only released pretrained models, so I think going through train.py is necessary anyway; I can get these results from train.py and save its screen output for future use, so I haven't tried test.py any further.

kentang-mit commented 1 year ago

I'm sorry that I never carried out experiments on v1.0-mini. But in our experiments, the fusion model is always finetuned from a pre-trained LiDAR-only model. If you just train the fusion model from scratch for 6 epochs, it is not going to converge to a satisfactory accuracy.
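Concretely, the two-stage recipe looks roughly like this (the config paths and checkpoint names below follow the training commands in readme.md; please double-check them there rather than copying blindly):

# Stage 1: train a LiDAR-only detector (or use the released LiDAR-only checkpoint)
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/lidar/voxelnet_0p075.yaml

# Stage 2: finetune the camera+LiDAR fusion model from that checkpoint
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
    --load_from pretrained/lidar-only-det.pth

Here --load_from should point at the stage-1 (or downloaded) LiDAR-only weights; on v1.0-mini you would also scale down -np and the batch size to match your hardware.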

zehaoj commented 1 year ago

@South-River Hi, could you share how you produced this result? I'm also trying on the mini dataset: I first trained lidar-only det for 20 epochs, then lidar+camera det for 6 epochs using their pretrained camera checkpoint. However, I can only get NDS 0.26 on the mini dataset. Thanks in advance!

South-River commented 1 year ago

Sorry, I only tried camera+lidar det. Simply run the command in readme.md. The only thing I changed is the batch size, because I only have a single 3090:

samples_per_gpu: 2
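(To be concrete about the edit, here is an illustrative excerpt; the file that actually defines samples_per_gpu depends on the config chain, so search for the key rather than trusting this path:

# configs/nuscenes/default.yaml -- illustrative excerpt, path not guaranteed
samples_per_gpu: 2   # lowered from the default so the fusion model fits on one 3090

Alternatively, since train.py merges extra --key value arguments into the yaml config, the same mechanism as the --model.encoders... override in readme.md, you may be able to pass the batch size override on the command line instead of editing the file.)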

Hope it is useful to you.

zehaoj commented 1 year ago

@South-River Thanks for your quick reply! I'll give it a try now:)

hasaikeyQAQ commented 1 year ago

When I train with the instructions mentioned in readme.md, using the command

torchpack dist-run -np 2 python tools/train.py configs/nuscenes/seg/fusion-bev256d2-lss.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth

the output appears to be stuck. I have tried changing the Shapely library version to 1.8.0 and adding the environment variable CUDA_LAUNCH_BLOCKING=1 to force CUDA synchronization, but neither has helped (my current Shapely version is 1.8.5). I think this issue may be related to the following factors: my torchpack version is 0.3.1, which may have a compatibility issue, and I am using the Slurm job management system, which may cause the process to be killed. Do you have any other suggestions or solutions? If you need more information, please let me know. Thank you for your help!

kentang-mit commented 1 year ago

Hi @hasaikeyQAQ,

We have not experimented with Slurm before, and our repo is built on top of MPI, so it could be hard for me to debug Slurm-related problems. Is it possible for you to add a -v flag to your launch command and see what errors it returns?
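For example, something like this (your command from above with the verbose flag added, so the per-process output and any tracebacks are printed):

torchpack dist-run -np 2 -v python tools/train.py \
    configs/nuscenes/seg/fusion-bev256d2-lss.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth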

Best, Haotian

hasaikeyQAQ commented 1 year ago

Dear Haotian,

I have received your reply and thank you for your help.

Best regards