South-River closed this issue 1 year ago
Hey @South-River, could you contact me at antonistragoudaras@gmail.com? I am trying to experiment with the nuScenes v1.0-mini dataset as well, and I wanted to ask you a few questions about how you set up training/testing for v1.0-mini, and whether you tried simply evaluating the pretrained model with its respective config. I cannot even run test.py successfully (`torchpack dist-run -np 1 -v python tools/test.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox`). Maybe the problem is that I only have an RTX 3050 Ti with 4 GB of VRAM. I browsed the issues, and the creators/authors have not tested the v1.0-mini dataset, so you may have to experiment on your own to find the optimal training settings in the respective YAML files.
I only managed to run visualize.py and train.py. I tried to run test.py but ran into a few problems as well. However, you can get the same evaluation results from train.py, so I didn't try to fix those problems. Since the authors only released pretrained models, I think going through train.py is a must. I can get the results from train.py and save its screen output for future use, so I haven't tried test.py further.
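A minimal sketch of saving train.py's screen output for later use, as described above. The config path is the detection config mentioned earlier in this thread; `tee` mirrors the terminal output (including the periodic evaluation results) to a file:

```shell
# Train and keep a copy of everything printed to the terminal,
# including the evaluation results logged during training.
torchpack dist-run -np 1 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    2>&1 | tee train_log.txt
```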
I'm sorry, but I never carried out experiments on v1.0-mini. In our experiments, however, the fusion model is always fine-tuned from a pre-trained LiDAR-only model. If you just train the fusion model from scratch for 6 epochs, it is not going to converge to a satisfactory accuracy.
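To illustrate the two-stage recipe described above, here is a hedged sketch of a fusion fine-tuning launch in the style of the repo's README; the exact checkpoint filenames (`lidar-only-det.pth`, etc.) depend on which pretrained models you downloaded, so treat them as placeholders and check the README for the authoritative command:

```shell
# Fine-tune the camera+LiDAR fusion model from a pre-trained
# LiDAR-only detector (loaded via --load_from), rather than
# training the fusion model from scratch.
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
    --load_from pretrained/lidar-only-det.pth
```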
@South-River Hi, could you share how you produced this result? I'm also experimenting on the mini dataset: I first trained the LiDAR-only detector for 20 epochs, then LiDAR+camera detection for 6 epochs using their pretrained camera checkpoint. However, I can only get an NDS of 0.26 on the mini dataset. Thanks in advance!
Sorry, I only tried camera+LiDAR detection. Simply run the command in readme.md. The only thing I changed is the batch size, because I only have a single 3090, so I use
samples_per_gpu: 2
Hope it is useful to you.
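For context, a hedged sketch of what this change looks like in the repo's YAML configs; the surrounding keys are illustrative (check your own config for the exact location of `samples_per_gpu`):

```yaml
# Reduce the per-GPU batch size so training fits on a single consumer GPU.
samples_per_gpu: 2
workers_per_gpu: 4   # illustrative; keep whatever value your config already uses
```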
@South-River Thanks for your quick reply! I'll give it a try now:)
When I use the instructions mentioned in readme.md for training, with the command:
torchpack dist-run -np 2 python tools/train.py configs/nuscenes/seg/fusion-bev256d2-lss.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
the output appears to be stuck.
I have tried changing the Shapely library version to 1.8.0 and adding the environment variable CUDA_LAUNCH_BLOCKING=1 to force CUDA synchronization, but neither has helped. (My current Shapely version is 1.8.5.) I think this issue may be related to the following factors: my torchpack version is 0.3.1, which may have a compatibility issue; in addition, I am using the Slurm job management system, which may cause the process to be killed. Do you have any other suggestions or solutions? If you need me to provide more information, please let me know. Thank you for your help!
Hi @hasaikeyQAQ,
We have not experimented with Slurm before, and our repo is developed upon MPI, so it could be hard for me to debug Slurm-related problems. Is it possible for you to add a -v flag to your launch command and see what errors it returns?
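Concretely, @hasaikeyQAQ's launch command from above with the verbose flag added would look like this:

```shell
# Same command as before, with -v added so torchpack surfaces
# worker errors instead of appearing to hang silently.
torchpack dist-run -np 2 -v python tools/train.py \
    configs/nuscenes/seg/fusion-bev256d2-lss.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
```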
Best, Haotian
Dear Haotian,
I have received your reply and thank you for your help.
Best regards
After 6 epochs of training under the default configs (camera + LiDAR detection, batch size changed to 1), the test result is:
I want to know whether this result is correct, since it seems far worse than the model in readme.md. I also tried training longer, but it seems 6 epochs is enough for it to converge; the results haven't changed after 6 epochs.