mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

Training interrupted without reason #366

Closed · autumn-wong closed this issue 1 year ago

autumn-wong commented 1 year ago

Thanks for your excellent work! We have run into some problems during reproduction. We use 3 RTX 3060 12G GPUs, so we adjusted:

- the batch size to 1 per GPU (in `nuscenes/default.yaml`: `samples_per_gpu: 1`, `workers_per_gpu: 1`), for a total batch size of 3;
- the learning rates to 1/10 of the defaults:
  - `configs/nuscenes/det/centerhead/lssfpn/default.yaml`, line 22: `lr: 2.0e-5`
  - `configs/nuscenes/det/transfusion/secfpn/default.yaml`, line 35: `lr: 1.0e-5`
  - `configs/nuscenes/det/transfusion/secfpn/camera+lidar/default.yaml`, line 60: `lr: 2.0e-5`

Then we trained the C+L fusion model directly with `torchpack dist-run -np 3 python tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/lidar-only-det.pth`.

Training succeeded on the nuScenes mini dataset, where we reached an NDS of 54.04. However, when training on the complete nuScenes dataset, the run always stops during the 6th epoch with no errors or warnings; the best result, from the 5th epoch, is 69.99. Is there any advice on how to finish the training? Can we not train the fusion model directly?
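For reference, the overrides above look roughly like this when written out (paths and values are the ones listed in this post; the `samples_per_gpu` spelling and the `optimizer:` nesting are assumed from the usual mmdet-style config layout and may not match the repo exactly):

```yaml
# configs/nuscenes/default.yaml -- per-GPU batch size (3 GPUs x 1 sample = total batch of 3)
samples_per_gpu: 1
workers_per_gpu: 1
---
# configs/nuscenes/det/centerhead/lssfpn/default.yaml (around line 22)
optimizer:
  lr: 2.0e-5   # reduced to 1/10 of the repo default
---
# configs/nuscenes/det/transfusion/secfpn/default.yaml (around line 35)
optimizer:
  lr: 1.0e-5   # reduced to 1/10 of the repo default
---
# configs/nuscenes/det/transfusion/secfpn/camera+lidar/default.yaml (around line 60)
optimizer:
  lr: 2.0e-5   # reduced to 1/10 of the repo default
```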

(training log screenshots attached)
kentang-mit commented 1 year ago

I think a very likely reason for the hang is that our model runs out of memory on your device. A quick way to check is to add a `-v` flag to your torchpack launch command; if there is any OOM, you will be notified when the program gets stuck. I vaguely remember that 12 GB of memory might not be enough to train our fusion model with batch size 1.

gerardmartin2 commented 5 months ago

Hello, something similar happens to me in epoch 5, and I doubt it has to do with memory (I have tracked it and it does not seem to grow more than expected). I have reduced the lr to 1/10 of the original (since my batch size is 3: 3 GPUs, 1 sample per GPU), both in the optimizer config and in the `min_lr_ratio` of `lr_config`. I have also tried changing the lr at the start of the 5th epoch, but it seems that any lr slows down training a lot. Any ideas on how to solve this?
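A rough sketch of the two overrides described above (key names follow the usual mmcv-style config layout; the numbers are placeholders illustrating the 1/10 scaling, not the repo's actual defaults):

```yaml
# Both learning-rate knobs mentioned above, scaled down together.
# Numbers are illustrative placeholders, not the repo defaults.
optimizer:
  lr: 2.0e-5            # original lr reduced to 1/10 (total batch size of 3)
lr_config:
  min_lr_ratio: 1.0e-4  # likewise reduced, as described in the comment above
```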