mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

train interrupt in the end of the first epoch #381

Closed. Surtr07 closed this issue 1 year ago.

Surtr07 commented 1 year ago

I am training the model on 3 RTX 3090 GPUs. The first epoch finishes after about 10 hours, but training then gets stuck at the end of that epoch, before the checkpoint is saved. I use this command to train:

torchpack dist-run -np 3 python tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/lidar-only-det.pth

[Screenshots attached]

kentang-mit commented 1 year ago

Hi @Surtr07,

Sorry for the delayed response; I have been busy with other projects recently. Judging from the snapshots you provided, it does not look like you ran into an OOM problem, and the GPUs still show high occupancy. You could add the -v flag to your command and check whether any error messages appear on screen when the program freezes. You could also try skipping evaluation on the validation set by setting this parameter to a very large value. Let me know if you have further information.
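
To illustrate why a very large value effectively skips validation, here is a minimal Python sketch. It assumes the evaluation hook fires when (epoch + 1) is divisible by the interval, which is how interval-based hooks commonly work; it is not the repository's actual code. The names should_evaluate and evaluation_epochs, and the 6-epoch schedule, are hypothetical and chosen only for the example.

```python
from typing import List


def should_evaluate(epoch: int, interval: int) -> bool:
    """Return True if evaluation would run at the end of zero-based `epoch`.

    Assumption: the hook triggers when (epoch + 1) is a multiple of
    `interval`. This is a simplified stand-in, not the real hook.
    """
    return interval > 0 and (epoch + 1) % interval == 0


def evaluation_epochs(max_epochs: int, interval: int) -> List[int]:
    """List the one-based epochs after which evaluation would run."""
    return [e + 1 for e in range(max_epochs) if should_evaluate(e, interval)]


if __name__ == "__main__":
    # Hypothetical 6-epoch schedule, evaluating after every epoch.
    print(evaluation_epochs(6, 1))        # [1, 2, 3, 4, 5, 6]
    # Same schedule with a very large interval: evaluation never triggers.
    print(evaluation_epochs(6, 10**6))    # []
```

If training gets past the end of the first epoch with evaluation disabled this way, the freeze is probably in the validation step rather than in checkpoint saving.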

Best, Haotian