Surtr07 closed this issue 1 year ago
Hi @Surtr07,
Sorry for the delayed response; I have been busy with other projects recently. Judging from the snapshots you provided, it does not look like you ran into an OOM (out-of-memory) problem, and the GPUs still seem to have high occupancy. You could add the -v flag to your command and check whether any error messages are printed when the program freezes. You could also try skipping the evaluation on the validation set by setting this parameter to a very large value. Let me know if you have further information.
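If the verbose output shows nothing useful, one further option is to have each training process print its Python stack traces on demand, which can show where a frozen rank is waiting. The following is only a minimal sketch using Python's standard faulthandler module. The helper name enable_stack_dump_on_signal is hypothetical, and calling it near the top of tools/train.py is an assumption, not something the repository does by default. It relies on SIGUSR1, so it works on Unix-like systems only.

```python
import faulthandler
import signal
import sys


def enable_stack_dump_on_signal(log_file=None):
    """Dump the tracebacks of all Python threads when SIGUSR1 arrives.

    Hypothetical helper for debugging a frozen run; not part of the
    training code discussed in this thread. `log_file` must be an open
    file with a real file descriptor and must stay open for as long as
    the handler is registered. It defaults to sys.stderr.
    """
    target = sys.stderr if log_file is None else log_file
    # faulthandler writes directly to the file descriptor, so the dump
    # works even if the Python interpreter is blocked inside a C call.
    faulthandler.register(signal.SIGUSR1, file=target, all_threads=True)
```

Once the run appears frozen, sending the signal to each training process (for example `kill -USR1 <pid>`) should write the current stack of every thread to the chosen file without stopping the process.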
Best, Haotian
I am training the model on 3 RTX 3090 GPUs. The first epoch finishes after about 10 hours, but training then gets stuck at the end of that epoch, before the checkpoint is saved. I use this command to train:
torchpack dist-run -np 3 python tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/lidar-only-det.pth