Hi @funnymean, this issue is due to GPU memory. You can solve it by reducing the batch size or using a GPU with more memory. I think this will solve your issue. If you still run into problems, please let me know.
I noticed something. At the start of epoch 0 (around 3%), GPU memory usage was 6357 MiB. By 59% of epoch 0 it had risen to 9263 MiB, and it keeps increasing until training crashes (around epoch 10). I don't know whether this is normal.
Hi @funnymean ,
Try with batch=4
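If it helps, here is a minimal sketch of where the batch size is typically set when building the training dataloader with super-gradients' YOLO-format helpers. The paths and class names below are placeholders, not values from your train.py:

```python
from super_gradients.training import dataloaders

# Placeholder dataset layout -- substitute the values from your own train.py.
train_data = dataloaders.coco_detection_yolo_format_train(
    dataset_params={
        "data_dir": "/path/to/dataset",
        "images_dir": "images/train",
        "labels_dir": "labels/train",
        "classes": ["class_0", "class_1"],
    },
    dataloader_params={
        "batch_size": 4,    # smaller batches lower the peak GPU memory per step
        "num_workers": 2,
    },
)
```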
I followed the steps and used the conda environment you provided, but I get some errors during training. Please help me!
Train epoch 10: 47%|████▋ | 3459/7368 [18:21<20:45, 3.14it/s, PPYoloELoss/loss=1.63, PPYoloELoss/loss_cls=0.811, PPYoloELoss/loss_dfl=0.731, PPYoloELoss/los
[2023-11-14 01:14:50] INFO - base_sg_logger.py - [CLEANUP] - Successfully stopped system monitoring process
[2023-11-14 01:14:50] ERROR - sg_trainer_utils.py - Uncaught exception
Traceback (most recent call last):
File "/home/un/Code/YOLO-NAS/train.py", line 239, in
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 1361, in train
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 442, in _train_epoch
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 475, in _get_losses
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/losses/ppyolo_loss.py", line 774, in forward
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/losses/ppyolo_loss.py", line 517, in forward
File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/losses/ppyolo_loss.py", line 230, in gather_topk_anchors
RuntimeError: CUDA out of memory. Tried to allocate 3.55 GiB (GPU 0; 10.75 GiB total capacity; 2.56 GiB already allocated; 3.40 GiB free; 5.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
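As the error message itself suggests, when reserved memory is much larger than allocated memory you can also try capping the allocator's split size to reduce fragmentation. A minimal sketch, assuming the variable is set before the first CUDA allocation (the 128 MiB value is only an example, not a recommendation from this repo):

```python
import os

# Must run before the first CUDA tensor is allocated (e.g. at the very top of train.py).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value, tune as needed
```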