naseemap47 / YOLO-NAS

Train and Inference your custom YOLO-NAS model by Single Command Line
Apache License 2.0

memory err when training #51

Closed funnymean closed 1 year ago

funnymean commented 1 year ago

I followed the steps and used the conda environment you provided, but I get some errors during training. Please help me!


Train epoch 10: 47%|████▋ | 3459/7368 [18:21<20:45, 3.14it/s, PPYoloELoss/loss=1.63, PPYoloELoss/loss_cls=0.811, PPYoloELoss/loss_dfl=0.731, PPYoloELoss/los

[2023-11-14 01:14:50] INFO - base_sg_logger.py - [CLEANUP] - Successfully stopped system monitoring process

[2023-11-14 01:14:50] ERROR - sg_trainer_utils.py - Uncaught exception

Traceback (most recent call last):
  File "/home/un/Code/YOLO-NAS/train.py", line 239, in
    trainer.train(
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 1361, in train
    train_metrics_tuple = self._train_epoch(context=context, silent_mode=silent_mode)
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 442, in _train_epoch
    loss, loss_log_items = self._get_losses(outputs, targets)
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 475, in _get_losses
    loss = self.criterion(outputs, targets)
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/losses/ppyolo_loss.py", line 774, in forward
    assigned_labels, assigned_bboxes, assigned_scores = self.assigner(
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/losses/ppyolo_loss.py", line 517, in forward
    is_in_topk = gather_topk_anchors(alignment_metrics * is_in_gts, self.topk, topk_mask=pad_gt_mask)
  File "/home/un/mambaforge/envs/yolo-nas/lib/python3.9/site-packages/super_gradients/training/losses/ppyolo_loss.py", line 230, in gather_topk_anchors
    is_in_topk = torch.nn.functional.one_hot(topk_idxs, num_anchors).sum(dim=-2).type_as(metrics)
RuntimeError: CUDA out of memory. Tried to allocate 3.55 GiB (GPU 0; 10.75 GiB total capacity; 2.56 GiB already allocated; 3.40 GiB free; 5.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

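The error message itself points at one mitigation: when reserved memory is much larger than allocated memory, the caching allocator may be fragmenting, and `PYTORCH_CUDA_ALLOC_CONF` can limit the block split size. A minimal sketch of setting it from Python (the 128 MiB value is only an illustrative starting point, not a number from this thread):

```python
import os

# The allocator reads PYTORCH_CUDA_ALLOC_CONF lazily, so set it before the
# first CUDA allocation (easiest: before importing torch). Tune the value
# for your GPU; 128 is just an example.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the env var on purpose

print(torch.cuda.is_available())
```

Equivalently, the variable can be exported in the shell before launching train.py.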

naseemap47 commented 1 year ago

Hi @funnymean, this issue is due to GPU memory. You can solve it by reducing the batch size or using a GPU with more memory. I think this will solve your issue. If you still have problems, please let me know.
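Reducing the batch size lowers peak GPU memory roughly in proportion. If you are calling super_gradients directly rather than going through the repo's train.py, the batch size is set in `dataloader_params`; a minimal sketch, with placeholder paths and class names:

```python
# Sketch only: assumes you build the dataloader yourself with super_gradients.
from super_gradients.training.dataloaders.dataloaders import coco_detection_yolo_format_train

train_loader = coco_detection_yolo_format_train(
    dataset_params={
        "data_dir": "path/to/dataset",      # placeholder
        "images_dir": "train/images",       # placeholder
        "labels_dir": "train/labels",       # placeholder
        "classes": ["class_a", "class_b"],  # placeholder
    },
    dataloader_params={
        "batch_size": 4,   # lower this until the OOM disappears
        "num_workers": 2,
    },
)
```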

funnymean commented 1 year ago

I noticed a phenomenon. In epoch 0, when training started (around 3%), GPU memory usage was about 6357 MiB; by 59% of epoch 0 it had risen to about 9263 MiB, and it keeps increasing until training crashes (around epoch 10). I don't know whether this behaviour is normal or not.
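Some growth in what nvidia-smi reports is expected, because PyTorch's caching allocator holds on to freed blocks; the more telling signal is whether *allocated* memory keeps rising. A small helper (illustrative only, not part of the repo) to log both between iterations:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # "allocated" = memory held by live tensors; "reserved" = memory cached
    # by PyTorch's allocator (roughly what nvidia-smi shows for the process).
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated={allocated:.0f} MiB  reserved={reserved:.0f} MiB  peak={peak:.0f} MiB")

# Example: call every N iterations inside the training loop.
# log_gpu_memory("epoch0_iter100")
```

If allocated memory climbs steadily across iterations, something is holding tensor references (e.g. accumulating losses without `.item()`); if only reserved memory grows, it is mostly the allocator cache.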

naseemap47 commented 1 year ago

Hi @funnymean, try with batch=4.