Hi, try reducing the batch size; that may solve the problem.
@mtjhl Even with a batch size of 2, training only gets through 8 epochs at most.
@mtjhl It seems the error was my mistake.
I was training on a custom dataset with a single class/category, which should have been labeled 0 in the annotations, but the tool I used to create the annotations had set the class index to 15. After changing it from 15 to 0, the error went away.
Sorry about that.
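In case it helps anyone else, here is a minimal sketch of that kind of relabeling for YOLO-format .txt annotations, assuming the class index is the first token on each line; the labels directory and the 15 → 0 remap are placeholders for your own setup:

```python
from pathlib import Path

# Minimal sketch: assumes YOLO-format .txt labels, one object per line in the
# form "class_id x_center y_center width height", and that every class id the
# annotation tool wrote as 15 should actually be 0.
LABEL_DIR = Path("dataset/labels/train")  # placeholder path
OLD_ID, NEW_ID = "15", "0"

for label_file in LABEL_DIR.glob("*.txt"):
    fixed_lines = []
    for line in label_file.read_text().splitlines():
        parts = line.split()
        if parts and parts[0] == OLD_ID:
            parts[0] = NEW_ID
        fixed_lines.append(" ".join(parts))
    label_file.write_text("\n".join(fixed_lines) + "\n")
```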
You saved my life; I had exactly the same error :) Thanks, buddy
@ilkin94 Bless up. Glad I could help!
By annotations, do you mean the .json file or the .txt file?
Hi,
I'm having trouble starting a training run on Google Colab. I'm running on a Tesla P100 GPU, so I assumed memory would not be an issue with a 416 px image size, a batch size of 1, and similarly low settings.
But I keep getting the following error. Any insight into what the issue could be?
Epoch  iou_loss  l1_loss  obj_loss  cls_loss
  0% 0/1630 [00:00<?, ?it/s]
/usr/local/lib/python3.7/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
  0% 0/1630 [00:00<?, ?it/s]
ERROR in training steps.
ERROR in training loop or eval/save model.
Training completed in 0.000 hours.
Traceback (most recent call last):
  File "/content/YOLOv6/yolov6/models/loss.py", line 114, in __call__
    num_classes
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/content/YOLOv6/yolov6/models/loss.py", line 299, in get_assignments
    ).sum(-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/content/YOLOv6/yolov6/core/engine.py", line 75, in train
    self.train_in_loop()
  File "/content/YOLOv6/yolov6/core/engine.py", line 88, in train_in_loop
    self.train_in_steps()
  File "/content/YOLOv6/yolov6/core/engine.py", line 105, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets)
  File "/content/YOLOv6/yolov6/models/loss.py", line 123, in __call__
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py", line 114, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
During handling of the above exception, another exception occurred:
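For anyone debugging a similar device-side assert on Colab, the log's suggestion to pass CUDA_LAUNCH_BLOCKING=1 can be applied before launching training so the assert is raised at the kernel that actually failed (here it traced back to the out-of-range class index). This is only a sketch: the working directory comes from the traceback above, while the tools/train.py entry point and any extra flags are assumptions that may differ between YOLOv6 versions.

```python
import os
import subprocess

# Run CUDA kernels synchronously so a device-side assert surfaces at the
# operation that actually failed, not at a later, unrelated API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Launch training as a child process; it inherits the environment variable.
# Append your usual training arguments (dataset, config, batch size, ...).
subprocess.run(
    ["python", "tools/train.py"],  # entry point assumed; adjust to your setup
    cwd="/content/YOLOv6",
    check=True,
)
```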