meituan / YOLOv6

YOLOv6: a single-stage object detection framework dedicated to industrial applications.
GNU General Public License v3.0

Memory Issue on Google Colab #281

Closed: n-albert closed this issue 2 years ago

n-albert commented 2 years ago

Hi,

I'm having trouble starting a training run on Google Colab. I'm running it on one of the Tesla P100 GPUs, so I would assume that memory would not be an issue at a 416 px image size, a batch size of 1, and similarly low settings.

But I seem to be getting the following type of error. Any insight on what could be the issue?

Epoch  iou_loss  l1_loss  obj_loss  cls_loss
  0% 0/1630 [00:00<?, ?it/s]
/usr/local/lib/python3.7/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
  0% 0/1630 [00:00<?, ?it/s]
ERROR in training steps.
ERROR in training loop or eval/save model.

Training completed in 0.000 hours.
Traceback (most recent call last):
  File "/content/YOLOv6/yolov6/models/loss.py", line 114, in __call__
    num_classes
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/content/YOLOv6/yolov6/models/loss.py", line 299, in get_assignments
    ).sum(-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/YOLOv6/yolov6/core/engine.py", line 75, in train
    self.train_in_loop()
  File "/content/YOLOv6/yolov6/core/engine.py", line 88, in train_in_loop
    self.train_in_steps()
  File "/content/YOLOv6/yolov6/core/engine.py", line 105, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets)
  File "/content/YOLOv6/yolov6/models/loss.py", line 123, in __call__
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py", line 114, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 112, in <module>
    main(args)
  File "tools/train.py", line 102, in main
    trainer.train()
  File "/content/YOLOv6/yolov6/core/engine.py", line 81, in train
    self.train_after_loop()
  File "/content/YOLOv6/yolov6/core/engine.py", line 190, in train_after_loop
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py", line 114, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [2,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [3,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [4,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:174 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f7cb7ce67d2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x267df7a (0x7f7d0ad05f7a in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0x301898 (0x7f7d6d0ff898 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #3: c10::TensorImpl::release_resources() + 0x175 (0x7f7cb7ccf005 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1edf69 (0x7f7d6cfebf69 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4e5818 (0x7f7d6d2e3818 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x299 (0x7f7d6d2e3b19 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #7: python3() [0x5a97d4]
frame #8: python3() [0x5656cf]
frame #9: python3() [0x536b92]
frame #10: python3() [0x5a9afc]
frame #11: python3() [0x4fa7d8]
frame #12: python3() [0x4fa7ec]
frame #13: python3() [0x561a71]
frame #17: python3() [0x64f939]
frame #19: __libc_start_main + 0xe7 (0x7f7d71806c87 in /lib/x86_64-linux-gnu/libc.so.6)
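
As the message in the log suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 makes the stack trace point at the call that actually triggered the device-side assert, which helps tell a real OOM apart from something like an indexing error. A minimal sketch of doing that from Python (nothing here is YOLOv6-specific; the variable just has to be set before CUDA is initialized):

```python
# Minimal sketch: set CUDA_LAUNCH_BLOCKING before CUDA is initialized so kernel
# launches run synchronously and the device-side assert is reported at the
# offending call. Alternatively, export the variable in the shell before
# launching tools/train.py.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import (and any CUDA work) must come after the env var is set
```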
mtjhl commented 2 years ago

Hi, try reducing the batch size; that may solve the problem.
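
If it does look like a genuine memory problem, a quick way to see how much GPU memory the Colab runtime actually has free is a generic PyTorch check like the sketch below (this is not part of YOLOv6 and assumes a PyTorch version that provides torch.cuda.mem_get_info):

```python
import torch

# Generic PyTorch memory check, not YOLOv6-specific. mem_get_info() reports
# free/total device memory; memory_allocated()/memory_reserved() report what
# this process currently holds through the caching allocator.
free_b, total_b = torch.cuda.mem_get_info()
print(f"free:      {free_b / 1024**3:.2f} GiB")
print(f"total:     {total_b / 1024**3:.2f} GiB")
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")
```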

Shibaditya99 commented 2 years ago

@mtjhl Even with a batch size of 2, training only gets through about 8 epochs at most.

n-albert commented 2 years ago

@mtjhl It seems that the error was due to my mistake.

The issue was that I was training on a custom dataset with a single class/category, which should have been labeled 0 in the annotations, but the tool I used to create the annotations had set the class index to 15. After changing 15 to 0, the error went away.

Sorry about that.
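
For anyone hitting the same assert: in YOLO-format .txt labels, the class index is the first field of each line and must lie in [0, num_classes), which matches the ScatterGatherKernel "index out of bounds" assertion above. A small sketch that scans a labels folder for out-of-range indices and remaps them (the paths, class count, and remap table below are placeholders for your own dataset):

```python
from pathlib import Path

# Placeholders: point these at your own dataset and mapping.
LABELS_DIR = Path("dataset/labels/train")
NUM_CLASSES = 1
REMAP = {15: 0}  # old class index -> new class index

for label_file in LABELS_DIR.glob("*.txt"):
    fixed_lines = []
    for line in label_file.read_text().splitlines():
        parts = line.split()
        if not parts:
            continue
        cls = int(float(parts[0]))  # first field of a YOLO label line is the class index
        if not 0 <= cls < NUM_CLASSES:
            print(f"{label_file.name}: out-of-range class {cls}")
            cls = REMAP.get(cls, cls)  # remap if a mapping is provided
        fixed_lines.append(" ".join([str(cls)] + parts[1:]))
    label_file.write_text("\n".join(fixed_lines) + "\n")
```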

ilkin94 commented 2 years ago

@mtjhl It seems that the error was due to my mistake.

The issue was that I was training on a custom dataset with a single class/category, which should have been labeled 0 in the annotations, but the tool I used to create the annotations had set the class index to 15. After changing 15 to 0, the error went away.

Sorry about that.

You saved my life, I had exactly the same error :) Thanks buddy

n-albert commented 2 years ago

@ilkin94 Bless up. Glad I could help!

geekdreamer04 commented 2 years ago

@mtjhl It seems that the error was due to my mistake.

The issue was that I was training on a custom dataset with a single class/category, which should have been labeled 0 in the annotations, but the tool I used to create the annotations had set the class index to 15. After changing 15 to 0, the error went away.

Sorry about that.

By annotations, do you mean the .json file or the .txt file?