ricardo-zzz opened 5 years ago
It is the GPU; Tiny YOLO trained fine.
Darknet trains on varying image sizes and reshuffles the image size every 10 batches. The image size will be 32 * (10 + x), where x is in 0..9, i.e. anywhere from 320x320 up to 608x608. The problem is that for your batch size (number of images per batch), when the image is resized to 608x608 (the maximum), you run out of memory on your GPU device. The reason Tiny YOLO works for you is that it requires less memory. You also use a different configuration file for it, so its parameters may differ.
Possible solutions :
- Increase the subdivisions parameter in the configuration file. It lets you keep the same effective batch size for the backward pass while splitting each batch into smaller sub-batches that fit in memory.
- Reduce the batch size - that will affect the training though, so better use 1.
- Increase the GPU memory.
Obviously 1. is the best solution, even though it might slightly affect your training speed.
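As an illustration of option 1, a `[net]` section of a yolov3-style `.cfg` might be adjusted like this (the values are illustrative, not a recommendation for every GPU):

```ini
[net]
# Keep the effective batch size for the weight update.
batch=64
# Raise subdivisions so each forward/backward pass only holds
# batch/subdivisions = 4 images in GPU memory at a time.
subdivisions=16
width=416
height=416
```

With subdivisions=16, gradients are still accumulated over the full batch of 64 before the update, so the training dynamics stay the same; only the peak memory per pass drops.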
So my GPU memory is 2000 MB. If I add a second one, should it work?
@DeepNoob if the max size is 608 x 608, what happens if I set the input size in the config to 768 x 768? What sizes would be used?
Is there any way of knowing the right configuration for a given GPU memory size? For example, I got the CUDA out-of-memory error after 9k iterations.
Input size in configuration: 608x608; batch size: 24; subdivisions: 24; GPU: GeForce GTX 1060 6 GB.
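The multi-scale behaviour described above can be sketched in a few lines of Python. This is a rough model, not darknet code: it lists the sizes `32 * (10 + x)` that darknet may pick every 10 batches, and assumes activation memory grows roughly with the squared side length to estimate the peak-vs-minimum memory ratio.

```python
# Sketch of darknet's multi-scale training sizes (random=1):
# every 10 batches a new network size 32 * (10 + x), x in 0..9, is chosen.
def multiscale_sizes():
    return [32 * (10 + x) for x in range(10)]

sizes = multiscale_sizes()
print(sizes)  # 320, 352, ..., 608

# Rough assumption: activation memory scales with width * height,
# so the 608x608 peak needs about (608/320)^2 times the memory of 320x320.
peak_factor = (max(sizes) / min(sizes)) ** 2
print(round(peak_factor, 2))
```

This is why a run can look fine for thousands of iterations and then die at a "Resizing" step: the process only hits the 608x608 peak occasionally, and that is the moment memory runs out.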
I'm trying to train YOLOv3 with darknet on a Gigabyte GTX 950 GPU, but I ALWAYS get a CUDA error after a few dozen iterations:
```
Region 82 Avg IOU: 0.081827, Class: 0.586256, Obj: 0.507294, No Obj: 0.642290, .5R: 0.000000, .75R: 0.000000, count: 3
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.495840, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.414020, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.519027, Class: 0.642831, Obj: 0.572576, No Obj: 0.644372, .5R: 1.000000, .75R: 0.000000, count: 1
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.495159, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.415090, .5R: -nan, .75R: -nan, count: 0
20: 497.899017, 660.693054 avg, 0.000000 rate, 11.404367 seconds, 1280 images
Resizing
608
CUDA Error: out of memory
darknet: ./src/cuda.c:36: check_error: Assertion 0 failed.
Aborted (core dumped)
```
Here is the nvidia-smi output right before the error and just after it: https://paste.ofcode.org/4jTNdhQFVpsTWrkVRkqB5C It seems the memory usage from ./darknet started at 1075 MiB, then jumped to 1230 MiB, then to 1658 MiB, and then it failed.
I installed CUDA and OpenCV, then compiled darknet with:

```
GPU=1 CUDNN=0 OPENCV=0 OPENMP=0 DEBUG=0
```
on a fresh Ubuntu 16.04 install. I skipped the CUDNN installation; what is it for? Could it be the cause of the CUDA error? I tried several combinations of batch and subdivisions, but here is my last take (copy attached):
```
batch=64
subdivisions=64
width=320
height=320
```
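One thing worth trying on a low-memory card (an assumption based on the multi-scale behaviour described earlier in this thread, not something confirmed here): setting `random=0` in each `[yolo]` layer of the `.cfg` disables the every-10-batches resizing, so the network stays at the fixed `[net]` width/height instead of occasionally growing to 608x608.

```ini
# Illustrative fragment of a yolov3-style .cfg; other layer settings omitted.
[yolo]
# ... anchors, classes, etc. ...
# random=1 enables multi-scale training (resizes up to 608x608);
# random=0 keeps the fixed [net] width/height, trading some robustness
# to object scale for a stable, lower memory footprint.
random=0
```

This would explain the log above: the crash happens exactly at the "Resizing / 608" step, which only occurs when multi-scale training is enabled.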
yolov3 (copy).cfg.txt. The other files and the dataset come from here: https://www.learnopencv.com/training-yolov3-deep-learning-based-custom-object-detector/
The command I'm using to run darknet is:

```
./darknet detector train /home/ricardo/Documents/PythonDN/darknet.data /home/ricardo/Documents/PythonDN/darknet-yolov3.cfg ./darknet53.conv.74 -gpus 0
```
Am I doing something wrong, or is my GPU simply too weak? Why is the memory usage increasing over time?
THANK YOU
edit: wrong cfg file, more details