
CUDA Error: out of memory - GTX950 #1620

Open ricardo-zzz opened 5 years ago

ricardo-zzz commented 5 years ago

I'm trying to train YOLOv3 with darknet on a Gigabyte GTX 950 GPU, but I always get a CUDA error after a few dozen iterations:

```
Region 82 Avg IOU: 0.081827, Class: 0.586256, Obj: 0.507294, No Obj: 0.642290, .5R: 0.000000, .75R: 0.000000, count: 3
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.495840, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.414020, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.519027, Class: 0.642831, Obj: 0.572576, No Obj: 0.644372, .5R: 1.000000, .75R: 0.000000, count: 1
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.495159, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.415090, .5R: -nan, .75R: -nan, count: 0
20: 497.899017, 660.693054 avg, 0.000000 rate, 11.404367 seconds, 1280 images
Resizing
608
CUDA Error: out of memory
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)
```

Here is the nvidia-smi output right before the error and just after it: https://paste.ofcode.org/4jTNdhQFVpsTWrkVRkqB5C. The memory usage from ./darknet started at 1075 MiB, then jumped to 1230 MiB, then to 1658 MiB, and then it failed.

I installed CUDA and OpenCV, then compiled darknet with GPU=1 CUDNN=0 OPENCV=0 OPENMP=0 DEBUG=0 on a fresh Ubuntu 16.04 install. I skipped the cuDNN installation; what is it for, and could that be the cause of the CUDA error?
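For clarity, these are the flags at the top of darknet's Makefile for this build (only GPU is switched on from the defaults; the rest of the Makefile is left untouched):

```makefile
# darknet Makefile build flags used here (cuDNN intentionally disabled)
GPU=1       # compile with CUDA support
CUDNN=0     # no cuDNN
OPENCV=0    # OpenCV support off inside darknet itself
OPENMP=0
DEBUG=0
```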

I tried several combinations of batch and subdivisions, but here is my latest attempt (copy attached): batch=64, subdivisions=64, width=320, height=320. yolov3 (copy).cfg.txt
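For reference, the relevant lines of the [net] section in that cfg look like this (a sketch showing only the values mentioned above; everything else is left as in the stock yolov3.cfg):

```
[net]
batch=64          # images per weight update
subdivisions=64   # 64/64 = 1 image on the GPU per sub-batch
width=320
height=320
```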

The other files and the dataset come from here: https://www.learnopencv.com/training-yolov3-deep-learning-based-custom-object-detector/

The command I'm using to run darknet is: `./darknet detector train /home/ricardo/Documents/PythonDN/darknet.data /home/ricardo/Documents/PythonDN/darknet-yolov3.cfg ./darknet53.conv.74 -gpus 0`

Am I doing something wrong, or is my GPU simply too weak? Why does the memory usage increase over time?

THANK YOU

edit: wrong cfg file, more details

ricardo-zzz commented 5 years ago

It is the GPU; Tiny YOLO trained fine.

DeepNoob commented 5 years ago

Darknet trains on varying image sizes and reshuffles the image size every 10 batches. The image size will be 32 * (10 + x) where x is in 0..9, i.e. anywhere from 320x320 up to 608x608. The problem is that with your batch size (number of images per batch), when the images are resized to 608x608 (the maximum), you run out of memory on your GPU. The reason Tiny YOLO works for you is that it requires less memory; it also uses a different configuration file, so its parameters may differ.
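A minimal sketch of that resize schedule (assuming the usual random-resize behaviour in darknet's detector training loop; the function and variable names here are illustrative, not darknet's own):

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the schedule described above: every 10 batches darknet
 * picks a new square network input size of 32 * (10 + x), with x
 * drawn from 0..9, i.e. 320, 352, ..., 608 pixels. */
static int pick_network_dim(void)
{
    int x = rand() % 10;      /* x in 0..9 */
    return 32 * (10 + x);     /* 320 .. 608 in steps of 32 */
}

int main(void)
{
    srand(1234);
    for (int batch = 0; batch < 60; batch += 10) {
        int dim = pick_network_dim();
        printf("batch %3d -> resize network input to %dx%d\n", batch, dim, dim);
    }
    return 0;
}
```

The 608x608 case is the peak memory consumer, which is why training can run fine for a while and only abort when the resize happens to land on the largest size (as in the log above: "Resizing / 608" right before the out-of-memory error).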

Possible solutions:

  1. Increase the subdivisions parameter in the configuration file. This keeps the same batch size for the backward pass but splits each batch into smaller sub-batches, so fewer images sit in GPU memory at once (see the sketch below).
  2. Reduce the batch size. That will affect training, though, so option 1 is preferable.
  3. Use a GPU with more memory.

Obviously option 1 is the best solution, even though it might slightly affect your training speed.
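As an illustration of option 1 (values are illustrative only, not taken from any cfg in this thread), keeping batch=64 and raising subdivisions halves the number of images held on the GPU per forward/backward pass:

```
[net]
batch=64
# subdivisions=16   # 64/16 = 4 images per sub-batch -> more GPU memory
subdivisions=32     # 64/32 = 2 images per sub-batch -> less GPU memory
```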

ricardo-zzz commented 5 years ago


So my GPU memory is 2000 MB. If I add a second GPU, should it work then?

Mr-Optimistic commented 4 years ago

@DeepNoob if the maximum size is 608 x 608, what happens if I set the input size in the config to 768 x 768? What sizes would be considered?

Is there any way of knowing the right configuration for a given amount of GPU memory?

For example, I got the CUDA out-of-memory error after 9k iterations with this setup:

Image input size in the configuration: 608x608; batch size: 24; subdivisions: 24; GPU: GeForce GTX 1060 6 GB