kbardool / Keras-frcnn

Keras Implementation of Faster R-CNN
Apache License 2.0

Allocator (GPU_0_bfc) ran out of memory #57

Open 3ST4R opened 4 years ago

3ST4R commented 4 years ago

I'm trying to train my dataset of 3471 images, with resolutions ranging from 640x480 to 1024x768, on my GPU (NVIDIA GTX 860M, 2 GB). At the start of the training process, I get these log messages:

Epoch 1/500
2019-12-22 13:25:02.125644: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2019-12-22 13:25:03.040613: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows. Relying on driver to perform ptx compilation. This message will be only logged once.
2019-12-22 13:25:03.327949: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:03.426295: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:03.468325: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2019-12-22 13:25:03.739910: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:03.935164: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:04.181340: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:04.382404: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 148.88MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:04.513756: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:04.813110: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:04.839631: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:05.063777: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.03GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-22 13:25:24.117452: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll

I've also tried to lower the num_rois to 1, but still the log messages persist.

I also checked Task Manager for GPU usage (screenshot: GPU Usage).

It seems that the memory threshold locks at 80% of GPU memory, preventing the allocator from allocating memory beyond that.

Q1. Can we increase the memory threshold so that the allocator can effectively use all of the memory, rather than only 80% of it?

Q2. What could be the cause of these log messages appearing even when num_rois is set to 1?

EDIT: Answer to Q1: Windows 10 allows the allocator used by TensorFlow only 81% of the GPU memory. Try running the model on Windows 7, which allows 95% of the GPU memory to be allocated to TensorFlow.
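For what it's worth, you can also control how TensorFlow grabs GPU memory from inside the training script. A minimal sketch, assuming a TF 1.x backend with standalone Keras (the setup this repo targets); the exact import path may vary with your Keras version, and it should run before any model is built:

    # Minimal sketch, assuming TF 1.x + standalone Keras.
    import tensorflow as tf
    from keras.backend.tensorflow_backend import set_session

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True                      # allocate on demand instead of one big block up front
    # config.gpu_options.per_process_gpu_memory_fraction = 0.9  # or cap the fraction explicitly
    set_session(tf.Session(config=config))

This does not add memory, but it may reduce the fragmentation warnings above, since the BFC allocator no longer reserves one huge region at startup.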

Janzeero-PhD commented 4 years ago

The same. So even num_rois = 1 will not help (I tried 16). My GPU is an NVIDIA RTX 2060. Totally disappointed: I cannot even run the blood cells example, let alone train on my own dataset.

3ST4R commented 4 years ago

The same. So even num_rois = 1 will not help (I tried 16). My GPU is an NVIDIA RTX 2060. Totally disappointed: I cannot even run the blood cells example, let alone train on my own dataset.

After an hour of banging my head against the internet, I found out that this model is too heavy to run well on gaming GPUs (someone tested the model on 8x RTX 2080 Ti and still got about 500 seconds per epoch with num_rois=32).

IMO a possible solution is to dig into the model and reduce the depth and number of convolution layers, which would obviously mean decreased accuracy. Or load pre-trained weights from a similar model and fine-tune it; that might help.

Also, please comment if you find a way to resolve the problem...

Janzeero-PhD commented 4 years ago

With num_rois = 4, I could train my model for 30 epochs with epoch_length = 250, or 10 epochs (maybe 15) with epoch_length = 500. With the default epoch_length = 1000? Maybe 5-10 epochs at most! So I never even get to the point where training takes 5, 7, 10 or 20 hours: I am not even able to run 100 epochs with epoch_length = 250. It makes 10-20 steps of the first epoch and then stops. Completely stops. So I could only spend 30-40 minutes training 30 epochs at epoch_length = 250 and got 85% classification accuracy on my training set. The test score is much worse, of course, and I cannot be sure whether that is because of the low number of epochs or because of my small dataset. I guess both problems contribute...
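For reference, a rough command line for this kind of run. The flag names below may differ slightly between forks of this repo, so check the options defined at the top of train_frcnn.py; the annotation path is a placeholder, and epoch_length (steps per epoch) appears to be hardcoded inside train_frcnn.py rather than exposed as a flag, so "epoch_length = 250" means editing that value in the script:

    # Hypothetical example; verify flag names against your fork's train_frcnn.py.
    python train_frcnn.py -p ./annotations.txt -o simple --num_rois 4 --num_epochs 30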

3ST4R commented 4 years ago

Have you tried to load pre-trained weights into the model?

Janzeero-PhD commented 4 years ago

Weights of what kind? I downloaded the ResNet weights. I guess they were trained on the MS COCO dataset, probably (see the documentation). I saved the file into the keras-frcnn folder of this repo. Or do you mean training the model using the weights of my previously trained model on my custom dataset, to try it again? :)

3ST4R commented 4 years ago

When running train_frcnn.py, stdout shows "Could not load pre-trained weights". I downloaded the ResNet50 weights for TensorFlow and manually loaded them into the model, but I get an error that the weight file has more/fewer layers than the model.

Did you run into the same thing? Any tips on it?

P.S.: I'm also working on a custom dataset (focused on detecting and recognizing car damage).

Janzeero-PhD commented 4 years ago

At first, I ran my model without the downloaded weights; I assumed the model downloaded them during that run. Then I decided to check, downloaded the weights, and saved them in the repo folder. Nothing changed in terms of training speed or accuracy. You can easily find the file by typing its name into Google; the first GitHub result is the one I downloaded.

3ST4R commented 4 years ago

Then I decided to check, downloaded the weights, and saved them in the repo folder. Nothing changed in terms of training speed or accuracy.

Because the weights are not getting loaded into the model.

If you take a close look at resnet.py, the first function is the one that is supposed to supply the weights. Instead of downloading the weight file and loading it into the model, it just returns a string (a file name).
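For context, in most forks the helper at the top of keras_frcnn/resnet.py looks roughly like this; note that it only returns a file name and never downloads or loads anything (K here is the Keras backend, imported at the top of resnet.py):

    def get_weight_path():
        # Returns only the expected file name of the ResNet50 "notop" weights;
        # the file itself must already exist on disk for loading to succeed.
        if K.image_dim_ordering() == 'th':
            return 'resnet50_weights_th_dim_ordering_th_kernels_notop.h5'
        else:
            return 'resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'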

When train_frcnn.py calls that function, it then fails to load the weights.

For that, I modified the function (screenshot: sample)

And added a few lines of code in train_frcnn.py (screenshot: sample2)

But it throws an error that the layers in the weight file and in the model don't match. I don't know how to get past this error, which is costing me training time on my dataset.
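One thing that may get around the layer-count error: Keras can load a weight file partially when you pass by_name=True, so layers that exist only in the Faster R-CNN model (the RPN and classifier heads) are simply skipped. A rough sketch, assuming the model variables are named as in train_frcnn.py and that the downloaded ResNet50 notop file sits next to the script:

    # Rough sketch: partial weight loading by layer name.
    # `model_rpn` and `model_classifier` are assumed to be the models built in train_frcnn.py.
    weights_path = 'resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'  # assumed file name/location
    model_rpn.load_weights(weights_path, by_name=True)         # loads only layers whose names match
    model_classifier.load_weights(weights_path, by_name=True)  # extra detection layers stay randomly initialized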

Janzeero-PhD commented 4 years ago

Hmmm... I guess that on my extra-small dataset this Faster R-CNN would not have reached even 85% classification accuracy on the training set if the weights had not been loaded in some way. And I don't think these weights are meant to decrease training time; they are meant to provide low-level patterns of objects in the images to bootstrap our training. So I suppose these weights are always loaded, because otherwise there would be a lot of issues about that in this repo.

HakimFerchichi96 commented 4 years ago

I have the same error. I'm using a GTX 1050 with 433 pictures at 299x299.