tensorflow / models

Models and examples built with TensorFlow

Out Of Memory when training on Big Images #1817

Closed pjeambrun closed 6 years ago

pjeambrun commented 7 years ago

Out Of Memory when training on Big Images

System Information

Describe the Problem

I have successfully run the pets tutorial on this Google Compute Engine instance. When I train a Faster R-CNN ResNet-101 on my dataset (VOC format, 47 classes, image size: 1000/2000) with:

python object_detection/train.py --train_dir=data_xxxx --pipeline_config_path=data_xxxx/faster_rcnn_resnet101_pets.config

I get the following error at the beginning of the training:

2017-06-29 17:24:13.193833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:0b.0)
2017-06-29 17:24:15.414228: I tensorflow/core/common_runtime/simple_placer.cc:675] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Restoring parameters from /home/ubuntu/models/data_xxxx/model.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path data_doliprane/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
[1]    4359 killed     python object_detection/train.py --train_dir=data_xxxx 

I managed to avoid the OOM on this dataset by resizing all the images and annotation files (divided the dimensions by 4).

I didn't modify the config file (just the number of classes and the paths), so the images should be resized to 600/1024 and the bug should not occur with the big images.

Is there a way I can train on my images without having to shrink them? Are there some parameters I can tune to avoid this problem?

derekjchow commented 7 years ago

Do you know which component is hitting OOM (CPU or GPU)? If CPU bound, it might be worth shrinking the queue sizes a little. The default values we use for queue_capacity and min_after_dequeue are 2000/1000. Try shrinking them to 1000/500 or 500/250 and see if that solves your OOM problems.

The section in your new config will look like this:

train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"
  queue_capacity: 500
  min_after_dequeue: 250
}
pjeambrun commented 7 years ago

Thank you very much for your answer @derekjchow. You have put me on the right track: it is indeed a RAM issue. I monitored the RAM during training and we were reaching the 7.5G available. Even with your modification it is still not working. I resized the machine to 16 CPUs and 104G of RAM, and it worked. But the training is very greedy: it uses 84G of RAM at the beginning and stabilizes around 87G after a few pool_size_limit_ raises as it goes through the first iterations:

2017-06-30 10:29:21.269187: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2110 get requests, put_count=2063 evicted_count=1000 eviction_rate=0.484731 and unsatisfied allocation rate=0.507109
2017-06-30 10:29:21.269322: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 256 to 281

I tried to modify the queue_capacity and min_after_dequeue to 500/250, 100/50 and even 2/1 but it doesn't seem to impact the RAM usage during the training.

Do you know which parameters I have to tweak to decrease the RAM usage?

oscarorti commented 7 years ago

Hello @derekjchow, I have the same problem as @pjeambrun and I'm training locally with 8 cores and 16GB. I also changed those parameters and nothing changed, even when trying to train with 2 images, one for each label. Is it possible to manage the number of threads or the amount of RAM?

pls help..

thaiat commented 7 years ago

Same issue for me, any help would be awesome. As a comparison, running py-faster-rcnn does not require that much RAM.

andreapiso commented 7 years ago

Same issue here. I tried resizing images and queues, but my 32GB of RAM gets consumed immediately. How can I reduce the RAM usage? SSD with MobileNet should not require so much.

mgparada commented 7 years ago

Hi @pjeambrun, I'm dealing with the same issue... I can't find a way to limit the RAM, and the R-CNN consumes everything. Can you help us? I'm trying to retrain the object detection model with my own images, but I didn't find any solution. I monitored the RAM while the training starts, but after 30 seconds the kernel kills the process. I also tried to train the net with 1 image, and the same happens. Probably the script is loading the whole net into RAM and this could be the problem, couldn't it?

I was trying to find something to limit the resources but I didn't find anything..

mgparada commented 7 years ago

Hi again guys, we have found a solution: changing the batch_size to one. By default this parameter is set to 32, which probably needs too much RAM. I don't understand why it consumes such an extreme amount of RAM, but you can change this and train a model in a normal environment.

If @pjeambrun knows something about that, we will appreciate any information.
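For anyone looking for where that setting lives, here is a minimal sketch of the relevant part of the pipeline config (assuming an otherwise stock SSD config; only batch_size is changed):

train_config: {
  batch_size: 1
  ...
}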

pjeambrun commented 7 years ago

Hi @mgparada, nothing new on my side. I also think the problem is coming from the input reader, but if you take a look at its config file there are no other parameters relevant to RAM usage. For now the only solution I have is to manually reduce the image sizes and annotation files to make the training work on a smaller Google Compute instance.

jch1 commented 7 years ago

I got a bit confused about whether people are training SSD or Faster R-CNN on this thread, but what I will say is that the batch_size in our API refers to batch size per worker, so in typical Faster R-CNN training scenarios, where you resize an image so that its smallest dimension is 600 pixels, the largest batch size possible is no greater than 1 or 2. With our default SSD configs, we resize to 300x300 or so, and in this case we train with a larger batch size, usually ranging between 16 and 32.
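For reference, these are roughly the image_resizer blocks the two scenarios above correspond to in the sample configs (a sketch; exact values vary between config files):

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}

image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 300
  }
}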

andreapiso commented 6 years ago

@jch1 for MobileNet SSD even batch size 1 consumes way more RAM than it should. On my machine the 32 GB of RAM get consumed immediately (alongside 20GB of swap). There really should not be a need for so much CPU RAM when training on GPU...

jch1 commented 6 years ago

@AndreaPisoni - I'm pretty surprised that this is happening as I don't see the same behavior when I train. Can you say what size you resized the images to?

Arykelton commented 6 years ago

It was the same for me. Has anyone managed to solve the problem?

System: Ubuntu 16.04 64-bit, Intel Core i7, 12 GB RAM, 2 GB NVIDIA GPU, TensorFlow installed via pip


andreapiso commented 6 years ago

@jch1 Here is something I really do not understand. Why do I need to resize the input images if I already have a low-ish input size (300x300)? The pipeline, I assumed, should be: read the image, resize it down to 300x300, then enqueue it for batching.

I do not want to resize images "a priori" as the images have small objects inside of them. I assumed this was the pipeline, but maybe it is not? Do the images get enqueued at full resolution?

pjeambrun commented 6 years ago

Hi @jch1, thank you for your answer.

An easy way to reproduce this issue: follow the "Training an object detector using Cloud Machine Learning Engine" tutorial (locally), but with bigger images, around 1400/1950 for instance. This should consume more than 25G of RAM even with very low queue_capacity/min_after_dequeue parameters.

System Information: Google Compute Engine

drpngx commented 6 years ago

@pjeambrun so you're getting OOM even with batch size of 1?

drpngx commented 6 years ago

BTW, jemalloc has a memory heap profiler. Could you try that and see where the memory goes?

pjeambrun commented 6 years ago

Hi @drpngx, the batch size is left at its default value from faster_rcnn_resnet101_pets.config:

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {

And the training is still consuming too much RAM:

I did these tests on a machine with a 100 G RAM limit.

When I launch the jemalloc heap profiling on the training with images of size 1400/1900, the whole 100 G of RAM are consumed during the initialization and the machine goes OOM.

MALLOC_CONF=prof_leak:true,lg_prof_sample:0,prof_final:true \
LD_PRELOAD=/usr/local/lib/libjemalloc.so.2 python object_detection/train.py --train_dir output_xxxx_jemalloc --pipeline_config_path data_xxxx/faster_rcnn_resnet101_pets.config         

The output

INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
thaiat commented 6 years ago

Could this have to do with the queue_capacity and min_after_dequeue parameters in https://github.com/tensorflow/models/blob/master/object_detection/protos/input_reader.proto?

aspilotros commented 6 years ago

Hi guys, same issue here.

OS Platform and Distribution: Linux Ubuntu 16.04 LTS
TensorFlow installed from: pip tensorflow-gpu
TensorFlow version: 1.3.0
CUDA/cuDNN version: CUDA 8.0, cuDNN 5.1
GPU model and memory: GeForce GTX 1080 Ti, Memory 11 G

I run the training locally using Faster R-CNN ResNet-101, starting from a pretrained network. My images are 1280 x 720 pixels. The training consumes all of my 8 GB of RAM plus 5 GB of my swap memory.

I already tried reducing the queue capacity and it did not work:

train_input_reader: {
  tf_record_input_reader {
    input_path: "/home/alessandro/Tf_GoogleAPI/models/object_detection/data/checkout_train.record"
  }
  label_map_path: "/home/alessandro/Tf_GoogleAPI/models/object_detection/data/checkout_label_map.pbtxt"
  queue_capacity: 500
  min_after_dequeue: 250
}

Here is the output of the first steps:

INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
/home/alessandro/Tf_GoogleAPI/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:95: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2017-08-23 11:44:44.463435: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-23 11:44:44.463448: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-23 11:44:44.463451: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-23 11:44:44.463453: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-23 11:44:44.463456: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-23 11:44:44.638905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-23 11:44:44.640117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:01:00.0
Total memory: 10.91GiB
Free memory: 10.49GiB
2017-08-23 11:44:44.751337: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x1435fa50 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-08-23 11:44:44.751608: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-23 11:44:44.751919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:02:00.0
Total memory: 10.91GiB
Free memory: 10.76GiB
2017-08-23 11:44:44.752536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1
2017-08-23 11:44:44.752542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y
2017-08-23 11:44:44.752545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y
2017-08-23 11:44:44.752549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-08-23 11:44:44.752552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0)
INFO:tensorflow:Restoring parameters from object_detection/models/checkout/model.ckpt-1672
2017-08-23 11:44:45.444359: I tensorflow/core/common_runtime/simple_placer.cc:697] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path object_detection/models/checkout/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 1674.
INFO:tensorflow:global step 1675: loss = 0.2660 (14.730 sec/step)

Has any solution emerged so far, other than resizing images? If not, what image size is "suggested"? Thank you for your help, AS

mtourne commented 6 years ago

Only 8GB of RAM here.

I haven't tried resizing the images yet, but I imagine resizing them all to the dimensions specified in the image_resizer part of the model would do the trick. In my scenario:

    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }

I've been playing with very low settings for the queue

  queue_capacity: 2
  min_after_dequeue: 1
  num_readers: 2

Would there be any side effects of doing that (besides losing the ability to shuffle the training samples)?

ArjanSchouten commented 6 years ago

System information

I had the same problem. I have 16 GB of RAM. When running train.py, RAM is 100% after starting the queue_runners. After this line in learning.py RAM is 100%.

After a long debugging session I noticed that the solution is very simple. People suggesting to change queue_capacity are only talking about reducing the filename queue capacity; that is just a queue of strings, which could not be the reason for the OOM. In train_config there are these values:

  1. batch_queue_capacity: default 600
  2. num_batch_queue_threads: default 8
  3. prefetch_queue_capacity: default 10

Shrinking the batch_queue_capacity was the solution for me. If you experiment a bit with that value you can find one that leaves some memory free, so you can run eval.py alongside train.py. I don't notice any performance loss, since the batch queue is way too large for me anyway (max_dimension is 1024).

My train_config looks like this now:

train_config: {
  batch_size: 1
  ...
  batch_queue_capacity: 150
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 10
}
hroser commented 6 years ago

Hi thank you very much, your discussion regarding the OOM problem helped me a lot. The train_config settings from @ArjanSchouten solved my problem.

But I hope someone can help me with another question: in the Google Cloud ML documentation I can't find details about the RAM configuration of the different machine types that are available:

standard, large_model, complex_model_s, complex_model_m, complex_model_l, standard_gpu, complex_model_m_gpu, complex_model_l_gpu

In https://cloud.google.com/ml-engine/docs/concepts/training-overview#machine_type_table it only gives the rough "t-shirt" sizes of CPU and GPU.

Does someone know the approximate RAM size of the machines? Even just an order of magnitude would help.

Or can someone tell me how to find out how much RAM (in absolute terms) is consumed by the master/workers? The Stackdriver monitor only shows percentage values...

Many thanks!

MossMcLaughlin commented 6 years ago

I had the same issue running a CNN on large image files (~10MB each). If I keep certain hyper-parameters low I can avoid the OOM (out of memory) error. As far as I can tell, what has an impact is:

batch_size (I have mine at 4).
number of filters in my conv net (I have 8, want to have more).
number of preprocess threads for creating and shuffling batches (I have 16, and am experimenting to see whether lowering this will decrease performance).

I have not noticed a change from varying my 'min_after_dequeue' or 'capacity' values (args of tf.train.shuffle_batch() ), please correct me here if you think I'm wrong.

Also were people able to resize their images to increase performance? I am not sure if resizing will help my model run better or not. It seems odd (as in slower and less accurate?) to decrease resolution before passing images through for convolution.
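For context, here is a minimal TF 1.x sketch of where those knobs sit in tf.train.shuffle_batch; the tensors and values below are illustrative, not taken from the post above:

import tensorflow as tf  # TF 1.x queue-based input pipeline

# Hypothetical decoded example; in a real pipeline this would come from a TFRecord reader.
image = tf.random_uniform([600, 1024, 3])
label = tf.constant(1)

images, labels = tf.train.shuffle_batch(
    [image, label],
    batch_size=4,           # smaller batches lower peak RAM
    num_threads=4,          # fewer preprocess threads means fewer in-flight examples
    capacity=64,            # total examples buffered in the queue
    min_after_dequeue=32)   # minimum kept around for shuffling; must stay below capacity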

tuobay commented 6 years ago

@ArjanSchouten Hi, I use the config file 'faster_rcnn_resnet101_voc07.config', which doesn't have the three lines you mentioned. So I added batch_queue_capacity: 150, num_batch_queue_threads: 8, prefetch_queue_capacity: 10 in the 'train_config' field as you say. However, the training process still consumes almost all of my GPU memory. Only 'num_classes' and the paths in the config file are modified. The batch_size is set to just 1; anything higher causes GPU OOM.

System information
What is the top-level directory of the model you are using: object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): 1.3.0 gpu
CUDA/cuDNN version: 8.0 / 6.0
GPU model and memory: Nvidia Tesla K80 12GB

Is there any way to reduce the GPU memory usage and increase the batch_size? pls help.. Many thanks!

ArjanSchouten commented 6 years ago

@tuobay it depends on various parameters. I don't think it is a bug, and it is not related to this issue; this issue is about RAM OOM, not VRAM OOM. You could probably post it on Stack Overflow with your complete config and information about the dataset (e.g. image size) you are using.

ghost commented 6 years ago

@tuobay What is the size of your images? The standard model does not handle images with more than 1000*1000 pixels on a standard GPU.

tuobay commented 6 years ago

@ArjanSchouten thank you for replying. @madekwe The image size is about 480*600. I have a big question: whatever model I select from the object_detection repo, the training process always consumes almost the same amount of GPU memory, which is nearly the whole GPU memory. I tried to modify several parameters, but it doesn't work. Besides, with a K80 GPU the process consumes about 11/12 GB of GPU memory, and with an M60 GPU it consumes about 7/8 GB. Whatever GPU and model I select, the training process always occupies almost all memory.

I will try https://github.com/tensorflow/models/issues/2703 .

ArjanSchouten commented 6 years ago

@tuobay it is normal that tensorflow is allocating nearly all available gpu memory. I suggested a configuration option in this PR for object_detection: https://github.com/tensorflow/models/pull/2341

You can indeed configure the per_process_gpu_memory_fraction which should work.
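For context, this is how that option is normally wired up in plain TF 1.x (a sketch of the general session mechanism, not of the object_detection trainer's internals):

import tensorflow as tf  # TF 1.x

# Cap this process at roughly 40% of the GPU memory instead of letting TF grab it all.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # build and run the graph as usual
    pass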

tuobay commented 6 years ago

@ArjanSchouten thank you a lot! I have tried https://github.com/tensorflow/models/issues/1854. However, it doesn't work. I added these lines in trainer.py:

session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
session_config.gpu_options.allow_growth = True
session_config.gpu_options.per_process_gpu_memory_fraction = 0.8

I also tried:

session_config = tf.ConfigProto(gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.2))

It still doesn't work.

ghost commented 6 years ago

@tuobay

try to reduce the batch size. My config:

batch_size: 1
batch_queue_capacity: 50

Your lines in trainer.py work fine for me.

oldsqlwnb commented 6 years ago

Try to pass your batch to a tensor before using it in your network. I had the same issue when I passed the labels_batch directly from the queue into the softmax_cross_entropy function. I used to pass flattened 2048*2048 binary labels into this function. With 8 GB of RAM it quickly ran out of memory, but once I passed the labels into a tf.Variable and used tf.reshape to flatten the image, the RAM usage was stable at around 3.2 GB. It seems like without that, TensorFlow does not know how to free the memory.

ronykalfarisi commented 6 years ago

Hi all, I have a similar problem. I followed @ArjanSchouten's solution and the code ran. However, after around 6000 steps, I still got the OOM error:

INFO:tensorflow:global step 6348: loss = 2.0393 (3.988 sec/step)
    INFO:tensorflow:Saving checkpoint to path /home/ubuntu/crack-detection/structure-crack/models/faster_rcnn_nas_coco_2017_11_08/train/model.ckpt
    INFO:tensorflow:global step 6349: loss = 0.9803 (3.980 sec/step)
    2018-01-25 05:51:25.959402: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 79.73MiB.  Current allocation summary follows.
    ...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,17,17,4032]
     [[Node: MaxPool2D/MaxPool = MaxPool[T=DT_FLOAT, data_format="NHWC", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], 
...

EDIT: The errors show up more frequently now; they appear after 1000-1500 steps. Please help.

PythonImageDeveloper commented 6 years ago

@derekjchow, I am new to TensorFlow. I saw queue_capacity and min_after_dequeue in input_reader.proto under object_detection/protos (Object Detection API). First, I would like to know the meaning of these two parameters, and what their role and effect is. Do they affect the CPU or the GPU, and why? What is the effect of setting them high or low? Please explain a bit more about these.

mpsdskd commented 6 years ago

@oldsqlwnb I'm also having problems with my mere 8GB of RAM, could you please explain how to do that?

I used to pass flattened 2048*2048 binary labels into this function. With 8 GB of RAM it quickly ran out of memory, but once I passed the labels into a tf.Variable and used tf.reshape to flatten the image

oldsqlwnb commented 6 years ago

@mpsdskd effectively I ended up doing this:

reshaped_output = tf.reshape(output, [1, image_width * image_height])
reshaped_ground_truth = tf.reshape(ground_truth, [1, image_width * image_height])

I didn't even need to use a variable. I fed those two tensors into the tf.losses.sigmoid_cross_entropy function.

However, I only have a very shallow neural network (one conv layer and one deconv layer); if I try to increase the depth I again run out of memory very quickly.
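For readers trying to reproduce this, here is a minimal self-contained sketch of the reshape-before-loss pattern described above (TF 1.x; the placeholder shapes and names are illustrative, not from the original post):

import tensorflow as tf  # TF 1.x

image_height, image_width = 2048, 2048  # illustrative dimensions

# Hypothetical per-pixel network output (logits) and binary ground truth.
output = tf.placeholder(tf.float32, [image_height, image_width])
ground_truth = tf.placeholder(tf.float32, [image_height, image_width])

# Flatten both to [1, H*W] before handing them to the loss.
reshaped_output = tf.reshape(output, [1, image_height * image_width])
reshaped_ground_truth = tf.reshape(ground_truth, [1, image_height * image_width])

loss = tf.losses.sigmoid_cross_entropy(
    multi_class_labels=reshaped_ground_truth, logits=reshaped_output)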

yuezhilanyi commented 6 years ago

With a GTX 1060 6G and 16G RAM, a config like this works well:

batch_size: 1
batch_queue_capacity: 100
num_batch_queue_threads: 8
prefetch_queue_capacity: 10

thanks @ArjanSchouten

douglasrizzo commented 6 years ago

I am trying to train a pre-trained MobileNet v1 using the configurations recommended by @ArjanSchouten and @yuezhilanyi . I've tried changing the parameters related to the batch size and queue capacity, but the behavior seems to always be the same. It comes to a point where my RAM and swap space are full and the training process is killed. I blindly tried the following values:

batch_size: 1
batch_queue_capacity: 10
num_batch_queue_threads: 4
prefetch_queue_capacity: 5

and I was able to train for 116 steps until the process was killed again. Any tips?

radzfoto commented 6 years ago

I had the same problem using object_detection/train.py when retraining faster_rcnn_inception_resnet_v2 with TensorFlow 1.8. I tried all the suggestions from @ArjanSchouten and @yuezhilanyi; nothing worked. On a hunch, in the pipeline.config file I added num_readers: 1 to the train_input_reader section. After this, everything worked fine. If anyone can explain why I can only use a single reader, I would appreciate it. My images are large (full HD 1920x1080, but I have the resizer set to 1024 max and 600 min). I am using Ubuntu 16.04 with 32GB of CPU memory and an nVidia Titan Xp with 12GB of GPU memory. Thanks.

klango commented 6 years ago

So, here is my experience with this problem. I hope it helps someone having such an issue.

I tried (a lot of) different model configs and parameters in the .config files.

  • Tensorflow 1.5
  • Only CPU
  • 32GB RAM
  • around 1000 Training Images, where most 4900x3600 pixels (huge)
  • Model used: faster_rcnn_resnet50_coco
  • Nr. of classes: 4

I finally managed to get the RAM steady around 30GB used (but no Killed!) with the following config:

train_config: {
  batch_size: 1
  ...
  batch_queue_capacity: 60
  num_batch_queue_threads: 30
  prefetch_queue_capacity: 40
}

train_input_reader: {
  ...
  queue_capacity: 2
  min_after_dequeue: 1
  num_readers: 1
}

karansomaiah commented 5 years ago

@ArjanSchouten's answer is perfect. If some of you are still facing the issue, try increasing your swap space. This worked for me. If you have a machine with 8 GB of RAM it makes sense to have about 12 GB of swap (or a little more). But realize you should not add too much swap, since it can actually slow down the computations.

thanhnt1995 commented 5 years ago

Configuring the batch size worked for my issue! In my first .config file, the batch size was 24. I then reduced it to 16, and my training still crashed. Finally, I brought it down to 8 and RAM usage went down. The training works well now! But the loss at every step is bigger, and training takes more time and more steps to finish.

ArjanSchouten commented 5 years ago

@thanhnt1995 glad it is working now. You should tune your learning rate I think.

GraceBoston commented 5 years ago

So, here is my experience with this problem. I hope it helps someone having such an issue.

I tried (a lot of) different model configs and parameters in the .config files.

  • Tensorflow 1.5
  • Only CPU
  • 32GB RAM
  • around 1000 Training Images, where most 4900x3600 pixels (huge)
  • Model used: faster_rcnn_resnet50_coco
  • Nr. of classes: 4

I finally managed to get the RAM steady around 30GB used (but no Killed!) with the following config:

train_config: {
  batch_size: 1
        ...
        batch_queue_capacity: 60
        num_batch_queue_threads: 30
        prefetch_queue_capacity: 40
train_input_reader: {
   ...
  queue_capacity: 2
  min_after_dequeue: 1
  num_readers: 1

It's working! My RAM stays steady around 30GB used as well. Thanks @klango

Tato14 commented 5 years ago

So, using @klango's approach, how much time does it take to train? Did you try to use more than one CPU in parallel? Or any other updates on this?

klango commented 5 years ago

So, using @klango's approach, how much time does it take to train? Did you try to use more than one CPU in parallel? Or any other updates on this?

I had 8 CPUs; ~1000 images took me about 10 days to reach 30k steps. Very good results, ~92-99% correct recognition.

Mubashshira commented 5 years ago

But it's showing this error:

Traceback (most recent call last):
  File "train.py", line 184, in <module>
    tf.app.run()
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "train.py", line 93, in main
    FLAGS.pipeline_config_path)
  File "/usr/local/lib/python2.7/dist-packages/object_detection-0.1-py2.7.egg/object_detection/utils/config_util.py", line 98, in get_configs_from_pipeline_file
    text_format.Merge(proto_str, pipeline_config)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 685, in Merge
    allow_unknown_field=allow_unknown_field)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 752, in MergeLines
    return parser.MergeLines(lines, message)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 777, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 799, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 924, in _MergeField
    merger(tokenizer, message, field)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 998, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 924, in _MergeField
    merger(tokenizer, message, field)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 998, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 924, in _MergeField
    merger(tokenizer, message, field)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 998, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 924, in _MergeField
    merger(tokenizer, message, field)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 998, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/home/mubashshira_farooqui2000/.local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 891, in _MergeField
    (message_descriptor.full_name, name))
google.protobuf.text_format.ParseError: 89:9 : Message type "object_detection.protos.LearningRate" has no field named "batch_queue_capacity".

VitalieStirbu commented 4 years ago

@Mubashshira you've added "batch_queue_capacity" in the wrong place; make sure it's right under the "batch_size" property.

VitalieStirbu commented 4 years ago

I also get an out of memory exception.
CUDA 10.0.0, cuDNN 7.6.1, Tensorflow-GPU 1.14.0, Windows 10, NVIDIA GeForce GTX 1070 Ti 8GB

90 images (all 970x730), 1 class, Model: faster_rcnn_nas_coco

Tried different configurations, but it fails even when everything is set low. Config:

train_config: {
  batch_size: 1
  ...
  batch_queue_capacity: 1
  num_batch_queue_threads: 1
  prefetch_queue_capacity: 1
}

train_input_reader: {
  ...
  queue_capacity: 1
  min_after_dequeue: 1
  num_readers: 1
}

liuchangf commented 4 years ago

@VitalieStirbu

I also get out of memory exception: Cuda 10.0.0 Cudnn 7.6.1 Tensorflow-GPU: 1.14.0 Windows 10 NVIDIA GeForce GTX 1070 Ti 8GB


You can try my solution here: https://github.com/tensorflow/models/issues/5296#issuecomment-501084212

adoku14 commented 4 years ago

So, here is my experience with this problem. I hope it helps someone having such an issue.

I tried (a lot of) different model configs and parameters in the .config files.

  • Tensorflow 1.5
  • Only CPU
  • 32GB RAM
  • around 1000 Training Images, where most 4900x3600 pixels (huge)
  • Model used: faster_rcnn_resnet50_coco
  • Nr. of classes: 4

I finally managed to get the RAM steady around 30GB used (but no Killed!) with the following config:

train_config: {
  batch_size: 1
        ...
        batch_queue_capacity: 60
        num_batch_queue_threads: 30
        prefetch_queue_capacity: 40
train_input_reader: {
   ...
  queue_capacity: 2
  min_after_dequeue: 1
  num_readers: 1

I tried everything and nothing worked for me except the solution from @klango and @radzfoto. Thnx guys!