tensorflow / models

Models and examples built with TensorFlow

Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] #8487

Open sainisanjay opened 4 years ago

sainisanjay commented 4 years ago

Error occurred during object detection training with the ssd_mobilenet_v2_quantized_300x300_coco model. I am running the command below to start the training:

python ../../models/research/object_detection/model_main.py --pipeline_config_path=./ssd_mobilenet_v2_quantized_300x300_coco.config --model_dir=./training/ --num_train_steps=2000000 --sample_1_of_n_eval_examples=1 --alsologtostderr

Training was going fine until step 47900; after that I got this error:

File "/home/saini/.virtualenvs/cv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradients/AddN_162-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Loss/Cast_232/_16919]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradients/AddN_162-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
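
For reference, the hint in the log refers to the RunOptions field shown below. This is a rough TF 1.x sketch for a standalone session only (model_main.py's Estimator loop does not expose options= directly), so treat it as a debugging aid:

import tensorflow as tf  # TF 1.x

# A toy op stands in for the real training step.
x = tf.random.normal([1024, 1024])
y = tf.matmul(x, x)

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    # If this run hits OOM, TF also prints the list of currently allocated tensors.
    sess.run(y, options=run_options)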

OS: Ubuntu 18.04
TensorFlow: 1.14.0 (GPU)
CUDA: 10.0
cuDNN: 7.6
Batch size: 32

Following are the changes I have made to the default TF Object Detection API:

model_lib.py (evaluate less often by raising throttle_secs):
tf.estimator.EvalSpec(
    name=eval_spec_name,
    input_fn=eval_input_fn,
    steps=None,
    throttle_secs=172800,
    exporters=exporter))

eval.proto:
optional uint32 eval_interval_secs = 3 [default = 172800];  // default is 600

model_main.py (checkpoint every 5000 steps):
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=5000)
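
(Not one of my changes, but noting it here: the same RunConfig can also take a session_config, which is where GPU options such as allow_growth would go. Untested sketch:)

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True  # grab GPU memory on demand instead of all up front
config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=5000,
    session_config=session_config)
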
sainisanjay commented 4 years ago

Note: the error occurred after 47900 steps. My question is why the error shows up after 47900 steps and not at the initial steps.

VismayTandel commented 4 years ago

OOM means out of memory. Maybe at the initial steps the memory is mostly free, but as the steps go up more memory gets used. I think that's why it happens.
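
One way to check whether memory really keeps climbing across steps (a rough, untested sketch for TF 1.x; tf.contrib is gone in TF 2.x):

import tensorflow as tf  # TF 1.x

# Op reporting the peak bytes the GPU allocator has handed out so far.
peak_bytes = tf.contrib.memory_stats.MaxBytesInUse()

with tf.Session() as sess:
    # Run this every few hundred training steps and watch whether the number grows.
    print("peak GPU bytes in use:", sess.run(peak_bytes))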

sainisanjay commented 4 years ago

@VismayTandel Yes, you are right: OOM means out of memory. But the image size and batch size are the same throughout training, so how can later steps need more memory? Furthermore, I am not running any other program that uses GPU memory. That's why it is quite strange to me that the GPU runs out of memory in the middle of training.

alvianihza commented 4 years ago

I changed the batch size to 16, and then everything worked fine.
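
For the Object Detection API configs used in this issue, the batch size sits under train_config in the .config file; editing the file by hand works, or it can be changed programmatically (untested sketch, file names are illustrative):

from object_detection.utils import config_util

configs = config_util.get_configs_from_pipeline_file(
    "ssd_mobilenet_v2_quantized_300x300_coco.config")
configs["train_config"].batch_size = 16  # was 32

pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "./")  # writes ./pipeline.config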

chudur-budur commented 3 years ago

Same problem here. I tried a smaller batch_size and/or learning_rate but it still doesn't work; it gives me the same error.

They should ship a small example dataset and config file with each official model, so we could at least tell the training isn't going anywhere before waiting 3~4 hours.

ya332 commented 3 years ago

I am getting the same error:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_generator_v1.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing)
    591         shuffle=shuffle,
    592         initial_epoch=initial_epoch,
--> 593         steps_name='steps_per_epoch')
    594 
    595   def evaluate(self,

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_generator_v1.py in model_iteration(model, data, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch, mode, batch_size, steps_name, **kwargs)
    257 
    258       is_deferred = not model._is_compiled
--> 259       batch_outs = batch_function(*batch_data)
    260       if not isinstance(batch_outs, list):
    261         batch_outs = [batch_outs]

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_v1.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
   1086       self._update_sample_weight_modes(sample_weights=sample_weights)
   1087       self._make_train_function()
-> 1088       outputs = self.train_function(ins)  # pylint: disable=not-callable
   1089 
   1090     if reset_metrics:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3955 
   3956     fetched = self._callable_fn(*array_vals,
-> 3957                                 run_metadata=self.run_metadata)
   3958     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3959     output_structure = nest.pack_sequence_as(

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1480         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1481                                                self._handle, args,
-> 1482                                                run_metadata_ptr)
   1483         if run_metadata:
   1484           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: OOM when allocating tensor with shape[32768,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_46/Adam/Adam/update_dense_22/kernel/ResourceApplyAdam}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Does anyone have any idea? How could later steps of the training consume more memory?

jafarMajidpour commented 3 years ago

Thanks, I changed the batch size to 16 and my problem is solved.

xsqian commented 2 years ago

I have the same error messages (I am running TensorFlow 2.4.1):

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,64,224,896] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node model_1/batch_normalization_8/FusedBatchNormV3 (defined at /opt/conda/lib/python3.7/site-packages/mlrun/frameworks/keras/mlrun_interface.py:123) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_4513]
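
For TF 2.x, the rough equivalent of the old allow_growth option is per-GPU memory growth, which has to be set before anything touches the GPU (a sketch, not specific to the mlrun setup above):

import tensorflow as tf  # TF 2.x

# Must run before the first GPU op, otherwise TF raises a RuntimeError.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)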

KaviyaSubramanian706 commented 2 years ago

I had the same issue. I reduced the batch size and tried again, but it was the same; I tried setting allow_growth, still the same. On rebooting the system the problem went away, but killing the training partway through and starting it again made the issue come back.
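
The reboot/rebounce pattern usually points to a killed run leaving a process that still holds GPU memory; it is worth checking before relaunching (sketch assumes nvidia-smi is on the PATH):

import subprocess

# Same as running `nvidia-smi` in a terminal: lists processes still holding
# GPU memory so stale python workers can be killed before restarting training.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)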

pranavdurai10 commented 2 years ago

I was trying to train a Vision Transformer on CIFAR-100 dataset.

GPU: GTX 1650 with 4 GB VRAM

Initially, I had the batch_size set to 256, which was totally insane for such a GPU, and I was getting the same OOM error.

I tweaked it to batch_size = 16, and training works perfectly fine.

_So, always choose a smaller batch_size if you are training on laptops or mid-range GPUs_.

Hope this helps!

Pun-it commented 8 months ago

I was trying to train an autoencoder and encountered the same error even with smaller batches, but prefetching the data made the error go away. 😄
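
By prefetching I mean the standard tf.data transform (a minimal sketch; tf.data.AUTOTUNE needs TF 2.4+, older versions use tf.data.experimental.AUTOTUNE):

import tensorflow as tf

dataset = (
    tf.data.Dataset.from_tensor_slices(tf.random.normal([1024, 32]))
    .batch(8)
    .prefetch(tf.data.AUTOTUNE)  # overlap host-side input preparation with the training step
)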

febinmathew commented 6 months ago

@sainisanjay Have you found any solution for this issue? I ran a DNN program with both TensorFlow and PyTorch; it works fine with PyTorch, but TensorFlow throws an OOM error after a few episodes! The program works fine in TensorFlow if the batch size is reduced to 16, but that feels like a bummer.