tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[30003] and type half on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc #24537

Status: Closed (monjoybme closed this issue 5 years ago)

monjoybme commented 5 years ago

I am working on a sparse autoencoder model which has 15 convolution layers and 21 transpose convolution layers. I am running my code on a multi-GPU system. The code runs well on a small dataset, but I get an OOM (resource exhausted) error when running on a huge dataset. I changed the batch size to 8 but am still facing the same error. Any help will be appreciated.

Traceback:

[[Node: tower_1/DecodeRaw/_193 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_15_tower_1/DecodeRaw", tensor_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Exception in thread QueueRunnerThread-tower_1/shuffle_batch/random_shuffle_queue-tower_1/shuffle_batch/random_shuffle_queue_enqueue:
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
    enqueue_callable()
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1205, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[30003] and type half on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
  [[Node: tower_1/DecodeRaw = DecodeRaw[little_endian=true, out_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: tower_1/DecodeRaw/_193 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_15_tower_1/DecodeRaw", tensor_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
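For reference, the hinted report_tensor_allocations_upon_oom option can be passed through RunOptions in TF 1.x roughly as follows. This is only a minimal sketch; train_op, loss, and sess stand in for the issue's own graph and session.

import tensorflow as tf

# Ask the runtime to dump the list of live tensor allocations if an OOM occurs.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
_, loss_value = sess.run([train_op, loss], options=run_options)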

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux (CentOS 7)
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.10
Python version: 3.6.5_1
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): N/A
CUDA/cuDNN version: Cuda compilation tools, release 9.0, V9.0.176
GPU model and memory: NVIDIA TITAN V (4 GPUs)
Exact command to reproduce: (see above)

msymp commented 5 years ago

Hello, your issue template looks fine and you are running the r1.10 binaries on Linux. Also try cuDNN 7.0 when you tweak the batch sizes, just to rule out dependency issues.

OOM errors are generally associated with TensorFlow's tendency to greedily allocate all available GPU memory to new sessions, in the order they are created, until it is exhausted. You may need to configure TensorFlow as suggested by Scott in this Stack Overflow question: https://stackoverflow.com/questions/51310257/tensorflow-gpu-python-resource-exhausted-error-in-cluster. Please let us know how your sparse autoencoder progresses. Thanks.
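A minimal sketch of the kind of per-session configuration the linked answer points at, for TF 1.x; whether allow_growth or a fixed per_process_gpu_memory_fraction is the better fit depends on how the four towers share the GPUs:

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
# Grow the GPU allocation on demand instead of reserving everything up front.
config.gpu_options.allow_growth = True
# Alternatively, cap this process at a fixed share of each GPU:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())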

monjoybme commented 5 years ago

I already applied the above suggestions, but there is no improvement; I am still facing the issue. I am now also seeing:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'tower_0/input_producer/Assert/Assert': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. Registered kernels: device='CPU

monjoybme commented 5 years ago

My code snippet:

config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=FLAGS.log_device_placement)
# Cap this process at 40% of each GPU's memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.4
sess = tf.Session(config=config)
sess.run(init)

Error:

Using TensorFlow backend.
ERROR:tensorflow:Exception in QueueRunner: Dst tensor is not initialized.
  [[Node: tower_0/per_image_standardization/_185 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_48_tower_0/per_image_standardization", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
ERROR:tensorflow:Exception in QueueRunner: Dst tensor is not initialized.
  [[Node: tower_3/per_image_standardization/_223 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:3", send_device_incarnation=1, tensor_name="edge_48_tower_3/per_image_standardization", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
ERROR:tensorflow:Exception in QueueRunner: Dst tensor is not initialized.
  [[Node: tower_1/per_image_standardization/_191 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_48_tower_1/per_image_standardization", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
ERROR:tensorflow:Exception in QueueRunner: Dst tensor is not initialized.
  [[Node: tower_2/per_image_standardization/_207 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:2", send_device_incarnation=1, tensor_name="edge_48_tower_2/per_image_standardization", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Any help will be highly appreciated.
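One more avenue that may be worth trying, since the original OOM comes from the cuda_host_bfc (pinned host memory) allocator feeding tower_1/shuffle_batch: shrink the input queue so fewer decoded records sit in pinned memory at once. A rough sketch only, where decoded_example is a hypothetical stand-in for whatever the DecodeRaw pipeline actually produces and the numbers are purely illustrative:

import tensorflow as tf

images = tf.train.shuffle_batch(
    [decoded_example],
    batch_size=8,
    capacity=256,          # smaller queue -> less pinned host memory
    min_after_dequeue=64,
    num_threads=4)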

msymp commented 5 years ago

Hello, the other two errors are not aligned with the original OOM error. Please use Stack Overflow to investigate the various errors being thrown by the autoencoder code and track down their solutions: https://stackoverflow.com/search?q=tensorflow