Closed: liasece closed this issue 1 month ago.
@liasece A memory error indicates that the process has consumed all of the available RAM. You may want to reduce the batch size/image size and try again.
This is exactly what is not normal: I have enough memory on my system. Please pay attention to what I am describing.
@liasece I ran the shared code and got a different error; please find the gist here. Could you share a Colab gist that reproduces the error you reported?
@Saduf2019 The problem does not seem to be reproducible in your Colab environment; reproducing it probably requires Windows, a GPU, and a frozen_batch_size tuned to the machine.
This is what is not normal, I have enough memory in my system
When it says "Unable to allocate 184. MiB", it means that it was not able to allocate 184M on top of everything it has allocated. So this isn't surprising by itself -- imagine that all other tensors consume 5G-10M and so the 184M memory allocation fails.
Do you have other reasons to believe that the 184M allocation should succeed?
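As a side note, the size of the request itself is easy to verify from the shape and dtype in the error message; the following is a standalone back-of-the-envelope sketch, not part of the original report:

```python
import numpy as np

# Shape and dtype are taken from the MemoryError message;
# everything else here is only for illustration.
shape = (64, 26, 26, 3, 371)
itemsize = np.dtype("float32").itemsize  # 4 bytes per element

request_mib = int(np.prod(shape)) * itemsize / 2**20
print(f"requested block: {request_mib:.1f} MiB")  # ~183.7 MiB, i.e. the reported "184. MiB"

# Whether such a request succeeds depends only on the headroom left for the
# process at that instant, not on how much free RAM the machine reports:
# if everything else already uses all but ~10 MiB of the usable budget,
# this 184 MiB allocation fails.
```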
Indeed, I can't rule out that my performance monitoring tool is missing memory usage samples and that the real graph may have spikes that it doesn't catch.
I'm getting a ResourceExhaustedError in the middle of training, and I'm assuming that shouldn't be possible: either it fails in the very first iteration, showing that my GPU doesn't have enough memory to train the model, or it works until the end.
In reality, my RAM and GPU RAM footprint is flat during training, and my inputs are the same size for each training step, so no particular step should require an unusual amount of memory.
It should not fail with 70 GB (70%) of RAM and 5 GB (20%) of GPU RAM still free on my system.
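To rule out short-lived spikes between the samples my monitoring tool takes, I could log the process footprint once per batch. A rough sketch (psutil and the callback below are additions for debugging, not part of my training script):

```python
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Logs the process's resident memory (RSS) once per training batch."""

    def __init__(self):
        super().__init__()
        self._proc = psutil.Process()  # current Python process

    def on_train_batch_end(self, batch, logs=None):
        rss_mib = self._proc.memory_info().rss / 2**20
        if batch % 10 == 0:  # keep the log readable
            print(f"batch {batch}: RSS = {rss_mib:.0f} MiB")

# Usage (placeholder names): model.fit(dataset, epochs=..., callbacks=[MemoryLogger()])
```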
I'm getting a ResourceExhaustedError in the middle of training, and I'm assuming that shouldn't be possible.
That is surprising, but it is not a logical impossibility. Maybe there is a memory leak somewhere; have you tried using the TF memory profiler?
As for why TF is OOMing with 70 GB of CPU memory left over, can you try running with TF_CPP_VMODULE=process_state=2 and attach the logs? That might give us a clue.
CC @ezhulenev in case he has seen this before.
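In case it helps, capturing a profile around the failing step looks roughly like this (a minimal sketch assuming TF 2.x; the log directory and the traced section are placeholders). The Memory Profile tab in TensorBoard's Profile plugin then shows allocator usage over time:

```python
import tensorflow as tf

# Minimal sketch: trace a slice of training and inspect the result in
# TensorBoard (Profile -> Memory Profile). "logdir" is an arbitrary path.
tf.profiler.experimental.start("logdir")
# ... run the step(s) suspected of OOMing, e.g. model.fit(dataset, epochs=1) ...
tf.profiler.experimental.stop()

# The allocator logging suggested above must be set in the environment
# before the process starts, e.g.:
#   set TF_CPP_VMODULE=process_state=2       (Windows cmd)
#   export TF_CPP_VMODULE=process_state=2    (bash)
```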
I am running into a similar situation with LSTMs. Knowing it might be a bad idea, I lowered my sequence length from 32 -> 16 -> 8 -> 4; at 4 it finally gets past epoch 60 (still running).
It was certainly a RAM (16 GB) resource exhaustion.
It's interesting to see that a machine with 70 GB still managed to exhaust its resources.
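For a rough sense of why cutting the sequence length helps so much, a back-of-the-envelope sketch (the batch size and hidden size below are made up, and this ignores weights, optimizer state, and framework overhead):

```python
# Activations kept for backprop in an unrolled LSTM grow roughly linearly
# with sequence length, so going 32 -> 4 cuts this part of the footprint
# by about 8x. All numbers are illustrative only.
batch, hidden = 256, 512                   # made-up values
bytes_per_step = batch * hidden * 4 * 4    # ~4 gate activations, float32

for seq_len in (32, 16, 8, 4):
    print(f"seq_len={seq_len}: ~{seq_len * bytes_per_step / 2**20:.0f} MiB per layer")
```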
Has this problem been solved yet?
The problem still persists.
I face a similar issue while training a convolutional BiLSTM2D network with 80 GB RAM and 90 GB GPU RAM. Has the issue been resolved?
Hi,
Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may no longer be relevant to the current state of the code base.
The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate.
Please follow the release notes to stay up to date with the latest developments happening in the TensorFlow space.
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.
This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.
For similar questions see: #38414
System information
pip install tensorflow-gpu==2.4.0rc3
Describe the current behavior
An error occurs when training reaches the second epoch:
MemoryError: Unable to allocate 184. MiB for an array with shape (64, 26, 26, 3, 371) and data type float32
When the problem occurred, I still had 70 GB of RAM and 5 GB of video memory free on my system, yet it was "Unable to allocate 184. MiB". The problem can be hidden by simply reducing frozen_batch_size from 64 to 32, i.e., by reducing the batch size.
Describe the expected behavior
There should be no errors.
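For context on why halving frozen_batch_size hides the error: the leading dimension of the failing array matches the batch size, so that one allocation scales linearly with it (rough arithmetic only; no interpretation of the other dimensions is needed for this point):

```python
# Rough arithmetic: the failing array has shape (batch, 26, 26, 3, 371),
# float32, so halving the batch halves this particular request.
for batch in (64, 32):
    mib = batch * 26 * 26 * 3 * 371 * 4 / 2**20
    print(f"frozen_batch_size={batch}: ~{mib:.0f} MiB")
# 64 -> ~184 MiB, 32 -> ~92 MiB; this hides the symptom without explaining
# why the allocation fails in the first place.
```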
Standalone code to reproduce the issue
For code and data, please see: https://github.com/liasece/tf-38414
Other info / logs
15/15 [==============================] - ETA: 0s - loss: 7711.7482
2020-11-28 09:40:12.434912: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride -1)
2020-11-28 09:40:12.449082: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride -1)
2020-11-28 09:40:12.582863: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride -1)
2020-11-28 09:40:14.685870: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,647,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.686045: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.686110: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.686801: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.687007: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,826,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.686816: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.687972: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,769,512,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.689583: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,768,512,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.689721: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,678,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.690045: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,768,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.690171: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,683,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.772685: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,770,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.775723: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,640,640,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.776161: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.777899: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,776,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.782664: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,683,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.791474: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,932,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.791955: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.792808: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.793188: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,559,512,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.793469: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,640,494,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.793868: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.816077: W tensorflow/core/framework/op_kernel.cc:1751] Resource exhausted: MemoryError: Unable to allocate 184. MiB for an array with shape (64, 26, 26, 3, 371) and data type float32
Traceback (most recent call last):
File "R:\ProgramData\Anaconda3\envs\ml\lib\site-packages\tensorflow\python\ops\script_ops.py", line 247, in call return func(device, token, args)
File "R:\ProgramData\Anaconda3\envs\ml\lib\site-packages\tensorflow\python\ops\script_ops.py", line 135, in call ret = self._func(*args)
File "R:\ProgramData\Anaconda3\envs\ml\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 620, in wrapper return func(*args, **kwargs)
File "R:\ml\bug-test\xyolo\yolo3\utils.py", line 358, in preprocess_true_boxes_xyolo dtype='float32') for l in range(num_layers)]
File "R:\ml\bug-test\xyolo\yolo3\utils.py", line 358, in
dtype='float32') for l in range(num_layers)]
MemoryError: Unable to allocate 184. MiB for an array with shape (64, 26, 26, 3, 371) and data type float32