Closed: liasece closed this issue 1 month ago.
@liasece A memory error indicates that the process has consumed all of the available RAM. You may want to reduce the batch size/image size and try again.
This is exactly what is not normal: I have enough memory on my system. Please pay attention to what I am describing.
@liasece I ran the shared code and got a different error; please find the gist here. Could you share a Colab gist that reproduces the error you reported?
@Saduf2019 The problem does not seem to be reproducible in your Colab environment; reproducing it probably requires Windows, a GPU, and a frozen_batch_size tuned to the machine.
This is what is not normal, I have enough memory in my system
When it says "Unable to allocate 184. MiB", it means that it was not able to allocate 184M on top of everything it has allocated. So this isn't surprising by itself -- imagine that all other tensors consume 5G-10M and so the 184M memory allocation fails.
Do you have other reasons to believe that the 184M allocation should succeed?
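As a side note, the size of the request itself is easy to verify from the shape and dtype in the error message; the following is a standalone back-of-the-envelope sketch, not part of the original report:

```python
import numpy as np

# Shape and dtype are taken from the MemoryError message;
# everything else here is only for illustration.
shape = (64, 26, 26, 3, 371)
itemsize = np.dtype("float32").itemsize  # 4 bytes per element

request_mib = int(np.prod(shape)) * itemsize / 2**20
print(f"requested block: {request_mib:.1f} MiB")  # ~183.7 MiB, i.e. the reported "184. MiB"

# Whether such a request succeeds depends only on the headroom left for the
# process at that instant, not on how much free RAM the machine reports:
# if everything else already uses all but ~10 MiB of the usable budget,
# this 184 MiB allocation fails.
```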
Indeed, I can't rule out that my performance monitoring tool is missing memory usage samples and that the real graph may have spikes that it doesn't catch.
I'm getting a ResourceExhaustedError in the middle of training, and I'm assuming that shouldn't be possible: either it fails in the very first iteration, showing that my GPU doesn't have enough memory to train the model, or it works until the end.
In reality, my RAM and GPU RAM footprint is flat during training, and my inputs are the same size for each training step, so no particular step should require an unusual amount of memory.
It should not fail with 70 GB (70%) of RAM and 5 GB (20%) of GPU RAM still free on my system.
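To rule out short-lived spikes between the samples my monitoring tool takes, I could log the process footprint once per batch. A rough sketch (psutil and the callback below are additions for debugging, not part of my training script):

```python
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Logs the process's resident memory (RSS) once per training batch."""

    def __init__(self):
        super().__init__()
        self._proc = psutil.Process()  # current Python process

    def on_train_batch_end(self, batch, logs=None):
        rss_mib = self._proc.memory_info().rss / 2**20
        if batch % 10 == 0:  # keep the log readable
            print(f"batch {batch}: RSS = {rss_mib:.0f} MiB")

# Usage (placeholder names): model.fit(dataset, epochs=..., callbacks=[MemoryLogger()])
```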
I'm getting a ResourceExhaustedError in the middle of training, and I'm assuming that shouldn't be possible.
That is surprising, but it is not a logical impossibility. Maybe there is a memory leak somewhere; have you tried using the TF memory profiler?
As for why TF is OOMing with 70 GB of CPU memory left over, can you try running with TF_CPP_VMODULE=process_state=2 and attach the logs? That might give us a clue.
CC @ezhulenev in case he has seen this before.
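In case it helps, capturing a profile around the failing step looks roughly like this (a minimal sketch assuming TF 2.x; the log directory and the traced section are placeholders). The Memory Profile tab in TensorBoard's Profile plugin then shows allocator usage over time:

```python
import tensorflow as tf

# Minimal sketch: trace a slice of training and inspect the result in
# TensorBoard (Profile -> Memory Profile). "logdir" is an arbitrary path.
tf.profiler.experimental.start("logdir")
# ... run the step(s) suspected of OOMing, e.g. model.fit(dataset, epochs=1) ...
tf.profiler.experimental.stop()

# The allocator logging suggested above must be set in the environment
# before the process starts, e.g.:
#   set TF_CPP_VMODULE=process_state=2       (Windows cmd)
#   export TF_CPP_VMODULE=process_state=2    (bash)
```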
I am running into a similar situation with LSTMs. Knowing it might be a bad idea, I lowered my sequence length from 32 -> 16 -> 8 -> 4; at 4 it finally gets past epoch 60 (still running).
It was certainly a RAM (16 GB) resource exhaustion.
It's interesting to see that a machine with 70 GB still managed to exhaust its resources.
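For a rough sense of why cutting the sequence length helps so much, a back-of-the-envelope sketch (the batch size and hidden size below are made up, and this ignores weights, optimizer state, and framework overhead):

```python
# Activations kept for backprop in an unrolled LSTM grow roughly linearly
# with sequence length, so going 32 -> 4 cuts this part of the footprint
# by about 8x. All numbers are illustrative only.
batch, hidden = 256, 512                   # made-up values
bytes_per_step = batch * hidden * 4 * 4    # ~4 gate activations, float32

for seq_len in (32, 16, 8, 4):
    print(f"seq_len={seq_len}: ~{seq_len * bytes_per_step / 2**20:.0f} MiB per layer")
```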
Has this problem been solved yet?
The problem still persists.
I face a similar issue while training a convolutional BiLSTM2D network with 80 GB RAM and 90 GB GPU RAM. Has the issue been resolved?
Hi,
Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may no longer be relevant to the current state of the code base.
The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate.
Please follow the release notes to stay up to date with the latest developments happening in the TensorFlow space.
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.
This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.
For similar questions see: #38414
System information
pip install tensorflow-gpu==2.4.0rc3
Describe the current behavior
An error occurs when training reaches the second epoch:
MemoryError: Unable to allocate 184. MiB for an array with shape (64, 26, 26, 3, 371) and data type float32
When the problem occurred, I still had 70 GB of RAM and 5 GB of video memory free on my system, yet it was "Unable to allocate 184. MiB". The problem can be hidden by simply reducing frozen_batch_size from 64 to 32, i.e., by reducing the batch size.
Describe the expected behavior
There should be no errors.
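For context on why halving frozen_batch_size hides the error: the leading dimension of the failing array matches the batch size, so that one allocation scales linearly with it (rough arithmetic only; no interpretation of the other dimensions is needed for this point):

```python
# Rough arithmetic: the failing array has shape (batch, 26, 26, 3, 371),
# float32, so halving the batch halves this particular request.
for batch in (64, 32):
    mib = batch * 26 * 26 * 3 * 371 * 4 / 2**20
    print(f"frozen_batch_size={batch}: ~{mib:.0f} MiB")
# 64 -> ~184 MiB, 32 -> ~92 MiB; this hides the symptom without explaining
# why the allocation fails in the first place.
```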
Standalone code to reproduce the issue
For code and data, please see: https://github.com/liasece/tf-38414
Other info / logs
15/15 [==============================] - ETA: 0s - loss: 7711.7482
2020-11-28 09:40:12.434912: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride -1)
2020-11-28 09:40:12.449082: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride -1)
2020-11-28 09:40:12.582863: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride -1)
2020-11-28 09:40:14.685870: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,647,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.686045: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.686110: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.686801: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.687007: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,826,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.686816: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.687972: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,769,512,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.689583: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,768,512,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.689721: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,678,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.690045: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,768,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.690171: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,683,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.772685: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,770,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.775723: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,640,640,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.776161: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.777899: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,776,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.782664: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,683,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.791474: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,512,932,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.791955: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.792808: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.793188: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,559,512,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.793469: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cast_op.cc:109 : Resource exhausted: OOM when allocating tensor with shape[1,640,494,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.793868: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at image_resizer_state.h:142 : Resource exhausted: OOM when allocating tensor with shape[1,416,416,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-11-28 09:40:14.816077: W tensorflow/core/framework/op_kernel.cc:1751] Resource exhausted: MemoryError: Unable to allocate 184. MiB for an array with shape (64, 26, 26, 3, 371) and data type float32
Traceback (most recent call last):
File "R:\ProgramData\Anaconda3\envs\ml\lib\site-packages\tensorflow\python\ops\script_ops.py", line 247, in call return func(device, token, args)
File "R:\ProgramData\Anaconda3\envs\ml\lib\site-packages\tensorflow\python\ops\script_ops.py", line 135, in call ret = self._func(*args)
File "R:\ProgramData\Anaconda3\envs\ml\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 620, in wrapper return func(*args, **kwargs)
File "R:\ml\bug-test\xyolo\yolo3\utils.py", line 358, in preprocess_true_boxes_xyolo dtype='float32') for l in range(num_layers)]
File "R:\ml\bug-test\xyolo\yolo3\utils.py", line 358, in
dtype='float32') for l in range(num_layers)]
MemoryError: Unable to allocate 184. MiB for an array with shape (64, 26, 26, 3, 371) and data type float32