tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

memory issues #492

Closed zcyang closed 8 years ago

zcyang commented 8 years ago

Hi,

It seems that TensorFlow's memory allocation is rather inefficient. I have been running a single-layer RNN with batch size 256, sequence length 124, and hidden dimension 512, and it constantly hits out-of-memory errors on my 4GB GTX 980. In theory, the model should need well under 1GB.

No matter what batch size I set, it always uses up all 4GB of memory, which seems unreasonable. I compiled TensorFlow from source, and the BFC memory allocator is set as the default.

I think this memory problem was also mentioned here https://github.com/soumith/convnet-benchmarks/issues/66 and by many other users. Compared with Theano and Torch, TensorFlow can only run smaller models.

Are there any solutions to this? This is a major problem that stops me from experimenting with tensorflow.

Many thanks!!

NickShahML commented 8 years ago

Hey man, I have had and still have the same issue and asked about it here:

https://github.com/tensorflow/tensorflow/issues/352

You can set aggregation_method = 2, which helped me somewhat. But you're still right: TensorFlow sucks up proportionally way too much memory. It has been difficult to deal with. If they could fix this one aspect, it would be a real game changer.
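
For reference, a minimal sketch of what that looks like (the loss below is a dummy just to make the snippet self-contained; aggregation_method=2 corresponds, if I read the enum right, to tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N, which accumulates gradients incrementally instead of holding all of them before a single add_n):

import tensorflow as tf

# Dummy variable and loss, only to make the sketch runnable
x = tf.Variable(tf.random_normal([1000, 1000]))
loss = tf.reduce_sum(tf.square(tf.matmul(x, x)))

# Pass aggregation_method=2 (EXPERIMENTAL_ACCUMULATE_N) when computing gradients
opt = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = opt.compute_gradients(
    loss, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
train_op = opt.apply_gradients(grads_and_vars)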

vincentvanhoucke commented 8 years ago

Stay tuned, improving memory usage and management is at the top of our list.

NickShahML commented 8 years ago

That's great to hear. I love TensorFlow, and with improved memory usage it would be the best deep learning platform, in my opinion.

vrv commented 8 years ago

With hundreds of issues still open, this is too general a request to be useful to keep open -- we're constantly going to be trying to improve performance and memory usage, of course.

NickShahML commented 8 years ago

@vrv can you comment on whether 0.7.0 has improved memory allocation? I have hesitated to upgrade to 0.7.0 due to reported issues with the Saver function.

pooyam commented 8 years ago

Hello,

I am using the GPU-enabled version of TensorFlow 0.8; my system has 4 GPUs with approximately 12GB of memory each. I am running a CNN in TensorFlow on grayscale (single-channel) images of size 1024x1024; the total number of training images is 1520, the batch size is 7, and the CNN has 64 feature maps per convolutional layer. Following various web pages discussing TensorFlow's memory issues, I am using the BFC memory allocator in the session configuration of my CNN. Unfortunately, after around 10 epochs I get the "ran out of memory" error. The error log is below; it doesn't make sense to me given the batch size, image dimensions, and the memory size of my GPUs. Any help is appreciated.
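
(For reference, selecting the BFC allocator in the session config looks roughly like this; this is only a sketch, not my actual training script:)

import tensorflow as tf

# Sketch: request the BFC allocator explicitly and let the pool grow on
# demand instead of reserving all GPU memory up front.
config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)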

Error log:

I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (1024): Total Chunks: 1, Chunks in use: 0 1.8KiB allocated for chunks. 256B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (8388608): Total Chunks: 1, Chunks in use: 0 8.00MiB allocated for chunks. 400.0KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 
0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (33554432): Total Chunks: 1, Chunks in use: 0 36.53MiB allocated for chunks. 400.0KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (134217728): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (268435456): Total Chunks: 2, Chunks in use: 0 2.21GiB allocated for chunks. 896.00MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:652] Bin for 1.75GiB was 256.00MiB, Chunk State: I tensorflow/core/common_runtime/bfc_allocator.cc:658] Size: 508.67MiB | Requested Size: 448.00MiB | in_use: 0, prev: Size: 448.00MiB | Requested Size: 448.00MiB | in_use: 1, next: Size: 1.75GiB | Requested Size: 1.75GiB | in_use: 1 I tensorflow/core/common_runtime/bfc_allocator.cc:658] Size: 1.72GiB | Requested Size: 448.00MiB | in_use: 0, prev: Size: 448.00MiB | Requested Size: 448.00MiB | in_use: 1, next: Size: 6.2KiB | Requested Size: 6.2KiB | in_use: 1 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80000 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80100 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80200 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80300 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80400 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80500 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80600 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80700 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80800 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80900 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80a00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80b00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80c00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80d00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80e00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb80f00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb81000 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb81100 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb81200 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb81300 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb81400 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb81500 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 
0x230fb81600 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb81700 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb81f00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb82000 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb82100 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb82200 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb82300 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb82400 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x230fb82500 of size 6400 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x231200b200 of size 29360128 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x231440b200 of size 469762048 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x239e15e100 of size 6400 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x239e15fa00 of size 469762048 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x23ba15fa00 of size 469762048 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x23f5e0b200 of size 1879048192 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x2465e0b200 of size 1879048192 I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x24d5e0b200 of size 3733947904 I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x230fb81800 of size 1792 I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x230fb83e00 of size 38302720 I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x2313c0b200 of size 8388608 I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x233040b200 of size 1842687744 I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x23d615fa00 of size 533379072 I tensorflow/core/common_runtime/bfc_allocator.cc:685] Summary of in-use Chunks by size: I tensorflow/core/common_runtime/bfc_allocator.cc:688] 30 Chunks of size 256 totalling 7.5KiB I tensorflow/core/common_runtime/bfc_allocator.cc:688] 2 Chunks of size 6400 totalling 12.5KiB I tensorflow/core/common_runtime/bfc_allocator.cc:688] 1 Chunks of size 29360128 totalling 28.00MiB I tensorflow/core/common_runtime/bfc_allocator.cc:688] 3 Chunks of size 469762048 totalling 1.31GiB I tensorflow/core/common_runtime/bfc_allocator.cc:688] 2 Chunks of size 1879048192 totalling 3.50GiB I tensorflow/core/common_runtime/bfc_allocator.cc:688] 1 Chunks of size 3733947904 totalling 3.48GiB I tensorflow/core/common_runtime/bfc_allocator.cc:692] Sum Total of in-use chunks: 8.32GiB I tensorflow/core/common_runtime/bfc_allocator.cc:694] Stats: Limit: 11353470976 InUse: 8930711040 MaxInUse: 11319364608 NumAllocs: 1495009 MaxAllocSize: 4804771072

W tensorflow/core/common_runtime/bfcallocator.cc:270] ****____*****____**_*****xxxxxxxxxxxxxxxx W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 1.75GiB. See logs for memory state. W tensorflow/core/framework/op_kernel.cc:900] Resource exhausted: OOM when allocating tensor with shape[7,64,1024,1024] Traceback (most recent call last): File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 342, in tf.app.run() File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run sys.exit(main(sys.argv)) File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 338, in main train() File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gputrain.py", line 256, in train , loss_value, mse_value, mymse_value, logits_value, labels_value = sess.run([train_op, loss, mse, mymse, logits, labels]) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 340, in run run_metadata_ptr) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 564, in _run feed_dict_string, options, run_metadata) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 637, in _do_run target_list, options, run_metadata) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 659, in _do_call e.code) tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[7,64,1024,1024] [[Node: tower_3/gradients/tower_3/pool1_grad/MaxPoolGrad = MaxPoolGrad[data_format="NHWC", ksize=[1, 3, 3, 1], padding="SAME", strides=[1, 2, 2, 1], _device="/job:localhost/replica:0/task:0/gpu:3"](tower_3/conv1/conv1, tower_3/pool1, tower_3/gradients/tower_3/norm1_grad/LRNGrad/_2719)]] [[Node: tower_3/gradients/tower_3/conv1/BiasAdd_grad/tuple/control_dependency_1/_2721 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:3", send_device_incarnation=1, tensor_name="edge_3317_tower_3/gradients/tower_3/conv1/BiasAdd_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] Caused by op u'tower_3/gradients/tower_3/pool1_grad/MaxPoolGrad', defined at: File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 342, in tf.app.run() File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run sys.exit(main(sys.argv)) File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 338, in main train() File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 182, in train grads = opt.compute_gradients(loss) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 241, in compute_gradients colocate_gradients_with_ops=colocate_gradients_with_ops) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients.py", line 481, in gradients in_grads = _AsList(grad_fn(op, *out_grads)) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/nn_grad.py", line 251, in _MaxPoolGrad data_format=op.get_attr("data_format") File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 710, in _max_pool_grad data_format=data_format, name=name) File 
"/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op op_def=op_def) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2154, in create_op original_op=self._default_original_op, op_def=op_def) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1154, in init self._traceback = _extract_stack()

...which was originally created as op u'tower_3/pool1', defined at: File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 342, in tf.app.run() [elided 1 identical lines from previous traceback] File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 338, in main train() File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 173, in train loss, mse, mymse, logits, labels = tower_loss(scope) File "/home/pmobade/project_latest/12july2016/grey/cnn_multi_gpu_train.py", line 69, in tower_loss logits = cnn.inference(images) File "/home/pmobade/project_latest/12july2016/grey/cnn.py", line 184, in inference padding='SAME', name='pool1') File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 341, in max_pool name=name) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 677, in _max_pool data_format=data_format, name=name) File "/home/pmobade/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op op_def=op_def)

oroojlooy commented 8 years ago

Hi, how did you solve the problem? I have a similar problem on an NVIDIA K80 with 11.25 GB of RAM.

NickShahML commented 8 years ago

It would be amazing if there were some way to calculate how much memory each tensor is using, including the backprop computations.
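
In the meantime, here is a rough sketch of estimating forward-pass tensor sizes from static shapes (it ignores gradients, workspaces, and allocator overhead, so it is only a lower bound):

import numpy as np
import tensorflow as tf

def estimate_graph_bytes(graph):
    # Sum the bytes of every op output whose shape is fully known.
    # Backprop tensors and temporary workspaces are not counted.
    total = 0
    for op in graph.get_operations():
        for t in op.outputs:
            shape = t.get_shape()
            if not shape.is_fully_defined():
                continue  # skip tensors with unknown dimensions
            if not (t.dtype.is_floating or t.dtype.is_integer):
                continue  # skip string/resource/bool outputs
            total += int(np.prod(shape.as_list())) * t.dtype.size
    return total

print(estimate_graph_bytes(tf.get_default_graph()) / 1e6, 'MB of forward tensors')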

asanakoy commented 8 years ago

Hi, I use TensorFlow v0.11.0 RC0. A simple AlexNet model training with batch size 128 eats up to 4800 MiB (with config.gpu_options.allow_growth = True),
while the same model in Caffe with batch size 128 takes roughly 2700 MiB.

And it is not fast either (~2x slower than Caffe).

tiagofrepereira2012 commented 7 years ago

I'm having a similar issue here.

VGG16 (which is around 500MB) with a batch size of 128 samples, and I run out of memory. I have a Tesla K40m with 11GB of memory.

How can I deal with that?

Thanks for any answers.

N2ITN commented 7 years ago

Same problem. I can't load the VGG model with my GTX 970; I get a 'ResourceExhaustedError: OOM' before the batching even starts.

globalcaos commented 7 years ago

Similar problem here: Keras VGG16 running out of memory with a Quadro K1000M (2GB). I believe it then falls back to executing on the CPU. However, it works with a GeForce GTX 960 (4GB). Batch size = 32.

absudabsu commented 7 years ago

Similar problem here. I'm not training on a GPU, but my machine has 64GB of RAM. I have to adjust the batch size, or else I get a Segmentation fault (core dumped) error on the optimizer (backprop) step.

However, when monitoring the RAM usage (e.g. with htop), it never climbs above 4GB. So I am inclined to believe this is a bug, with TensorFlow overestimating the memory required. Backprop should scale linearly with the batch size, and I do see that until it faults at an arbitrary point.

Any ideas?! This is severely limiting my ability to train large models. Using a small batch size is just not going to cut it -- the estimate of the gradient direction is very poor for large models if the batch size is too small.

ashtawy commented 7 years ago

I had a similar problem and solved it by releasing the GPU memory used by another process. Use the command nvidia-smi to check the GPU's memory usage. If it is not 0MiB, kill the process that is allocating the GPU's memory, even if that process is sitting idle. Then run the script that trains your TF model.

In many cases, when a process (your TF training program, for example) does not end normally, say due to an error, it holds on to the memory allocated for it until the process is manually killed (kill -9 PID).

N2ITN commented 7 years ago

@ashtawy Absolutely. I'd also recommend using watch nvidia-smi to get a live view of GPU memory utilization.

mijung-kim commented 7 years ago

I also have a very similar issue -- no solutions yet? Thanks in advance for resolving this issue, though!

kiranvaidhya commented 7 years ago

Is anybody still working on this issue? I get the following error out of nowhere when I'm training a network on multiple GPUs, akin to the CIFAR-10 tutorial.

E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes           
F tensorflow/core/common_runtime/gpu/gpu_device.cc:104] EigenAllocator for GPU ran out of memory when allocating 0. See error logs for more detailed info.                                  
Aborted (core dumped)
yaroslavvb commented 7 years ago

@kvrd18 I'm working on it (I'm not affiliated with the TensorFlow team). Finding a memory-efficient execution order is an NP-hard problem, but there are usable heuristics.

One piece is a better node scheduling algorithm -- https://github.com/yaroslavvb/stuff/tree/master/linearize

This algorithm adds control dependencies to force a particular execution order on a graph. In particular, a graph like the one shown in that repo will always use constant memory to execute after running through linearize. (TensorFlow's default algorithm uses memory proportional to the length of the graph in the worst case.)
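
The underlying idea, as a toy sketch with hand-written control dependencies (not the linearize tool's actual API):

import tensorflow as tf

# Toy sketch: without the control dependency, TF is free to materialize both
# large tensors at once; with it, ops created inside the block cannot start
# until a_sum has finished, so the two big temporaries are never alive together.
a = tf.random_normal([4096, 4096])
a_sum = tf.reduce_sum(a)
with tf.control_dependencies([a_sum]):
    b = tf.random_normal([4096, 4096])
    b_sum = tf.reduce_sum(b)
total = a_sum + b_sum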

asanakoy commented 7 years ago

Try to set a session config:

session_config = tf.ConfigProto(log_device_placement=False, allow_soft_placement=True)
# don't let the process grab all of the GPU memory
session_config.gpu_options.per_process_gpu_memory_fraction = 0.90
sess = tf.Session(config=session_config)
yaroslavvb commented 7 years ago

@kvrd18 BTW, for tracking down out of memory errors, here are two tools to make it easier:

  1. https://github.com/yaroslavvb/memory_probe_ops -- it's a tensorflow op you can insert somewhere in the graph and evaluate in sess.run to get the amount of memory allocated at that point in time

  2. https://github.com/yaroslavvb/memory_util -- this tool lets you see timeline of all tensor allocations and deallocations.

After looking at out-of-memory situations in various large networks (DenseNet, PixelNet), I found no "low-hanging fruit" aside from the "linearize" tool mentioned above. I.e., TensorFlow only allocates memory for tensors that are required for the computation, and this memory is released as soon as the tensor has been consumed.

Out-of-memory situations tend to occur while computing gradients. If you have a computation with k operations in a sequence, each operation producing B bytes, then you need to save all their outputs in order to compute backprop. This means k*B bytes of peak memory. This is different from forward prop, where older tensors can be discarded. So, for instance, a network with 100 operations, each producing 1GB of data, only needs about 2GB to run forward prop, but 100GB to do backprop.

There is some "higher-hanging fruit" I'm looking at at the moment, in particular discarding some parts of the computation and then recomputing them. One way to do this in current TensorFlow is to wrap blocks into functions; that way, the intermediate values inside each function block are recomputed and hence don't need to be stored in memory. Here's an example -- https://github.com/yaroslavvb/notebooks/blob/master/saving%20memory%20by%20using%20functions.ipynb
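
A minimal sketch of the function-wrapping trick (the shapes and ops here are made up; see the notebook above for the real example):

import tensorflow as tf
from tensorflow.python.framework import function

# Sketch: the block is treated as a single Defun node for backprop, so its
# intermediate activations (h1, h2) are recomputed during the gradient pass
# rather than being stored for the whole forward pass.
@function.Defun(tf.float32)
def block(x):
    h1 = tf.tanh(x)
    h2 = tf.tanh(h1)
    return tf.tanh(h2)

x = tf.placeholder(tf.float32, [None, 1024])
y = block(x)
loss = tf.reduce_sum(y)
grad = tf.gradients(loss, [x])[0]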

asanakoy commented 7 years ago

@yaroslavvb, thanks for the great tips!

One way to do this in current TensorFlow is to wrap blocks into functions; that way, the intermediate values inside each function block are recomputed and hence don't need to be stored in memory. Here's an example

That's a bit counterintuitive. How is it controlled? Why does TF discard the intermediate values inside the function (even though all the intermediate ops are added to the graph anyway)?

yaroslavvb commented 7 years ago

@asanakoy a function.Defun node is treated as a single TensorFlow node for the purposes of backprop. I.e., it is similar to what happens within a single op launch -- there may be intermediate temporary values that could have been useful for the backprop kernel, but they get recomputed rather than reused. I haven't checked precisely how it's implemented with function.Defun, but I suspect that the backprop graph contains a copy of the function.Defun graph with the same input as the original function.Defun.

JStech commented 7 years ago

@yaroslavvb, I have a question about backprop, based on your comment above. For a given layer, don't we only need to keep the gradients on the outputs long enough to calculate the gradients on the inputs (and weights)?

yaroslavvb commented 7 years ago

Correct -- the gradients (backprop values) can be released quickly and don't affect peak memory much. It's the activations (forward-prop values) that have to be kept in memory for a long time. Here's a diagram of how things work -- https://github.com/tensorflow/tensorflow/issues/4359#issuecomment-269241038

asanakoy commented 7 years ago

@yaroslavvb, where is function.Defun defined? I can't work out how you imported it.

yaroslavvb commented 7 years ago

That example is incomplete, but this information can easily be found on the internet (here's a complete example: https://github.com/yaroslavvb/stuff/blob/master/node-merge.ipynb)

zhangqianhui commented 7 years ago

@yaroslavvb @asanakoy My program uses TensorFlow, and its memory consumption grows as the iterations go on. Can you help me explain this?

yaroslavvb commented 7 years ago

@zhangqianhui there are many ways for memory usage to grow without it being a bug in TensorFlow (i.e., your training may be requesting increasing amounts of memory from TF). You could be allocating new variables, or modifying your graph between iterations. One possible (but unverified) explanation is that if you request many different sizes of tensors, memory fragmentation could grow, which increases memory usage.
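
A quick way to rule out the graph being modified between iterations is to finalize it before the training loop; here is a sketch (build_model and run_one_step are placeholders for your own code):

import tensorflow as tf

# Sketch: once the graph is finalized, any op accidentally created inside the
# training loop raises an error instead of silently growing the graph (and
# host memory) every iteration.
train_op = build_model()          # placeholder for your model-building code
sess = tf.Session()
sess.run(tf.global_variables_initializer())
tf.get_default_graph().finalize()

for step in range(1000):
    run_one_step(sess, train_op)  # placeholder for your per-step code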

zhangqianhui commented 7 years ago

@yaroslavvb Thanks

zhangqianhui commented 7 years ago

@yaroslavvb To be clear, I mean CPU memory, not GPU memory.

zhangqianhui commented 7 years ago

I have found the reason. In my program, every iteration calls this function:

def sample_prior(self, batch_size):
        ret = []
        for dist_i in self.dists:
            ret.append(tf.cast(dist_i.sample_prior(batch_size), tf.float32))
        return tf.concat(1, ret)

@yaroslavvb This function consumes my GPU memory. Do you know why, and how to solve it?
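
(Not your exact code, but the usual fix is to build the sampling op once, before the training loop: every call to tf.cast/tf.concat adds new nodes to the graph, so calling sample_prior() each iteration makes the graph, and its memory, grow without bound. A sketch, with model, sess, batch_size and num_steps standing in for your own objects:)

# model, sess, batch_size and num_steps are placeholders for your own objects.
# Build the sampling op exactly once...
prior_sample = model.sample_prior(batch_size)

# ...and reuse it inside the loop; sess.run does not create new graph nodes.
for step in range(num_steps):
    batch = sess.run(prior_sample)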

kiranvaidhya commented 7 years ago

@yaroslavvb Does the current TensorFlow version release the gradOutputs and the activations from memory as soon as the gradInputs are computed? Peak memory would be reached when the first gradInput is computed, if I'm correct.

yaroslavvb commented 7 years ago

Memory is released as soon as the tensor is not needed by any downstream consumers. So it depends how your gradInputs/gradOutputs are wired (i.e., you can rewire them to be more memory efficient, like here).

Even though TensorFlow releases memory right after it's no longer needed, there's nothing forcing TensorFlow to allocate memory right before it's needed. So TF can compute an op early and hold its output in memory longer than necessary, which can increase peak memory. This greedy approach favors speed in a multi-device setting over memory efficiency.

In a single-device setting, a simple heuristic of executing an op as late as possible saves memory without affecting computation speed. I use this utility to force TensorFlow to compute ops as late as possible -- https://github.com/yaroslavvb/stuff/blob/master/linearize

botev commented 7 years ago

@yaroslavvb When executing on the GPU, is there any way to interrogate the current memory pool allocated by TensorFlow when using gpu_options.allow_growth = True? The full pool of the device memory manager must be kept somewhere, but how can we access that number (I mean without logging it to some file)?

yaroslavvb commented 7 years ago

@botev you can use https://github.com/yaroslavvb/memory_util to see a timeline of all allocations/deallocations (CPU/GPU), and you can use https://github.com/yaroslavvb/memory_probe_ops to query memory usage at a given point during a session.run call (GPU only)

botev commented 7 years ago

@yaroslavvb I'm assuming that if I run the probe op in a session together with the computation of a model, it would return the peak memory usage. Is that correct?

yaroslavvb commented 7 years ago

It would give you the memory usage at the time the op was executed. If you want peak memory usage, you can use MaxBytesInUse, added in https://github.com/tensorflow/tensorflow/commit/ccf9a752

from tensorflow.contrib.memory_stats.python.ops import memory_stats_ops
max_bytes_in_use = sess.run(memory_stats_ops.MaxBytesInUse())
ericloveland commented 7 years ago

I have been trying to find the largest models my box will run (neural_gpu), given that it has 128GB of RAM and three 1080 GPUs, but TensorFlow seems to run out of memory arbitrarily when it is only using about 71.5GB (minus a bit for the OS). The dumps say that the total allocs = 64GB, so I wonder if it is self-limiting to 64GB somewhere; when I built TensorFlow, the box only had 64GB. Swap stays at zero. Would rebuilding TensorFlow help, or is there a setting that could increase this?

ericyuli commented 7 years ago

I also face memory issues when I train a transfer-learning-based model. I work around it by freezing some layers with pre-trained weights and fine-tuning layer by layer. That works. But when I train the entire model, the large number of trainable parameters still causes OOM on AWS.
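
For what it's worth, a sketch of the freezing trick via the optimizer's var_list (the 'head' scope and the loss tensor are illustrative, not from my actual model):

import tensorflow as tf

# Sketch: train only the variables under the 'head' scope; no gradients are
# built for the frozen backbone, which reduces both compute and memory.
# 'head' and 'loss' are illustrative placeholders.
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='head')
optimizer = tf.train.AdamOptimizer(1e-4)
train_op = optimizer.minimize(loss, var_list=train_vars)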

yaroslavvb commented 7 years ago

@ericloveland there were some memory issues in older versions; maybe trying the latest TF would help.

ericloveland commented 7 years ago

@yaroslavvb I'm running tf 1.0 from about a month ago... are you saying there have been relevant changes since 1.0 came out?? Thanks!

yaroslavvb commented 7 years ago

@ericdnielsen In the last two weeks there was a change that made all cwise (element-wise) ops run in place when possible, so that will reduce usage. But I suspect you have a deeper problem, with your architecture requiring too much memory to evaluate in some cases.

It's weird that you are getting dumps for running out of CPU memory. On my machine, TensorFlow just uses as much CPU memory as is available, until the machine freezes.

The order of execution is random, so different runs may use different amounts of memory. You could use my linearize util to fix a single memory-efficient order and debug your excessive memory usage from there (e.g., using http://github.com/yaroslavvb/memory_util).

Shoshin23 commented 7 years ago

I face a similar memory issue while trying to train my TensorFlow model. I use a Tesla K80, and the error I get is "Ran out of memory trying to allocate 102.65MiB." Is there any general way to avoid this?

ericloveland commented 7 years ago

I think it was an architecture issue. I chose some different params (smaller nmaps and more layers) and TensorFlow ran happily, even when it had to start using the swap file. It appears that nmaps size is limited by GPU memory, while the number of layers is limited by CPU memory, other things (batch size and sample size) being equal. The seemingly sporadic out-of-memory errors appear to occur when the model hits a larger sample and sometimes needs to create a larger RNN model/cell under the covers, I think. Thanks!

ericloveland commented 7 years ago

You could try shrinking your batch size down to 1 and then shrinking other params until the memory issues stop. So far it seems that models are limited by GPU memory for some params and by CPU memory for others.

botev commented 7 years ago

@yaroslavvb It seems that memory_stats_ops.MaxBytesInUse() is wrong. I'm running a single TensorFlow instance, and it is the only thing running on that GPU. Under nvidia-smi, 8459MB of memory is used by the TensorFlow process, while the op returns 4254MB. I have no idea what is going on, but this looks wrong.

memory_stats_ops = tf.load_op_library(memory_stats_ops_loc)
stats_op = memory_stats_ops.max_bytes_in_use()
memory = session.run((train, cost, stats_op), feed_dict=feed_dict)[2] // (1024 * 1024)

PS: I'm loading it as an external library, since we are using r1.0, which does not have the op, and we cannot use non-stable releases.

yaroslavvb2 commented 7 years ago

TensorFlow uses its own caching allocator, so nvidia-smi gives an incorrect result.

botev commented 7 years ago

@yaroslavvb2 Are you 100% sure? On smaller feed-forward models we observe TensorFlow reporting about half the memory usage of Theano and Torch, but on larger models it reports the same. Additionally, we run everything with the allow_growth flag, so there is no reason why TensorFlow should allocate 2x more memory than needed, is there?

yaroslavvb commented 7 years ago

@botev correct -- even with allow_growth, the memory numbers reported by nvidia-smi do not represent the memory actually in use.

Shaofanl commented 7 years ago

My solution is somewhat of a workaround. Build your network with the standard Keras backend API. During development, use TensorFlow as the backend, because compiling a model in Theano takes more time than you can endure, especially when the optimization option is on. You might suffer from a small batch size and slow computation with TensorFlow, but all you have to do is make sure your network structure is correct. After that, you can simply switch to the Theano backend in Keras's configuration file. Since you used the standard Keras backend when building the network, all your TensorFlow API calls will automatically switch to the corresponding Theano ones. It might take a while to compile the model in Theano, but you can then use a larger batch size and enjoy faster computation.

Note that Theano and TensorFlow differ in how data is arranged, such as the axis of the filter channels. I use Theano's ordering in both development and deployment because it speeds up Theano's computation. Theano also does not have versatile tools like TensorFlow's TensorBoard, so if you want to visualize or analyze the trained model, feel free to switch back to TensorFlow. And if some function is not implemented by the Keras backend, you can write a simple conditional to handle it, such as for the eigenvalue decomposition of a self-adjoint matrix:

  import keras.backend as K
  # K.eig() does not exist, so dispatch on the backend explicitly
  if K.backend() == 'theano':
    eig, eigv = K.theano.tensor.nlinalg.eig(matrix)
  elif K.backend() == 'tensorflow':
    eig, eigv = K.tf.self_adjoint_eig(matrix)
  else:
    raise NotImplementedError