sherjilozair / char-rnn-tensorflow

Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using TensorFlow
MIT License

How to reduce GPU memory? #58

Open ckcz123 opened 7 years ago

ckcz123 commented 7 years ago

What a wonderful project! I have used it to solve several problems, but there is one issue that keeps bothering me.

In one of my cases, I have to use rnn_size=512, num_layers=2, seq_length=1200, with the other arguments being batch_size=10, num_epochs=50, grad_clip=5.0, and so on. With these settings, the model allocates 7.23 GiB of GPU memory, but my GPU has only 8 GB free. So I wonder whether I can reduce the GPU memory usage to 7 GiB or less; if so, I could run it on the GPU. rnn_size, num_layers, and seq_length cannot be modified.

Here is some of the output.

```
I tensorflow/core/common_runtime/bfc_allocator.cc:689] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 22 Chunks of size 256 totalling 5.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 512 totalling 2.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 7499 Chunks of size 2048 totalling 14.65MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1087 Chunks of size 4096 totalling 4.25MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 4608 totalling 4.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 6144 totalling 6.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 616 Chunks of size 8192 totalling 4.81MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 9984 totalling 9.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 10240 totalling 40.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 2 Chunks of size 12288 totalling 24.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 303 Chunks of size 14336 totalling 4.14MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 198656 totalling 970.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 208384 totalling 203.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 919 Chunks of size 8388608 totalling 7.18GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 10775552 totalling 10.28MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 14428160 totalling 13.76MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 7.23GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats: Limit: 7967745639 InUse: 7764832256 MaxInUse: 7764842496 NumAllocs: 60834 MaxAllocSize: 14428160
```

```
W tensorflow/core/common_runtime/bfc_allocator.cc:270] ****
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 8.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:968] Resource exhausted: OOM when allocating tensor with shape[1024,2048]
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 8.00G (8589934592 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 8.00G (8589934592 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
```

Sorry for my poor English, and thanks a lot!
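
As background (not discussed in this thread): TF 1.x reserves nearly all GPU memory up front by default, and the session's GPU options can make it allocate on demand or cap the fraction it claims. A minimal sketch follows; note these knobs only change how much TensorFlow grabs, not how much the unrolled graph actually needs, so they cannot fix a genuine 7.23 GiB requirement.

```python
import tensorflow as tf

# Minimal sketch, TF 1.x API: allocate GPU memory on demand instead of
# reserving the whole card, and optionally cap the fraction TF may claim.
# This changes allocation behavior only; it cannot shrink a graph that
# genuinely needs 7.23 GiB of tensors.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # optional hard cap

sess = tf.Session(config=config)
# ... build the model and run the training loop as usual ...
```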

ckcz123 commented 7 years ago

Also, why does it need 7.23 GiB of memory? Can anyone explain?

fujimotomh commented 7 years ago

I think your sequence length is the problem; 1200 is quite long. TensorFlow has a known issue with the way it builds these statically unrolled graphs using seq2seq (see this issue). You could try rewriting the model with dynamic_rnn, as suggested in the comments there. A quick fix might be to lower your batch size, though it is already low, so your loss may get noisier.
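
For intuition on where the memory goes, here is a rough back-of-the-envelope reading of the allocator log above (my own sketch, assuming TF's standard LSTM gate-weight layout; the thread itself does not spell this out):

```python
# The OOM tensor shape [1024, 2048] matches an LSTM gate matrix for rnn_size=512:
# rows = input + hidden = 512 + 512 = 1024, cols = 4 gates * 512 = 2048.
chunk = 1024 * 2048 * 4       # float32 -> 8,388,608 bytes, the 8 MiB chunk size in the log
print(919 * chunk / 2**30)    # 919 such chunks -> ~7.18 GiB, the bulk of the usage
```

Statically unrolling the graph over seq_length=1200 steps (times num_layers=2) materializes per-step tensors and gradients of this shape, which is plausibly why a looping implementation like dynamic_rnn cuts the usage so dramatically.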

ckcz123 commented 7 years ago

@fujimotomh Thank you for your reply. But how can I modify the code? I'm new to TensorFlow.

I just tried using `outputs, last_state = tf.nn.rnn(cell, inputs, initial_state=self.initial_state, scope='rnnlm')` instead of `outputs, last_state = seq2seq.rnn_decoder(inputs, self.initial_state, cell, loop_function=loop if infer else None, scope='rnnlm')`, and it works! (But I don't know whether the result is correct.)

But when I tried `outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=self.initial_state, scope='rnnlm')`, it threw `ValueError: Dimension must be 2 but is 3`.

So how can I modify the code?
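
For context on the error: in this generation of TensorFlow, tf.nn.rnn and seq2seq.rnn_decoder take a Python list of 2-D [batch, input] tensors, while tf.nn.dynamic_rnn takes one 3-D [batch, time, input] tensor. A sketch of the two shapes, paraphrasing how the original model.py builds its inputs:

```python
# List form, roughly as the original model.py builds it for rnn_decoder /
# tf.nn.rnn (TF <= 0.12 argument order: tf.split(split_dim, num_split, value)):
embedded = tf.nn.embedding_lookup(embedding, self.input_data)  # [batch, seq_length, rnn_size]
inputs = tf.split(1, args.seq_length, embedded)                # seq_length x [batch, 1, rnn_size]
inputs = [tf.squeeze(inp, [1]) for inp in inputs]              # seq_length x [batch, rnn_size]

# dynamic_rnn wants the 3-D `embedded` tensor directly; handing it the
# per-step list above is the likely source of the dimension error.
```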

fujimotomh commented 7 years ago

@ckcz123 You almost have it. `dynamic_rnn` takes its input as a single tensor, not a list. This works on my laptop with a `seq_length` of 1200:

```python
outputs, last_state = tf.nn.dynamic_rnn(cell,
                                        tf.nn.embedding_lookup(embedding, self.input_data),
                                        initial_state=self.initial_state,
                                        scope='rnnlm')
```

To confirm correctness, I think the best thing to do would be to run it with the default parameters and see if you get a low loss on the training set. I would expect it to work, though, since rnn_decoder and dynamic_rnn are documented to compute the same thing.
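
One follow-up detail when making this swap (my addition, assuming the surrounding code follows the original model.py): rnn_decoder returns a list of per-step [batch, rnn_size] outputs that the model concatenates before the softmax, whereas dynamic_rnn returns a single [batch, seq_length, rnn_size] tensor, so the reshape into logits changes slightly:

```python
# Before, with rnn_decoder / tf.nn.rnn (outputs is a list; TF <= 0.12
# concat signature is tf.concat(concat_dim, values)):
output = tf.reshape(tf.concat(1, outputs), [-1, args.rnn_size])

# After, with dynamic_rnn (outputs is already one 3-D tensor):
output = tf.reshape(outputs, [-1, args.rnn_size])

logits = tf.matmul(output, softmax_w) + softmax_b
```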

ckcz123 commented 7 years ago

@fujimotomh Oh, it works! Only 1.1 GB of GPU memory used! Thanks for your advice!