pender / chatbot-rnn

A toy chatbot powered by deep learning and trained on data from Reddit
MIT License

Ran out of memory error with 12 GB of GPU memory for 2 MB of training data? #9

Closed hanfeisun closed 6 years ago

hanfeisun commented 7 years ago

I encountered the following error during the training stage. Does anyone have any ideas about what causes this?

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 34.33MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[3000,3000]
hanfeisun commented 7 years ago

I ran training on the scotus.bz2 data by typing python train.py. The program runs on an AWS p2.8xlarge instance (12 GB of GPU memory per GPU).

It throws the following exception:

Limit:                 11330676327
InUse:                 11035230976
MaxInUse:              11038110976
NumAllocs:                   10139
MaxAllocSize:             71999744

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 34.33MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[3000,3000]
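
As an aside on the numbers above (an editorial sketch, not code from this repo): a float32 tensor of shape [3000, 3000] needs 3000 × 3000 × 4 bytes, which is exactly the 34.33 MiB the allocator reports, and the gap between Limit and InUse is only a few hundred MiB, so the memory pool is essentially full when the allocation fails.

# Rough sanity check of the log values (illustrative only, not chatbot-rnn code)
FLOAT32_BYTES = 4
tensor_bytes = 3000 * 3000 * FLOAT32_BYTES      # the [3000, 3000] tensor
print(tensor_bytes / 2**20)                     # ~34.33 MiB, matching the failed allocation

limit, in_use = 11330676327, 11035230976        # Limit and InUse from the log above
print((limit - in_use) / 2**20)                 # ~282 MiB nominally free
# The allocation can still fail because the BFC allocator needs a contiguous
# block, and the remaining free space may be fragmented.
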
julien-c commented 7 years ago

Not a direct answer to your issue, but I don't think this lib is parallelized across multiple GPUs, so a p2.8xlarge behaves just like a p2.xlarge here.
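
To illustrate that point (a generic TensorFlow 1.x sketch, not code from this repo): unless ops are explicitly pinned to other devices with tf.device, the whole graph is placed on a single GPU, so the extra GPUs on a p2.8xlarge add no usable memory for this model.

import tensorflow as tf

# Generic TF 1.x illustration: everything below is placed on one GPU.
with tf.device('/gpu:0'):
    a = tf.random_normal([3000, 3000])
    b = tf.random_normal([3000, 3000])
    c = tf.matmul(a, b)

# log_device_placement prints which device each op actually runs on.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(c)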

hanfeisun commented 7 years ago

@julien-c Thanks for the information! I had thought that using a p2.8xlarge could solve the issue.

When I set batch_size to 10 and rnn_size to 50, the out-of-memory issue disappears.
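
For reference, overriding those values from the command line would look roughly like the following (assuming train.py exposes these hyperparameters as command-line flags in the usual char-rnn style; the exact flag names are an assumption, so check python train.py --help):

python train.py --batch_size 10 --rnn_size 50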

It would be great if the default parameters could be set to fit an ordinary GPU's memory.

pender commented 6 years ago

The default parameters are set to obtain the best overall results that I could, pushing my GPU to the limit. It really makes a difference with a chatbot: char-rnn is probably less sensitive to performance because it's not interactive, while the chatbot gets a lot more responsive and topical as performance improves.