Closed: hanfeisun closed this issue 6 years ago
I ran the training on the scotus.bz2 data by typing python train.py. The program runs on an AWS p2.8xlarge instance (12 GB of GPU memory per GPU), and it throws the following exception:
Limit: 11330676327
InUse: 11035230976
MaxInUse: 11038110976
NumAllocs: 10139
MaxAllocSize: 71999744
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 34.33MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[3000,3000]
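For what it's worth, the 34.33 MiB figure lines up with a single 3000x3000 float32 tensor (my own back-of-the-envelope arithmetic, not something from the logs):

```python
# Back-of-the-envelope check: a 3000 x 3000 float32 tensor at 4 bytes
# per element is almost exactly the 34.33 MiB the allocator asked for.
size_bytes = 3000 * 3000 * 4
print(size_bytes / 2**20)  # ~34.33 MiB
```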
Not a direct answer to your issue, but I don't think this lib is parallelized over multiple GPUs (so a p2.8xlarge is just the same as a p2.xlarge).
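One related thing worth knowing (general TensorFlow behaviour, not anything specific to this repo): TensorFlow will by default claim memory on every visible GPU even if it only computes on one, so on a p2.8xlarge you can pin the process to a single GPU. A minimal sketch:

```python
# Sketch: restrict TensorFlow to GPU 0 so the other GPUs on a p2.8xlarge
# aren't claimed. The environment variable must be set before tensorflow
# is imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf  # only GPU 0 is visible from here on
```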
@julien-c Thanks for the information! I had thought that using a p2.8xlarge could solve the issue.
When I set batch_size to 10 and rnn_size to 50, the out-of-memory issue disappears.
It would be great if the default parameters could be set to fit an ordinary GPU's memory. An example invocation is shown below.
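For anyone else who hits this: assuming train.py exposes these settings as command-line flags (the exact flag names are my assumption, so check python train.py --help or the argparse block in train.py), the reduced settings can be passed like this:

```
python train.py --batch_size 10 --rnn_size 50
```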
Default parameters are set to obtain the best results I could, pushing my GPU to the limit. It really makes a difference with a chatbot... char-rnn is probably less sensitive to performance because it's not interactive, while the chatbot gets a lot more responsive and topical as performance improves.
I encountered the following error during the training stage. Does anyone have ideas about this?