karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch
11.59k stars · 2.58k forks

Loss is exploding, aborting #169

Closed mehanelson closed 8 years ago

mehanelson commented 8 years ago

I am using an Amazon EC2 instance (Ubuntu 14.04, g2.8xlarge, with GPU). I am running the code on an input.txt file of size 221 MB. After running for 4 days, the code aborted with the following error: Loss is exploding, aborting

Can you please give some ideas/suggestions on how to fix this problem? What should one do to run on large input files? Any thoughts on this are welcome. Thanks!

[I tested the code on smaller input files, and it works perfectly fine]

soumith commented 8 years ago

try lowering the learning rate.

lizhihuan commented 8 years ago

I commented out the check near the end of the code file, `if loss[1] > loss0 * 3 then`, and it runs successfully. The training loss descends as I expected.
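For context, that guard compares the current training loss against the first loss recorded after startup and aborts once it exceeds it by more than 3x. Here is a minimal Python sketch of the same logic (the function name and list-based interface are illustrative; char-rnn's actual check is a few lines of Lua in train.lua):

```python
def should_abort(loss_history, factor=3.0):
    """Illustrative sketch of char-rnn's explosion guard:
    abort when the latest loss exceeds the first recorded
    loss by more than `factor` times."""
    if not loss_history:
        return False
    loss0 = loss_history[0]          # first loss seen after training starts
    return loss_history[-1] > loss0 * factor

# A run whose loss spikes from 0.9 to 17 trips the guard:
print(should_abort([0.9, 0.85, 17.0]))  # True  -> training would abort
print(should_abort([0.9, 0.85, 0.8]))   # False -> training continues
```

Commenting out the Lua check is equivalent to making this function always return `False`: training continues through the spike, for better or worse.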

mehanelson commented 8 years ago

Thank you! I shall try both and get back to you once I get the results.

mehanelson commented 8 years ago

@lizhihuan I tried commenting it and it no longer gives me the error. Thank you!

mehanelson commented 8 years ago

I am running on a huge dataset. I started on July 5th and it is still running; it may run for a couple of weeks. The loss suddenly jumped from 0.9 to 17, which is what triggered the error. After commenting out the statement mentioned by @lizhihuan, the program didn't abort when the same thing happened. The loss came back down considerably, to 3, after reaching 17, but that is still a high value; last time I checked it had climbed to 7 again. So, would lowering the learning rate help, or is there something else I need to do? Thanks!
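Besides the learning rate, char-rnn also clamps gradients elementwise via its `-grad_clip` option (default 5), which is the usual first line of defense against this kind of loss spike. A rough Python sketch of what elementwise clipping does (illustrative only; char-rnn does this in Lua by clamping the flattened gradient tensor):

```python
def clip_gradients(grads, clip=5.0):
    """Illustrative sketch of elementwise gradient clamping:
    each gradient component is forced into [-clip, clip]
    before the parameter update, so a single huge gradient
    cannot blow up the weights."""
    return [max(-clip, min(clip, g)) for g in grads]

print(clip_gradients([0.3, -12.0, 7.5]))  # [0.3, -5.0, 5.0]
```

If the loss still spikes with clipping on, a smaller `-grad_clip` or a lower learning rate are the next knobs to try.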

mehanelson commented 8 years ago

Well, after a lot of reading, it seems I just need to fine-tune the parameters and find the combination that works best. I am playing primarily with the learning rate at the moment. Thanks for all the help :)
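One cheap way to structure that search is a log-spaced grid descending from the default learning rate (char-rnn's default is 2e-3, if I recall correctly). A small illustrative sketch for generating candidates to try one run at a time:

```python
def lr_grid(base_lr=2e-3, steps=4, factor=4.0):
    """Illustrative helper: candidate learning rates on a log grid,
    each `factor` times smaller than the last, starting from
    `base_lr`. Not part of char-rnn; just a way to pick values
    to pass via its -learning_rate flag."""
    return [base_lr / factor**i for i in range(steps)]

print(lr_grid())  # [0.002, 0.0005, 0.000125, 3.125e-05]
```

If the loss explodes at one value and trains stably at the next one down, the sweet spot is usually somewhere in between.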