Closed apaszke closed 9 years ago
It speeds up training a lot. The grad/param norm ratio drops from around 3e-01 to 4e-03, which seems much better.
Thanks for reporting this. This is in fact a bug. The loss should be normalized by both batch size and sequence length, and the gradient of course has to be normalized with it, otherwise it wouldn't even pass a gradient check. I see that I normalize the loss on line 246 with `loss = loss / opt.seq_length`, but I never scale the gradient, as you suggested.
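To make the factor concrete, here is a standalone Python toy (an illustration of mine, not code from the repo): if the reported loss is the per-step average but the backward pass accumulates un-normalized per-step gradients, the resulting gradient is exactly seq_length times the gradient of the reported loss.

```python
# Toy: quadratic per-step loss l_t(w) = 0.5 * (w - t)^2 over T timesteps.
# Reporting loss/T while backpropagating the un-normalized sum makes the
# gradient T times larger than the gradient of the reported (averaged) loss.
def grads(w, T):
    g_sum = sum(w - t for t in range(T))  # gradient of the summed loss
    g_mean = g_sum / T                    # gradient of the averaged loss
    return g_sum, g_mean

for T in (10, 50):
    gs, gm = grads(0.0, T)
    print(T, gs, gm, gs / gm)  # the ratio is exactly T
```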
Okay, we have to fix this. My only concern is that the hyperparameters I've chosen as defaults could now be suboptimal once we fix this issue.
They probably will be wrong. The learning rate is probably well suited for a sequence length of 50, but not for any other. After the fix, though, the defaults may be more universal ;)
Okay, thanks a lot for creating the issue. I will work on the fix and rerun cross-validation over the learning rate magnitude so that the default params work well.
No problem! Thanks for an awesome LSTM implementation! :+1: I've always thought LSTMs were hard to learn, but just reading through your code did the job. There are hardly any good tutorials on them.
Thanks. I know there are very few materials :( I started working on a video tutorial as a companion to char-rnn that is a more detailed walk-through of the model/math/code. Except it's going a bit slowly due to many other distractions (research etc.). Ugh. Anyways, thanks!
Nice! The code is well commented, so it was quite easy to get it all anyway (as long as you know some Torch and Lua). I've also found the nngraph annotations useful for understanding what is going on inside. It might be worthwhile to add them to the model. I could do a pull request if you wish.
Sure, if you happen to find some time annotations would be nice!
On second thought, I think you may not have to do the learning rate search. The only change to the gradient is dividing by 50 (the default seq_length), so simply multiplying the default learning rate by 50 should be enough (provided the clamp on the gradient doesn't affect the training too much).
It's more complicated than that. Gradients don't scale linearly with seq_length as you indicated. They might near the beginning of optimization, but after that it gets funny, because the positive and negative gradients cancel each other out and you end up with more of a sqrt()-like effect, I think. I'll rerun it, but a scale of roughly 10x is what I'd expect. The RMSProp dynamics might also make things more complicated.
True, I hadn't thought about alternating gradient signs. :confused:
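As an aside, the cancellation effect described above can be sketched numerically (a Python toy of mine, not code from the thread or the repo): summing T same-sign per-timestep terms grows like T, while summing T mixed-sign terms grows roughly like sqrt(T).

```python
import math
import random

random.seed(0)

def grad_norm(T, signed):
    # Sum T per-timestep "gradient" contributions; with random signs
    # the terms largely cancel instead of accumulating.
    total = 0.0
    for _ in range(T):
        g = random.random()
        total += g * random.choice([-1, 1]) if signed else g
    return abs(total)

# Same-sign sums grow ~T; mixed-sign sums grow ~sqrt(T), so both
# normalized ratios below stay roughly constant as T increases.
for T in (50, 200, 800):
    same = sum(grad_norm(T, False) for _ in range(200)) / 200
    mixed = sum(grad_norm(T, True) for _ in range(200)) / 200
    print(T, round(same / T, 2), round(mixed / math.sqrt(T), 2))
```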
Hi guys, I found that the training loss wouldn't decrease with the default parameters on the tinyshakespeare dataset. As the following log shows, after 1236 epochs the training loss is about 1.13 while the validation loss is 1.47. Is this still underfitting, or might this be caused by the gradient bug in this issue?
```
522999/2115000 (epoch 1236.404), train_loss = 1.13484775, grad/param norm = 9.7163e-02, time/batch = 0.21s
evaluating loss over split index 2
1/23...
2/23...
3/23...
4/23...
5/23...
6/23...
7/23...
8/23...
9/23...
10/23...
11/23...
12/23...
13/23...
14/23...
15/23...
16/23...
17/23...
18/23...
19/23...
20/23...
21/23...
22/23...
23/23...
saving checkpoint to cv/lm_lstm_epoch1236.41_1.4733.t7
523000/2115000 (epoch 1236.407), train_loss = 1.09940496, grad/param norm = 9.8000e-02, time/batch = 0.18s
```
I don't think this issue affects your problem. The learning rate was chosen with cross-validation using the default sequence length, so as long as you didn't change it, everything should work fine.
Having the validation loss much higher than the training loss is a sign of overfitting, not underfitting. If you're still unsatisfied with the performance you can increase the model size, but since you're already overfitting you should probably introduce some dropout as well.
Ok, now that I wrote out the RMSProp update I realized that scaling the gradients by a constant is a no-op, because the accumulating denominator in the RMSProp dynamics ends up canceling the scale in the gradient. This means that doing `grad_params:div(a)` for any scalar `a` is a no-op (not exactly, because of the epsilon, but in any case it's mostly a matter of numerics). Does that make sense? In this light I'm inclined to leave the current behavior untouched.
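This scale invariance is easy to verify with a minimal Python sketch of the RMSProp update (a simplification for illustration; the function and variable names are mine, not from char-rnn's optim code): dividing every gradient by `a` also divides the square root of the running second-moment estimate by `a`, so the two cancel in the update, up to the epsilon term.

```python
# Minimal scalar RMSProp; illustrative only, not char-rnn's implementation.
def rmsprop_updates(grads, lr=1e-2, alpha=0.95, eps=1e-8):
    m, x = 0.0, 0.0
    xs = []
    for g in grads:
        m = alpha * m + (1 - alpha) * g * g   # running mean of squared grads
        x -= lr * g / (m ** 0.5 + eps)        # normalized step
        xs.append(x)
    return xs

grads = [0.4, -0.2, 0.9, -0.5]
plain = rmsprop_updates(grads)
scaled = rmsprop_updates([g / 50.0 for g in grads])
# g/a cancels against sqrt(m)/a, so the trajectories (almost) coincide.
for p, s in zip(plain, scaled):
    print(round(p, 6), round(s, 6))
```

The residual difference comes only from `eps`, which is why the cancellation is "not entirely exact".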
You're right. I've thought it through and it makes perfect sense for RMSProp. I've checked it now and it runs fine even without this line. I think I had problems because in the past I was using other optimization algorithms (probably momentum), which required different learning rates for different lengths. Still, I think it's better to include this line just for the sake of correctness, since many people read this code and use it as a starting point for their own implementations. The fix is really simple, but it's up to you :blush:
Good point. I opted to include a comment but I'm leaving the code unchanged (see https://github.com/karpathy/char-rnn/commit/4297a9bf69726823d944ad971555e91204f12ca8).
Closing the issue. Thanks for bringing this up!
Hi! I've used char-rnn as a base for a different learning task, which requires me to train the networks on much longer sequences. However, I quickly realized that the training is very unstable and requires a ridiculously small learning rate. After some fiddling I tried different sequence lengths and got the following results:
Apparently the gradient grows approximately linearly with the sequence length. Fortunately, the fix is quite simple: it's enough to add `grad_params:div(opt.seq_length)` at train.lua:267. I'm not sure if it's also a problem for char-rnn's use case, but it seems logical that the gradient should be of a similar magnitude regardless of the seq_length. After adding the line I'm getting the following results: