karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

Gradient magnitude depends on seq_length #70

Closed · closed by apaszke 9 years ago

apaszke commented 9 years ago

Hi! I've used char-rnn as a base for a different learning task, which requires training the networks on much longer sequences. However, I quickly realized that training is very unstable and requires a ridiculously small learning rate. After some fiddling I tried different sequence lengths and got the following results:

| Sequence length | Gradient norm |
| --- | --- |
| 10  | 3.00e+00 |
| 250 | 6.11e+01 |
| 500 | 1.05e+02 |

Apparently the gradient grows approximately linearly with the sequence length. Fortunately the fix is quite simple: it's enough to add `grad_params:div(opt.seq_length)` at train.lua:267. I'm not sure whether this is also a problem for char-rnn's own use case, but it seems logical that the gradient should be of similar magnitude regardless of seq_length.
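
For reference, this is roughly where the extra line would sit at the end of `feval` in train.lua; the surrounding lines (the clamp and the return) are an approximate sketch from memory, not the exact repository code:

```lua
-- sketch of the tail of feval (approximate): make the loss and the gradient
-- per-timestep averages so their magnitude no longer depends on seq_length
loss = loss / opt.seq_length
grad_params:div(opt.seq_length)                   -- the proposed fix
grad_params:clamp(-opt.grad_clip, opt.grad_clip)  -- existing gradient clipping
return loss, grad_params
```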

After adding the line I'm getting the following results:

| Sequence length | Gradient norm |
| --- | --- |
| 10  | 2.93e-01 |
| 250 | 2.85e-01 |
| 500 | 3.00e-01 |

apaszke commented 9 years ago

It speeds up training a lot. The grad/param norm ratio drops from around 3e-01 to 4e-03, which seems much better.

karpathy commented 9 years ago

Thanks for reporting this. This is in fact a bug: the loss should be normalized by both the batch size and the sequence length, and the gradient of course has to be normalized along with it, otherwise it wouldn't even pass a gradient check. I see that I normalize the loss on line 246 with `loss = loss / opt.seq_length`, but I never scale the gradient, as you point out.

Okay, we have to fix this. My only concern is that the hyperparameters I've chosen as defaults could now be suboptimal once we fix this issue.

apaszke commented 9 years ago

They probably will be. The learning rate is likely well suited for a sequence length of 50, but not for other lengths. After the fix, though, the defaults may be more universal ;)

karpathy commented 9 years ago

Okay, thanks a lot for creating the issue. I will work on the fix and re-run cross-validation over the learning rate magnitude so that the default params work well.

apaszke commented 9 years ago

No problem! Thanks for an awesome LSTM implementation! :+1: I had always thought LSTMs were hard to learn, but just reading through your code did the job. There are hardly any good tutorials on them.

karpathy commented 9 years ago

Thanks. I know there are very few materials :( I started working on a video tutorial as a companion to char-rnn that gives a more detailed walk-through of the model/math/code, except it's going a bit slowly due to many other distractions (research etc.). Ugh. Anyways, thanks!

apaszke commented 9 years ago

Nice! The code is well commented, so it was quite easy to grasp it all anyway (as long as you know some Torch and Lua). I've also found nngraph's annotations useful for understanding what is going on inside. It might be worthwhile to add them to the model; I could open a pull request if you wish.

karpathy commented 9 years ago

Sure, if you happen to find some time, annotations would be nice!

apaszke commented 9 years ago

On second thought, you may not have to do the learning rate search. The only change to the gradient is dividing it by 50 (the default seq_length), so simply multiplying the default learning rate by 50 should be enough (assuming the gradient clamp doesn't affect training that much).

karpathy commented 9 years ago

It's more complicated than that. Gradients don't scale linearly with seq_length as you indicated. They might near the beginning of optimization, but after that it gets funny, because the positive and negative gradients cancel each other out and you end up with more of a sqrt()-like effect, I think. I'll re-run it, but a scale of roughly ~10 is expected. The RMSProp dynamics might also make things more complicated.
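
As a self-contained toy (not char-rnn code, just an illustration of the intuition): the norm of a sum of T per-timestep contributions that share a direction grows roughly linearly in T, while a sum of zero-mean contributions that mostly cancel grows more like sqrt(T):

```lua
-- Toy illustration only: compare correlated vs. mostly-cancelling contributions.
require 'torch'

local dim = 100
local direction = torch.randn(dim)  -- shared component, as early in training

local function summed_norm(T, correlated)
  local acc = torch.zeros(dim)
  for t = 1, T do
    local g = torch.randn(dim)
    if correlated then g:add(direction) end
    acc:add(g)
  end
  return acc:norm()
end

for _, T in ipairs({10, 250, 500}) do
  print(string.format('T=%3d  correlated=%8.1f  cancelling=%7.1f  sqrt(T*dim)=%6.1f',
                      T, summed_norm(T, true), summed_norm(T, false),
                      math.sqrt(T * dim)))
end
```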

apaszke commented 9 years ago

True, I hadn't thought about the gradient signs alternating. :confused:

ffmpbgrnn commented 9 years ago

Hi guys, I found that the training loss won't decrease further with the default parameters on the tinyshakespeare dataset. As the following log shows, after 1236 epochs the training loss is about 1.13 while the validation loss is 1.47. Is this still underfitting, or might this be caused by the gradient bug in this issue?

```
522999/2115000 (epoch 1236.404), train_loss = 1.13484775, grad/param norm = 9.7163e-02, time/batch = 0.21s
evaluating loss over split index 2
1/23...
2/23...
3/23...
4/23...
5/23...
6/23...
7/23...
8/23...
9/23...
10/23...
11/23...
12/23...
13/23...
14/23...
15/23...
16/23...
17/23...
18/23...
19/23...
20/23...
21/23...
22/23...
23/23...
saving checkpoint to cv/lm_lstm_epoch1236.41_1.4733.t7
523000/2115000 (epoch 1236.407), train_loss = 1.09940496, grad/param norm = 9.8000e-02, time/batch = 0.18s
```

apaszke commented 9 years ago

I don't think this issue affects your problem. The learning rate was chosen by cross-validation with the default sequence length, so as long as you didn't change it, everything should work fine.

Having the validation loss much higher than the training loss is a sign of overfitting, not underfitting. If you're still unsatisfied with the performance you can increase the model size, but since you're already overfitting you should probably introduce some dropout as well.

karpathy commented 9 years ago

Ok, now that I've written out the RMSProp update I realize that scaling the gradients by a constant is a no-op because of the RMSProp dynamics: the accumulating denominator ends up canceling the scale of the gradient. This means that doing `grad_params:div(a)` for any scalar `a` is a no-op (although not exactly, because of the epsilon), and in any case it's mostly a matter of numerical efficiency. Does that make sense? In this light I'm inclined to leave the current behavior untouched.
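
To make the argument concrete, here's a small standalone sketch assuming a standard RMSProp form (decay `alpha`, smoothing `eps`; the constants are illustrative and this is not the exact update in the repo). Replacing `g` with `g/a` scales both the numerator and `sqrt(m)` by `1/a`, so the update is unchanged up to the epsilon term:

```lua
-- RMSProp sketch: m = alpha*m + (1-alpha)*g^2;  p = p - lr * g / (sqrt(m) + eps)
require 'torch'

local function rmsprop_step(p, g, m, lr, alpha, eps)
  m:mul(alpha):addcmul(1 - alpha, g, g)       -- m = alpha*m + (1-alpha)*g^2
  p:addcdiv(-lr, g, torch.sqrt(m):add(eps))   -- p = p - lr * g / (sqrt(m) + eps)
end

local lr, alpha, eps = 2e-3, 0.95, 1e-8
local g = torch.randn(5)
local p1, m1 = torch.zeros(5), torch.zeros(5)
local p2, m2 = torch.zeros(5), torch.zeros(5)

rmsprop_step(p1, g, m1, lr, alpha, eps)                  -- raw gradient
rmsprop_step(p2, g:clone():div(50), m2, lr, alpha, eps)  -- gradient divided by 50
print((p1 - p2):abs():max())                             -- tiny difference, driven only by eps
```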

apaszke commented 9 years ago

You're right. I've thought it through and it makes perfect sense for RMSProp. I've checked it now and it runs fine even without this line. I think I had problems because in the past I was using other optimization algorithms (probably momentum), which required different learning rates for different lengths. Still, I think it's better to include this line just for the sake of correctness: many people read this code and use it as a starting point for their own implementations. The fix is really simple, but it's up to you :blush:
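
For contrast, here's a quick standalone sketch of a momentum-style update (again just an illustration, not code from the repo): the step size is directly proportional to the gradient, so dividing the gradient by seq_length effectively divides the learning rate by the same factor.

```lua
-- Momentum sketch: v = mu*v - lr*g;  p = p + v  (step scales linearly with |g|)
require 'torch'

local function momentum_step(p, v, g, lr, mu)
  v:mul(mu):add(-lr, g)   -- v = mu*v - lr*g
  p:add(v)                -- p = p + v
end

local g = torch.randn(5)
local p1, v1 = torch.zeros(5), torch.zeros(5)
local p2, v2 = torch.zeros(5), torch.zeros(5)

momentum_step(p1, v1, g, 2e-3, 0.9)                  -- raw gradient
momentum_step(p2, v2, g:clone():div(50), 2e-3, 0.9)  -- gradient divided by 50
print(p1:norm() / p2:norm())                         -- ~50: the scale carries through here
```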

karpathy commented 9 years ago

Good point. I opted to include a comment but I'm leaving the code unchanged (see https://github.com/karpathy/char-rnn/commit/4297a9bf69726823d944ad971555e91204f12ca8).

Closing the issue. Thanks for bringing this up!