harvardnlp / seq2seq-attn

Sequence-to-sequence model with LSTM encoder/decoders and attention
http://nlp.seas.harvard.edu/code
MIT License

Optimization issues #31

Closed: nicolas-ivanov closed this issue 8 years ago

nicolas-ivanov commented 8 years ago

Greetings! I've tried both the sgd and adagrad optimizers provided by default, and with both of them I failed to train any good models.

While training with sgd I used the default params, and the model converged after 15 epochs with a perplexity of 74 on the training set and 125 on the validation set. The learning rate had dropped almost to 0 at that point. Apparently, vanilla sgd with the proposed learning rate decay strategy is not the best choice...

Hence I hooked up adagrad to train 4 slightly different models (they differ in rnn_size, number of layers, dropout, and the use of bidir_lstm for the encoder; for all of them the starting learning rate is 1 and the learning rate decay is 0.5, though I assume the latter plays no role when using adagrad). After training the models on my GPUs for almost a day, I still get perplexity values that at best have 14 digits :)

Here is a sample of the training logs:

Train   17676577270078  
Valid   3.9688841882487e+19
saving checkpoint to no_feed_epoch16.00_39688841882487062528.00.t7
Epoch: 17, Batch: 250/8274, Batch size: 64, LR: 1.0000, PPL: 4725532035834.91, |Param|: 67331.00, |GParam|: 137.76, Training: 4320/1044/3276 total/source/target tokens/sec
Epoch: 17, Batch: 500/8274, Batch size: 64, LR: 1.0000, PPL: 3359570332666.66, |Param|: 67335.08, |GParam|: 215.59, Training: 4315/1035/3279 total/source/target tokens/sec
Epoch: 17, Batch: 750/8274, Batch size: 64, LR: 1.0000, PPL: 5181787313557.71, |Param|: 67339.10, |GParam|: 21.83, Training: 4314/1036/3277 total/source/target tokens/sec

That's a bit too much. Am I doing something wrong, or is this expected behaviour?

Lastly, are you planning to incorporate other optimisers from the torch/optim package?

yoonkim commented 8 years ago

Hmm, I think your perplexity seems OK with vanilla sgd? (The model isn't supposed to learn a good translation from the demo dataset, as it is too small.)

With adagrad you want your global learning rate to be much smaller, like 0.1 or 0.01. I personally prefer vanilla sgd, but my officemate swears by adagrad.
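
For illustration, here is a self-contained toy run of adagrad from the torch/optim package on a trivial quadratic objective, just to show where that global learning rate is set. This is not the seq2seq-attn code path, which (as noted later in this thread) applies its parameter updates by hand rather than through optim.

```lua
-- Toy adagrad run with torch/optim, only to show where the global learning
-- rate is set. 0.1 or 0.01 is a saner starting point than 1.0 for adagrad.
require 'torch'
require 'optim'

local x = torch.Tensor{5, -3}          -- toy parameters

-- minimise f(x) = 0.5 * ||x||^2; the gradient is simply x
local function feval(x)
  return 0.5 * x:dot(x), x:clone()
end

local config = { learningRate = 0.1 }  -- the global learning rate discussed above
local state  = {}                      -- adagrad keeps its squared-gradient history here
for i = 1, 100 do
  local _, fs = optim.adagrad(feval, x, config, state)
  if i % 20 == 0 then print(i, fs[1]) end   -- loss shrinks steadily
end
```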

Currently I don't have plans to incorporate other optimizers, as in my experience I haven't seen adadelta/adam convincingly outperform vanilla sgd with the LSTM seq2seq architecture (GRU doesn't seem to work well with vanilla sgd, though). But feel free to send a PR if you've implemented these!

Hope this helps!

nicolas-ivanov commented 8 years ago

Gotcha, thanks for the quick response!

wangliangguo commented 8 years ago

@nicolas-ivanov can you tell me how to implement the optimization with the sgd and adagrad optimizers provided by default? Or could you show your code here? I am a newcomer to Torch and I really want to know.

nicolas-ivanov commented 8 years ago

@chenwangliangguo I did not quite get what you are aiming for. Do you want to train your models with these optimisers, or to implement sgd and adagrad yourself? For the latter, see this function for adagrad and this line for sgd.
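
In case those links go stale, here is a rough sketch of what such hand-rolled updates look like with plain Torch tensor operations. The function and variable names below are placeholders, not the ones used in train.lua.

```lua
-- Hand-rolled update steps in plain Torch, roughly the style the training loop
-- uses. Names here are illustrative placeholders, not the ones in train.lua.
require 'torch'

-- plain sgd: params <- params - lr * grads
local function sgd_step(params, grads, lr)
  params:add(-lr, grads)
end

-- adagrad: accumulate squared gradients and divide the step by their sqrt
local function adagrad_step(params, grads, lr, state)
  state.var = state.var or torch.Tensor():typeAs(grads):resizeAs(grads):zero()
  state.var:addcmul(1, grads, grads)             -- v <- v + g .* g
  local std = torch.sqrt(state.var):add(1e-10)   -- small epsilon avoids division by zero
  params:addcdiv(-lr, grads, std)                -- params <- params - lr * g / sqrt(v)
end

-- usage after computing grads for a batch:
--   sgd_step(params, grads, 1.0)
--   adagrad_step(params, grads, 0.1, adagradState)
```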

wangliangguo commented 8 years ago

@nicolas-ivanov currently the code manually computes all the gradients and then subtracts the update from the parameters' values. What I wonder is how to accomplish such a task using the optim package.
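
For reference, the usual torch/optim pattern is to flatten the weights with getParameters() and hand optim a closure that returns the loss and the flattened gradients; optim then performs the parameter update for you. Below is a minimal toy sketch of that pattern; the model and data are placeholders, not the seq2seq-attn model.

```lua
-- Minimal toy example of the torch/optim pattern: getParameters() gives flat
-- views of all weights and gradients, and a closure computes loss/gradients
-- per batch. The model and data below are placeholders.
require 'nn'
require 'optim'

local model = nn.Sequential():add(nn.Linear(10, 1))
local criterion = nn.MSECriterion()
local params, gradParams = model:getParameters()

local input  = torch.randn(32, 10)             -- toy batch
local target = torch.randn(32, 1)

-- the closure optim expects: returns loss and dloss/dparams for one batch
local function feval(p)
  if p ~= params then params:copy(p) end
  gradParams:zero()
  local output = model:forward(input)
  local loss = criterion:forward(output, target)
  model:backward(input, criterion:backward(output, target))
  return loss, gradParams
end

local config = { learningRate = 0.1 }          -- keep this small for adagrad
local state  = {}
for epoch = 1, 10 do
  local _, fs = optim.adagrad(feval, params, config, state)
  print(('epoch %d  loss %.4f'):format(epoch, fs[1]))
end
```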