Hmm, I think your perplexity seems OK with vanilla sgd? (The model isn't supposed to learn a good translation from the demo dataset, as it is too small.)
With adagrad you want your global learning rate to be much smaller, like 0.1 or 0.01. I personally prefer vanilla sgd but my officemate swears by adagrad.
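For concreteness, here is a minimal sketch of the two regimes as torch/optim-style config tables (the `learningRate` field name is from the optim package; the values are illustrative assumptions, not this repo's defaults):

```lua
-- Vanilla sgd: global learning rate around 1, relying on an external decay schedule.
local sgdConfig = {
   learningRate = 1.0
}

-- Adagrad: per-parameter scaling means the global rate should be ~10-100x smaller.
local adagradConfig = {
   learningRate = 0.1   -- 0.1 or 0.01, as suggested above
}
```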
Currently I don't have plans to incorporate other optimizers, as in my experience I haven't seen adadelta/adam convincingly outperform vanilla sgd with the LSTM seq2seq architecture (GRU doesn't seem to work well with vanilla sgd, though). But feel free to send a PR if you've implemented these!
hope this helps!
Gotcha, thanks for the quick response!
@nicolas-ivanov can you tell me how to implement the optimization with the sgd and adagrad optimizers provided by default? Or could you show your code here? I am a newcomer to torch and I really want to know.
@chenwangliangguo I did not get what exactly you are aiming for. Do you want to train your models with these optimisers, or to implement sgd and adagrad yourself? For the latter, see this function for adagrad and this line for sgd.
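The links above aren't reproduced in this thread, but the core of both updates is short. A rough Lua/Torch sketch, assuming a flattened parameter/gradient pair (the variable and function names here are illustrative, not the ones from the linked code):

```lua
-- params, gradParams: flat tensors, e.g. from model:getParameters().
-- Assumes gradParams already holds dLoss/dParams for the current batch.

-- Vanilla sgd: params <- params - lr * grad
local function sgdStep(params, gradParams, lr)
   params:add(-lr, gradParams)
end

-- Adagrad: keep a running sum of squared gradients and scale each
-- coordinate's step by 1 / sqrt(historical sum).
local function adagradStep(params, gradParams, lr, state)
   state.sumSq = state.sumSq or torch.Tensor():typeAs(params):resizeAs(params):zero()
   state.sumSq:addcmul(1, gradParams, gradParams)   -- G <- G + g .* g
   -- params <- params - lr * g ./ (sqrt(G) + eps)
   params:addcdiv(-lr, gradParams, torch.sqrt(state.sumSq):add(1e-10))
end
```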
@nicolas-ivanov currently the code manually computes all the gradients and then subtracts them from the parameter values. What I wonder is how to use the optim package to accomplish such a task.
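For reference, the usual pattern with the optim package replaces the manual update with a closure that returns (loss, gradients). A hedged sketch, assuming the model and criterion expose the standard nn interfaces (`model`, `criterion`, `input`, `target` are placeholders for whatever the training loop already has):

```lua
require 'optim'

-- Flatten all learnable parameters and their gradients into two tensors once.
local params, gradParams = model:getParameters()

local optimState = { learningRate = 0.1 }   -- adagrad wants a small global rate

-- feval must return loss and dLoss/dParams for the current minibatch.
local function feval(x)
   if x ~= params then params:copy(x) end
   gradParams:zero()
   local output = model:forward(input)
   local loss = criterion:forward(output, target)
   model:backward(input, criterion:backward(output, target))
   return loss, gradParams
end

-- One optimisation step; swap in optim.sgd for plain sgd.
optim.adagrad(feval, params, optimState)
```

Note that optim.adagrad keeps its per-parameter squared-gradient history inside `optimState`, so the same table should be passed on every call.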
Greetings! I've tried both `sgd` and `adagrad` optimizers provided by default, and with both of them I failed to train any good models.

While training with `sgd` I took the default params and the model converged after 15 epochs with a perplexity of 74 on the training set and 125 on the validation set. The learning rate had dropped almost to 0 at this point. Apparently, vanilla `sgd` with the proposed learning decay strategy is not the best choice...

Hence I hooked up `adagrad` for training 4 slightly different models (there is a difference in rnn_size, number of layers, dropout and in the usage of bidir_lstm for the encoder; however, for all the models the starting learning rate is 1 and the learning decay is 0.5, even though I assume the latter plays no role when using `adagrad`). After training the models on my GPUs for almost a day I still get perplexity values that at best have 14 digits :)

Here is a sample of the training logs:
That's a bit too much. Am I doing something wrong, or is this expected behaviour?
Lastly, are you planning to incorporate other optimisers from the torch/optim package?