nyu-dl / dl4mt-tutorial

BSD 3-Clause "New" or "Revised" License

use standard adam parameters #54

Closed bastings closed 8 years ago

bastings commented 8 years ago

With the current Adam parameters it is almost impossible to train a model. I suggest using the "standard" Adam parameters recommended by the original paper (http://arxiv.org/pdf/1412.6980.pdf). I have tested them on session 2 with success.
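For reference, the defaults recommended in the paper are lr (alpha) = 0.001, beta1 = 0.9, beta2 = 0.999 and eps = 1e-8. A minimal numpy sketch of one Adam update with those defaults (illustrative only, not the tutorial's Theano code):

    import numpy as np

    # Paper defaults: lr (alpha) = 0.001, b1 = 0.9, b2 = 0.999, eps = 1e-8.
    lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8

    def adam_step(theta, g, m, v, t):
        """One Adam update for parameters theta, gradient g, at step t (1-based)."""
        m = b1 * m + (1. - b1) * g               # biased first-moment estimate
        v = b2 * v + (1. - b2) * g ** 2          # biased second-moment estimate
        m_hat = m / (1. - b1 ** t)               # bias correction
        v_hat = v / (1. - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v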

kyunghyuncho commented 8 years ago

@bastings thanks for making the PR! Can you provide a bit more detail on how this change improved your training?

bastings commented 8 years ago

With the current values the cost does not go below about 44, using the attention-based enc-dec of session 2. With the values in the PR the cost does go down. Was there any reason not to use the default Adam hyperparameters?

jli05 commented 8 years ago

@bastings Would it be possible to test over several data sets or language pairs?

Following on from this, I think it'd be good to put a table of training/testing metrics in the README, for us self-learners.

bastings commented 8 years ago

Maybe that would be overdoing it a bit, but how about I post the loss on a small validation set comparing the two Adam initialisations?

Still, besides this, I think it would be good practice to have the recommended (and widely used) Adam parameters as standard. A stronger case is needed to keep the (current) alternative ones, especially since they do not seem to work well.

kyunghyuncho commented 8 years ago

@bastings that'll be awesome (the loss on a small validation set.)

I believe the current Adam parameters are from the earlier version of the paper on arXiv. There's been a number of updates to the paper, but I haven't bothered to follow up on it.

bastings commented 8 years ago

Here we go (BTEC spoken corpus Zh->En, 44k training sentences, 1k dev set)

6 epochs using the old Adam parameters (lr=0.0002, b1=0.01, b2=0.0001), valid loss per epoch: 43.5281, 43.4839, 43.4748, 43.4692, 43.4664, 43.4649

6 epochs using the new Adam parameters (lr=0.001, b1=0.99, b2=0.999), valid loss per epoch: 35.0617, 29.2072, 25.3714, 23.5589, 22.5608, 21.8455

However, I noticed later that in the latest Adam paper b1 and b2 are used inverted relative to how they are used in the code. The code can be updated like this to reflect that:

    m_t = (b1 * m) + ((1. - b1) * g)              # biased first-moment estimate (paper's convention, b1 ~ 0.9)
    v_t = (b2 * v) + ((1. - b2) * tensor.sqr(g))  # biased second-moment estimate (b2 ~ 0.999)

This also means that my second run was using the same parameters as the first (because it was using b1 and b2 inverted), except for the learning rate.

It might be good to update the code to reflect the latest Adam version, so that the parameters work as intended. But as for the default parameters, now I am less sure.

orhanf commented 8 years ago

Hey @bastings, we actually have the latest implementation of Adam (bias-corrected, from the 8th version of the paper) here, but of course I agree we should put this latest version in the tutorial as well. Would you mind changing the PR, or should I make another one? Thanks a lot for the effort and the numbers, btw.

bastings commented 8 years ago

Nice @orhanf , I'll change the PR with this. Are the ups and no_gshared features necessary to have for this tutorial?

orhanf commented 8 years ago

@bastings, ups and no_gshared are not necessary for the tutorial, thanks a bunch.

bastings commented 8 years ago

Hi @orhanf, it looks like that is an implementation of the slower version (Algorithm 1) of Adam, but using the learning-rate calculation of the more efficient version (the last paragraph before section 2.1), i.e. you don't need to calculate:

    m_t_hat = m_t / (1 - fix1)
    v_t_hat = v_t / (1 - fix2)

if you use:

    lr_t = lr * (tensor.sqrt(1 - fix2) / (1 - fix1))

Do you agree?
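To make the equivalence concrete, here is a small numpy check (illustrative, not the repo's Theano code) that folding the bias corrections into the learning rate gives the same step as computing m_t_hat and v_t_hat explicitly, ignoring epsilon, whose placement differs slightly between the two forms:

    import numpy as np

    lr, b1, b2 = 0.001, 0.9, 0.999
    g = np.array([0.3, -1.2, 0.05])      # an arbitrary gradient, held fixed for the check
    m = np.zeros_like(g)
    v = np.zeros_like(g)

    for t in range(1, 6):
        m = b1 * m + (1. - b1) * g
        v = b2 * v + (1. - b2) * g ** 2
        fix1, fix2 = b1 ** t, b2 ** t

        # Algorithm 1: explicit bias-corrected moments.
        slow = lr * (m / (1. - fix1)) / np.sqrt(v / (1. - fix2))

        # Efficient form: fold the corrections into the learning rate.
        lr_t = lr * np.sqrt(1. - fix2) / (1. - fix1)
        fast = lr_t * m / np.sqrt(v)

        assert np.allclose(slow, fast)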

bastings commented 8 years ago

@orhanf @kyunghyuncho shall I close this and make a new PR with the corrected Adam code? Would be great if you could confirm the above before I do so.

orhanf commented 8 years ago

Hi @bastings, sorry for the latency. I think it's better to keep both versions (slower and fast), with a flag to choose between them. And thanks again for the effort.

bastings commented 8 years ago

@orhanf the 'slow' version is presented in the Adam paper just for clarity; why would anyone want to use it? Every library I know implements the optimised version.

orhanf commented 8 years ago

okay, fast one only then :)

bastings commented 8 years ago

@orhanf cheers, coming up in a new PR :-)