Closed the-moliver closed 9 years ago
description for future reference: This PR modified the implementation of Nesterov solver so that the formulas are more consistent with the cited paper (although the old implementation is in an equivalent form).
Out of curiosity, could you show me how the previous implementation was equivalent? I couldn't make it work out when I tried. The code didn't use the last_momentum variable which seemed to be necessary.
@the-moliver Here is my derivation. Sorry I did not use the standard notation in the paper (h means history). Correct me if I'm wrong:
Yup, your derivation looks correct. I'm surprised they didn't use that form in the paper. The only difference with standard momentum then is that the parameters are updated with the history and gamma at time t, rather than t-1, which is kinda cool.
I'll remove the old PR in favor of this one