pluskid / Mocha.jl

Deep Learning framework for Julia

New PR for Nesterov change #47

Closed by the-moliver 9 years ago

the-moliver commented 9 years ago

I'll remove the old PR in favor of this one

coveralls commented 9 years ago

Coverage remained the same at 53.34% when pulling 2b56aeae6188121f7e16ef0b6d6fb1ae23b46c10 on the-moliver:dev into 743e4e0741579e63567a83c4717e22fd282fc602 on pluskid:master.

pluskid commented 9 years ago

Description for future reference: this PR modifies the implementation of the Nesterov solver so that its formulas are more consistent with the cited paper (although the old implementation was an equivalent form).

the-moliver commented 9 years ago

Out of curiosity, could you show me how the previous implementation was equivalent? I couldn't make it work out when I tried. The code didn't use the last_momentum variable, which seemed to be necessary.

pluskid commented 9 years ago

@the-moliver Here is my derivation. Sorry, I did not use the standard notation from the paper (h means history). Correct me if I'm wrong:

[Image: derivation, 2015-01-23 120723]

the-moliver commented 9 years ago

Yup, your derivation looks correct. I'm surprised they didn't use that form in the paper. The only difference with standard momentum then is that the parameters are updated with the history and gamma at time t, rather than t-1, which is kinda cool.
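
For anyone reading later without the image: a rough numerical check of the kind of equivalence discussed here. This is only a sketch under assumptions. It uses the commonly cited change of variables (track the "look-ahead" point theta + mom * v instead of theta), which may not match the notation of the derivation above or the exact Mocha.jl code. The names `grad`, `lr`, `mom`, `nesterov_paper_form` and `nesterov_lookahead_form` are hypothetical, and the objective is a toy quadratic chosen purely for illustration.

```python
def grad(x):
    # Gradient of a toy objective f(x) = 1.5 * x**2 (hypothetical, for illustration only).
    return 3.0 * x


def nesterov_paper_form(theta0, lr, mom, steps):
    # Paper-style form: evaluate the gradient at the look-ahead point theta + mom * v,
    # update the velocity, then move the parameters by the new velocity.
    theta, v = theta0, 0.0
    lookaheads = []
    for _ in range(steps):
        lookahead = theta + mom * v
        lookaheads.append(lookahead)
        v = mom * v - lr * grad(lookahead)
        theta = theta + v
    return lookaheads


def nesterov_lookahead_form(theta0, lr, mom, steps):
    # Reformulated form: store the look-ahead point itself as the parameter.
    # The update looks like standard momentum, except that the *new* velocity
    # is used in the parameter step.
    big_theta, v = theta0, 0.0
    stored = []
    for _ in range(steps):
        stored.append(big_theta)
        g = grad(big_theta)
        v = mom * v - lr * g
        big_theta = big_theta + mom * v - lr * g
    return stored


if __name__ == "__main__":
    a = nesterov_paper_form(1.0, lr=0.1, mom=0.9, steps=20)
    b = nesterov_lookahead_form(1.0, lr=0.1, mom=0.9, steps=20)
    # Largest gap between the two trajectories; should be at floating-point noise level.
    print(max(abs(x - y) for x, y in zip(a, b)))
```

With a zero initial velocity the two loops visit the same sequence of look-ahead points up to floating-point rounding, which is consistent with the observation that the reformulated update differs from standard momentum only in using the updated history in the parameter step.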