koraykv / optim

Some optimization packages for torch7

ASGD has weight decay built in? #1

Closed rolfe closed 11 years ago

rolfe commented 11 years ago

The averaged SGD function implements x := (1 - lambda*eta_t)*x - eta_t*df/dx(z,x), which includes L2 weight decay with decay constant lambda. The weight decay constant lambda also appears in the learning rate decay function eta_t = eta0 / (1 + lambda*eta0*t)^0.75.
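
For concreteness, here is a minimal sketch of that step in plain Lua (illustrative only, not the actual optim.asgd code; `asgdStep`, `params`, `grads`, and `state` are made-up names, and the averaging of iterates is omitted):

```lua
-- One step of the update described above, on a plain array of parameters:
--   x     := (1 - lambda*eta_t)*x - eta_t*df/dx(z,x)
--   eta_t := eta0 / (1 + lambda*eta0*t)^0.75
local function asgdStep(params, grads, state)
   local eta_t = state.eta0 / (1 + state.lambda*state.eta0*state.t)^0.75
   for i = 1, #params do
      -- the (1 - lambda*eta_t) factor is the built-in L2 weight decay in question
      params[i] = (1 - state.lambda*eta_t)*params[i] - eta_t*grads[i]
   end
   state.t = state.t + 1
   return params, eta_t
end

-- e.g. state = {eta0 = 1e-2, lambda = 1e-4, t = 0}
```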

The ASGD papers I've read don't seem to require the use of L2 weight decay. Moreover, Xu (2010), "Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent", seems to imply that the lambda term in the learning rate decay function should be a multiple of the smallest eigenvalue of the Hessian. Looking at Bottou's SGD code, I don't see weight decay in the CRF example, although it does appear in the readme file, from which the torch implementation seems to be derived.

Is the L2 weight decay an essential part of the ASGD implementation? Why is the weight decay constant tied to the learning rate decay function?
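
To make the coupling concrete (a toy illustration, not optim code): setting lambda = 0 would remove the decay term, but with the schedule as written it also stops the learning rate from ever decaying.

```lua
-- With lambda = 0 the (1 - lambda*eta_t) decay factor becomes 1, but then
-- eta_t = eta0 / (1 + 0*eta0*t)^0.75 = eta0 for every t, so the learning
-- rate never anneals either.
local eta0, lambda = 1e-2, 0
for t = 1, 3 do
   print(eta0 / (1 + lambda*eta0*t)^0.75)  -- prints 0.01 each time
end
```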

Thanks, Jason