scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Investigate Adagrad for SGD #1209

Closed amueller closed 10 years ago

amueller commented 12 years ago

See here: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-24.pdf

MechCoder commented 10 years ago

Hi @amueller: I tried looking at the research paper, but I don't quite understand how to do the Primal-Dual Subgradient update and the Composite Mirror Descent update.

vene commented 10 years ago

I only skimmed the paper, but the implementations I found via Google just accumulate the squared gradients as you go, and the learning rate is then computed from that accumulator. See the pylearn2 implementation.

If I understand correctly from the paper, this, followed by the projection, is the composite mirror descent update with diagonal matrices (bottom of page 4).

However, I anticipate that this poses some tricky implementation problems. If the gradient is large and sparse, you don't want to naively go through it twice every iteration (once for the accumulation, once for the norm). Could this be done faster, or is it unavoidable? The paper seems to discuss some solutions for this at the end.
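
For concreteness, here is a minimal sketch of the accumulate-and-rescale update described above (plain diagonal AdaGrad, ignoring the regularization term for now). The function name, the step size `eta`, and the `eps` smoothing term are illustrative choices, not code from scikit-learn or pylearn2:

```python
import numpy as np

def adagrad_step(w, grad, g2_sum, eta=0.1, eps=1e-8):
    """One diagonal AdaGrad step: accumulate squared gradients,
    then scale the step per coordinate (illustrative sketch only)."""
    g2_sum += grad ** 2                        # running sum of squared gradients
    w -= eta * grad / (np.sqrt(g2_sum) + eps)  # per-feature learning rate
    return w, g2_sum

# Toy usage with a dense random "gradient".
rng = np.random.RandomState(0)
w, g2_sum = np.zeros(5), np.zeros(5)
for _ in range(100):
    grad = rng.randn(5)  # stand-in for the gradient of a real loss
    w, g2_sum = adagrad_step(w, grad, g2_sum)

# For a large sparse gradient you would only touch its nonzero indices, e.g.
#   g2_sum[idx] += grad_vals ** 2
#   w[idx] -= eta * grad_vals / (np.sqrt(g2_sum[idx]) + eps)
# so a single pass over the nonzeros is enough.
```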

MechCoder commented 10 years ago

Hi @vene, thanks for pointing me to the pylearn2 implementation. I understand that they keep an accumulator that sums the squared gradient for each parameter, and a list that stores the per-parameter learning rates.

However, I still don't understand how the final projection / composite mirror descent update, i.e., the optimisation problem, is actually done (either using scikit-learn or otherwise). Would you be able to provide an example? I'm relatively new to this, so I'm sorry if this is obvious or if I'm missing something. Thanks.

vene commented 10 years ago

I'm not sure, but isn't it the regularization-induced projection? Looking at the pylearn2 code, I think the only thing they change when turning adagrad on in SGD is the learning rate schedule. I might equally well be missing something.
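
If that reading is right, then for an L1 regularizer the composite mirror descent update reduces to the AdaGrad-scaled gradient step followed by per-coordinate soft-thresholding (the "projection"). A rough sketch under that assumption; the names and the `alpha` regularization strength are hypothetical, not scikit-learn API:

```python
import numpy as np

def adagrad_l1_step(w, grad, g2_sum, eta=0.1, alpha=0.01, eps=1e-8):
    """AdaGrad-scaled gradient step followed by the soft-thresholding
    induced by an L1 penalty (sketch only; `alpha` is the L1 strength)."""
    g2_sum += grad ** 2
    h = np.sqrt(g2_sum) + eps   # diagonal scaling
    z = w - eta * grad / h      # plain AdaGrad gradient step
    w_new = np.sign(z) * np.maximum(np.abs(z) - eta * alpha / h, 0.0)
    return w_new, g2_sum
```

With `alpha = 0` this collapses back to the plain update above, which matches the observation that only the learning rate schedule changes when there is no composite term.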

amueller commented 10 years ago

Actually I don't think this is very relevant any more. It might help for text data, but for dense data I'm pretty sure it won't. There are newer, more relevant papers.

amueller commented 10 years ago

I'm closing this as I don't feel this is a pressing issue. Feel free to play with it, though. If you see increased performance on sparse datasets, it would be great to see a PR including benchmarks.

mheilman commented 10 years ago

Hey @MechCoder, if you're still interested in adagrad, I came across this tutorial that might help: http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf

(I saw a few papers using adagrad at ACL this year for NLP problems.)

MechCoder commented 10 years ago

Hey, thanks, but I think @dsullivan7 is working on it right now. This might be interesting :)

dsullivan7 commented 10 years ago

Awesome! I'm planning on adding adagrad as well as adadelta and an implementation of ASGD shortly. Thanks for the link to the paper, @mheilman!