torch / optim

A numeric optimization package for Torch.

[Feature request] A suitable test problem for Adam #47

Closed ghost closed 9 years ago

ghost commented 9 years ago

Could I request a test problem for the Adam optimizer, just to understand how it works better? Thanks :+1:

I tried to use the new adam optimizer on the Rosenbrock test problem in optim, the one used for adagrad, but it doesn't seem to work. I can't get it to work with a range of different config parameters for rosenbrock, or for the ML problem I'm working on - simple copy tasks using the neural Turing machine.

I expect I've misunderstood the adam paper and I'm doing something wrong/really dumb -- does the objective necessarily have to be stochastic for adam to be applied?

If so, then I expect it wouldn't work for Rosenbrock, or for the LSTM used in the neural Turing machine, which is deterministic?

My failed rosenbrock attempt

```lua
require 'torch'
require 'optim'
require 'rosenbrock'
require 'l2'

x = torch.Tensor(2):fill(0)
fx = {}

config_adagrad = {learningRate = 1e-1}
config_adam = {learningRate = 1e-6, beta1 = 0.01, beta2 = 0.001}

for i = 1, 10001 do
   --x, f = optim.adagrad(rosenbrock, x, config_adagrad)
   x, f = optim.adam(rosenbrock, x, config_adam)
   if (i - 1) % 1000 == 0 then
      table.insert(fx, f[1])
   end
end

print()
print('Rosenbrock test')
print()
print('x='); print(x)
print('fx=')
```

OUTPUT

```
Rosenbrock test
x =
 0.01 *
  2.0243
  0.0523
[torch.DoubleTensor of dimension 2]

fx =
     1  1
  1001  0.96578919291079
  2001  0.96476690379526
  3001  0.96406793828219
  4001  0.96344671327807
  5001  0.96285032358036
  6001  0.96226242754115
  7001  0.96167791102736
  8001  0.96111328576091
  9001  0.96050921334115
 10001  0.9599246681675
```

soumith commented 9 years ago

let's keep the issue open, it's a fair request.

ghost commented 9 years ago

OK -- I'm trying to get it to work for train-a-digit-classifier in the demos repo.

This is the simplest example they give in their paper -- they don't give any numbers, but sec. 6.1 gives a figure for logistic regression, which I'm trying to reproduce.

They state their learning rate as alpha = 0.0002, and I guess they use ClassNLLCriterion() as the cost?
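
For what it's worth, here is a rough sketch of how that sec. 6.1 setup might be wired into torch with optim.adam -- the data handling is left out, and everything except alpha = 0.0002 and ClassNLLCriterion() is my own assumption:

```lua
-- Hedged sketch: logistic regression on MNIST as in sec. 6.1 of the paper.
-- Only alpha = 0.0002 and the ClassNLLCriterion cost come from the discussion;
-- the model wiring and minibatch shapes are illustrative assumptions.
require 'torch'
require 'nn'
require 'optim'

local model = nn.Sequential()
model:add(nn.Linear(28*28, 10))   -- inputs assumed pre-flattened to B x 784
model:add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()

local params, gradParams = model:getParameters()
local config = {learningRate = 0.0002}   -- alpha from the paper

-- one adam step on a minibatch: inputs B x 784, targets B class labels (1..10)
local function trainStep(inputs, targets)
   local function feval(p)
      if p ~= params then params:copy(p) end
      gradParams:zero()
      local outputs = model:forward(inputs)
      local loss = criterion:forward(outputs, targets)
      model:backward(inputs, criterion:backward(outputs, targets))
      return loss, gradParams
   end
   local _, fs = optim.adam(feval, params, config)
   return fs[1]
end
```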

For some reason getting the adam optimizer off the ground is finicky for me. It gets stuck sometimes and can take 10-15 mins to do an iteration - alternatively, when it does get off the ground, it rockets!

I'm still confused :(

[screenshot: adam on mnist, 2015-02-26 10:15]

ghost commented 9 years ago

Update - I just saw that the author of adam.lua, louissmit, has a VBNN package which tries to do the above, so I'll try using that.

ghost commented 9 years ago

MANY THANKS TO IVO DANIHELKA for checking the code, finding the square root error in stepSize and tidying up the code :+1: :+1: :+1:

I thought I was going insane for a while!

I did a quick scan of the learningRate with Whetlab -- for learningRate = 0.002 you get very quick descent. I've put a pull request in for the Rosenbrock test - here's what I get:

```
Rosenbrock test
x =
 1.0000
 1.0000
[torch.DoubleTensor of dimension 2]

fx =
     1  1
  1001  0.0011442935064303
  2001  4.4946375786256e-10
  3001  2.0608977162588e-15
  4001  8.1859410526868e-18
  5001  2.8353825418007e-07
  6001  2.2123975348011e-16
  7001  7.6373686275132e-18
  8001  3.6164809577749e-12
  9001  8.5692907666899e-06
 10001  3.3883759892041e-14
```
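
For anyone wanting to reproduce this, a minimal sketch of the run -- leaving beta1/beta2 at the defaults in the fixed adam.lua and using the learningRate = 0.002 from the Whetlab scan -- would look roughly like:

```lua
-- Sketch only: stock rosenbrock test function from the optim test directory,
-- betas left at adam.lua's defaults, learningRate from the Whetlab scan above.
require 'torch'
require 'optim'
require 'rosenbrock'

x = torch.Tensor(2):fill(0)
config = {learningRate = 0.002}

for i = 1, 10001 do
   x, f = optim.adam(rosenbrock, x, config)
   if (i - 1) % 1000 == 0 then print(i, f[1]) end
end
print('x='); print(x)
```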

I'm trying it on my neural Turing machine problem overnight - but the initial results don't look good for real problems. I want a true stochastic optimization test problem like the Python implementation from Durk Kingma's NIPS paper - so can we keep this issue open please?

ghost commented 9 years ago

OK, I ran adam on a torch implementation of the NTM overnight and it gives reasonable convergence, but 50K iterations is too long compared to RMSprop, which converges stably on the copy task after 5-10K iterations.

I saw that Shawn Tan's feedforward controller Theano implementation uses curriculum learning, which seems like a bright idea :+1:
He uses adadelta, which took a long time for him to get working. I should really try a Python version of adam with his implementation.

It's a total hack!!!! But training the neural Turing machine on very short sequences (<= 2) and embedding it in Whetlab might give one way of getting a sensible learning rate, lambda and epsilon for using Adam with the NTM.

Any advice very gratefully received!

ghost commented 9 years ago

There's some discussion of a good lambda setting here - https://twitter.com/ogrisel/status/568316753593434112

Apparently there might also be a typo in v2 of the paper, as noted in that same thread.

soumith commented 9 years ago

This PR: https://github.com/torch/optim/pull/46 already addressed that issue, according to those Twitter comments.

ghost commented 9 years ago

Thanks, Soumith!

May I ask your advice -- is Whetlab the way to go for general hyperparameter tuning? There's Lua MNIST code that comes with it (a naff MLP model), and they use SGD.

I'm happy to do all the 'donkey work' to get a good test problem for adam working in torch -- but so far I haven't got anything convincing going for the NTM problem I'm working on (or even for the original adam paper). Happy to work with anyone who wants to do this!

I'm not yet convinced by any Adam vs RMSprop benchmark test -- arXiv papers are really great, but they're much more likely to get independently validated/published if they provide clean open-source (torch) code to back them up. Same goes for the neural Turing machine.

soumith commented 9 years ago

How about you pick mnist, cifar and imagenet in that order. Compare sgd, rmsprop, adam, lbfgs etc.
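
A rough sketch of what such a comparison loop might look like (Rosenbrock as a stand-in objective; the per-optimizer configs below are placeholders, not tuned values):

```lua
-- Sketch: run each optim routine on the same objective and compare the final
-- loss. Configs are placeholder values; swap in an MNIST/CIFAR feval for a
-- real benchmark.
require 'torch'
require 'optim'
require 'rosenbrock'

local optimizers = {
   sgd     = {fn = optim.sgd,     config = {learningRate = 1e-3}},
   rmsprop = {fn = optim.rmsprop, config = {learningRate = 1e-3}},
   adam    = {fn = optim.adam,    config = {learningRate = 2e-3}},
   lbfgs   = {fn = optim.lbfgs,   config = {maxIter = 1}},
}

for name, opt in pairs(optimizers) do
   local x = torch.Tensor(2):fill(0)
   local f
   for i = 1, 10000 do
      x, f = opt.fn(rosenbrock, x, opt.config)
   end
   print(name, f[1])
end
```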

ghost commented 9 years ago

Are there any hyperparameter tuning packages available in torch, or open-sourced ones that I could try to port over? Perhaps in R?

Whetlab has a 5000 hyper-parameter suggestion per month limit for their free public beta.

I just saw spearmint by Jasper Snoek -- hopefully that should be good enough for these small problems? It's in Python, but at least it's open source.

soumith commented 9 years ago

there's whetlab for torch, but you can use spearmint as well

ghost commented 9 years ago

Thanks Soumith!

The dp package here in torch has some hyperparameter optimization. Have you seen it, or even used it for this?

I'm trying to read the code to see how it works - looks like it could be used to implement Bayesian optimization using Gaussian processes?

The package looks like it could be useful for validating new code?

It's not obvious (to me anyway) how to switch optimizers to adam - I'll ask Nicholas Leonard.

nicholas-leonard commented 9 years ago

The dp code for hyperparameter tuning just samples from a user-defined prior. No Gaussian process. Personally, I just hand-tune hyperparameters, as automated tuning otherwise takes too much time and it's more difficult to learn what is actually going on. Of course, I would love to have something like Whetlab to hyper-optimize models.
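
For completeness, a toy sketch of that kind of prior sampling (plain random search, nothing Bayesian) -- the ranges and the evaluateModel helper are made up for illustration:

```lua
-- Toy random search in the spirit of "sample from a user-defined prior".
-- The log-uniform ranges are arbitrary and evaluateModel is a hypothetical
-- function that trains briefly and returns a validation loss.
require 'torch'

local function sampleHyperparams()
   return {
      learningRate = 10 ^ torch.uniform(-6, -1),
      epsilon      = 10 ^ torch.uniform(-10, -4),
   }
end

local best, bestLoss = nil, math.huge
for trial = 1, 20 do
   local config = sampleHyperparams()
   local loss = evaluateModel(config)   -- hypothetical: train + validate
   if loss < bestLoss then best, bestLoss = config, loss end
end
print(bestLoss, best.learningRate, best.epsilon)
```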

ghost commented 9 years ago

The Adam paper was updated on Tuesday 3rd March.

Thanks Durk & Jimmy :+1: :+1: :+1:

So I'm checking the torch code here, but I'm not the cleverest person (bit thick to be honest).

Maybe someone who's smart and good with torch tensor arithmetic can also check adam.lua?

Also I think annealing the learning rate is something worth trying?
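
One simple way to try that, as a sketch (the base rate and decay constant here are illustrative, and the 1/t schedule is just one option):

```lua
-- Manual 1/t-style annealing of the Adam learning rate on the Rosenbrock
-- test; baseLR and decay are illustrative values, not tuned ones.
require 'torch'
require 'optim'
require 'rosenbrock'

local baseLR, decay = 2e-3, 1e-4
local config = {learningRate = baseLR}
local x = torch.Tensor(2):fill(0)
local f

for t = 1, 10000 do
   config.learningRate = baseLR / (1 + decay * t)
   x, f = optim.adam(rosenbrock, x, config)
end
print(f[1])
```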

Thanks :)

ghost commented 9 years ago

So with the default parameters of the latest version of the paper, this is what I think the code would be:

```lua
-- Decay the first moment running average coefficient
local bt1 = beta1*lambda^(state.t - 1)

-- Update biased first moment estimates
--state.m:mul(bt1):add(1-bt1, dfdx)
state.m = torch.add( torch.mul(state.m, bt1), torch.mul(dfdx, 1-bt1) )

-- Update biased second raw moment estimate
--state.v:mul(beta2):addcmul(1-beta2, dfdx, dfdx)
state.v = torch.add( torch.mul(state.v, beta2), torch.mul( torch.cmul(dfdx, dfdx), 1-beta2 ) )

state.denom:copy(state.v):sqrt():add(epsilon)

local biasCorrection1 = 1 - beta1^state.t
local biasCorrection2 = 1 - beta2^state.t
local stepSize = lr * math.sqrt(biasCorrection2)/biasCorrection1

-- update x
x:add( torch.mul( torch.cdiv( state.m, state.denom ), -stepSize ) )
```

But it still doesn't work for me?

ghost commented 9 years ago

SUCCESS !!!!!!

I got it to work for the variational autoencoder that is used to generate figure 4 in version 4 of the paper, so I'm very, very happy :+1:

[screenshot: adam after about 30 epochs]

If the learning rate is positive (to be consistent with the other optimizers in optim), then the update rule should be

```lua
x:addcdiv(-stepSize, state.m, state.denom)
```

but apart from that the code is fine!
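
For anyone following along, the corrected inner update then reads roughly like this with the in-place tensor ops (a paraphrase of the fix discussed above, using the variable names from adam.lua; not the merged implementation verbatim):

```lua
-- Paraphrase of the corrected Adam step (names as in adam.lua).
local bt1 = beta1 * lambda^(state.t - 1)

state.m:mul(bt1):add(1 - bt1, dfdx)                 -- biased 1st moment estimate
state.v:mul(beta2):addcmul(1 - beta2, dfdx, dfdx)   -- biased 2nd raw moment estimate
state.denom:copy(state.v):sqrt():add(epsilon)

local biasCorrection1 = 1 - beta1^state.t
local biasCorrection2 = 1 - beta2^state.t
local stepSize = lr * math.sqrt(biasCorrection2) / biasCorrection1

-- positive learningRate convention: step against the gradient
x:addcdiv(-stepSize, state.m, state.denom)
```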

ghost commented 9 years ago

Very sorry - I forgot to close this!

Thanks for all your help everyone, Aj