kaishengtai / torch-ntm

A Neural Turing Machine implementation in Torch.

ADAM instead of RMSprop #3

Closed (ghost closed this issue 9 years ago)

ghost commented 9 years ago

The ADAM optimizer is now available in the torch optim module.

https://github.com/torch/optim/blob/master/adam.lua

Might be interesting to swap RMSProp for this

kaishengtai commented 9 years ago

Thanks for the heads up -- I'll give ADAM a try. If you decide to test it out yourself, it would be great if you posted your impressions here.

ghost commented 9 years ago

Hi Kai,

Of course I'll post any feedback I can give on the ADAM optimizer once I get it working reliably. I just had another look at the ADAM paper, and I'm fairly sure it's not a good idea for the two tasks you've implemented: I think it's a necessary condition that the objective has to be stochastic, which is not the case for the copy task or associative recall -- but it might work for the dynamic n-gram task?

If you do want to try it -- the lambda default bug in optim.adam was fixed two days ago, so you may want to reinstall the optim rock fresh. Even with a fresh pull it doesn't work straight out of the box with the default parameters, that is, with

```lua
local adam_state = {}
optim.adam(feval, params, adam_state)
```

replacing the commented-out adagrad optimizer in copy.lua. If you try it on the dynamic n-gram task, amortizing the learning rate might help, as stated on page 3. A lot of smart people are using it instead of RMSprop on problems with stochastic objective functions, so it must work for the right sort of applications.
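For concreteness, here's a minimal sketch of what I mean by that swap, assuming the existing `feval` closure and the flattened `params` from `model:getParameters()` -- the learning rate and iteration count below are just illustrative:

```lua
require 'optim'

-- Minimal sketch only: assumes `feval` returns (loss, grad_params) and that
-- `params` is the flattened parameter vector from model:getParameters().
local adam_config = {learningRate = 1e-3}  -- otherwise the paper's defaults
local adam_state = {}

for iter = 1, 10000 do  -- iteration count is just illustrative
  local _, loss = optim.adam(feval, params, adam_config, adam_state)
  if iter % 100 == 0 then
    print(string.format('iter %d, loss = %.4f', iter, loss[1]))
  end
end
```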

It's stated that the ADAM optimizer is used by Alex Graves (instead of his previous version of RMSprop) in a new task/application of the NTM. I guess you've seen this already, but in case you haven't, here is Alex Graves and Ivo Danihelka's latest:

DRAW: A Recurrent Neural Network For Image Generation

There's a YouTube link given in the introduction of that paper, and the results in Table 2 are most encouraging :) I guess it lends credibility to the idea that combining NTMs with the novel type of recurrent VAE they use is a promising avenue?

I'm interested in reproducing the DRAW paper eventually, but at the moment I'm still working on understanding, self-documenting, and getting the NTM paper to work (your implementation in particular), as well as VAEs.

I really appreciate you posting your NTM project -- it's impressive! Thanks :)

Best, Aj

PS - May I ask what tools you use to debug your Torch7 projects? Are zbs-torch or Lua Development Tools (LDT) and Graphviz with nngraph the things you use?

ghost commented 9 years ago

Here are a couple of graphs produced with nngraph and Graphviz

```lua
master_cell = model.master_cell
graph.dot(master_cell.fg, 'master_cell_backward_graph', 'NTM_master_cell_backward_graph')
```

[image: ntm_master_cell_backward_graph]

```lua
initial_module = model.init_module
graph.dot(initial_module.bg, 'initial_module_backward_graph', 'NTM_initial_module_backward_graph')
```

[image: ntm_initial_module_backward_graph]

Unfortunately, they don't really help that much with understanding what's happening. I thought there might be an intuitive way of interpreting the master cell graph in terms of the concepts of the paper?

I think I should close this issue, as Adam does not work (for me), and trying to use Graphviz (and ZeroBraneStudio) to understand things isn't really very useful either.

If you have any suggestions for improving your existing package, I'd be very happy to work on them :+1: At the moment I'm trying to reproduce figures 3 to 6 from the paper -- basically simplifying your existing code -- to implement a vanilla LSTM model and an NTM with a feedforward controller like they do in the paper. With your existing code it should already be possible to reproduce figures 4 & 6?

Apologies, Aj :-1:

ghost commented 9 years ago

It turns out that annotating the nngraph for NTM.lua is quite interesting, at least for someone who isn't yet sure of all the code and the paper.

[image: ntm_master_cell_forward_graph]

kaishengtai commented 9 years ago

Hi Ajay, sorry for the slow response. As is apparent from the code and your graphviz plots, the current implementation is complicated by the state that needs to be carried over between iterations of the NTM. Using the feedforward controller (like in Shawn Tan's Theano implementation) would definitely simplify things considerably.

Regarding ADAM / optimization -- from my tests, it seems that some kind of RMSProp-like procedure with an exponentially decaying RMS term in the denominator is the way to go. I've tried AdaGrad and AdaDelta but haven't been able to make much progress with either method. From my brief skim of the paper, ADAM seems to fit the bill, at least cosmetically, though it's of course unclear what hyperparameters would work best.

Regarding debugging -- I've found that inserting Print layers in suspected problem spots has been very quick and useful. In addition to the tools that you listed, there's also Facebook's lua debugger, though I haven't tried it personally. I'm not aware of any great utilities for debugging code that uses nngraph; it's definitely the case that errors can sometimes be quite opaque.
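For what it's worth, such a Print layer can be as simple as an identity module that logs whatever passes through it. A rough sketch along those lines (the `DebugPrint` name is made up here, it's not part of nn):

```lua
require 'nn'

-- Rough sketch of an identity "Print" module for debugging: it logs the size
-- of the tensor flowing through it and passes the data along unchanged.
local DebugPrint, parent = torch.class('nn.DebugPrint', 'nn.Module')

function DebugPrint:__init(tag)
  parent.__init(self)
  self.tag = tag or ''
end

function DebugPrint:updateOutput(input)
  print('[' .. self.tag .. '] forward:', input:size())
  self.output = input
  return self.output
end

function DebugPrint:updateGradInput(input, gradOutput)
  self.gradInput = gradOutput
  return self.gradInput
end
```

It can then be spliced between nngraph nodes, e.g. `h = nn.DebugPrint('controller output')(h)`, without changing the computation.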

ghost commented 9 years ago

Hi Kai, thanks a lot for your advice and the tools!

I have seen Shawn Tan's implementation, but I'm actually finding your implementation easier to understand. I'll have another look at it tomorrow.

I'm starting to get the feeling that, for realistic applications, the complexity of the graph is unavoidable. Further improving the NTM is only going to involve adding more modules -- the LSTM controller seems to be worth the added complexity. So I'm comfortable now with doing things in a modular, object-oriented way using Torch. It's really impressive that you managed to do this :)

I pulled your first commit -- with single heads -- and am trying to use that code to implement a feedforward controller. I'm also trying to put a switch in the existing code that will allow a plain LSTM model with zero heads. If I do this in a principled way, it would make things easy to understand for novices and could reproduce the figures in the paper. If you leave implementing Adam in the NTM to me, I can work on that without wasting your time. Are you working on the dynamic n-gram task?

I've got some other interests that I would like to apply the NTM to -- generative models like stochastic versions of recurrent networks and variational autoencoders, and trying to implement them with sampling-based inference. I have bits of code for these things. Combining these algorithms with external memory/NTM seems, at the moment, to be the most general probabilistic machine I can think of -- so it's fun :+1:

May I ask if you have any real datasets/problems that you are interested in applying the NTM to? Sequence prediction problems seem like a realistic goal. I see you have a theoretical physics background - me too. This looks interesting and there's torch code for it as well.

I went to a tutorial by one of the Deepmind researchers last week on variational inference - combining VI with generative models and external memory/NTM is what they currently are working on.

ghost commented 9 years ago

Sorry, I forgot to include a link to a variational autoencoder -- the code's in Torch -- so perhaps when I get the feedforward NTM working I could experiment with implementing it in the simplest possible case. That code is for MNIST, so training on the copy task instead gets rid of batches -- basically it all fits onto one page.

That's a realistic goal, (for me at least), and it would nicely demonstrate the modularities between controller/neural network, external memory and autoencoder/decoder.


ghost commented 9 years ago

Hi Kai,

The adam optimizer was fixed again yesterday, and it now works on the Rosenbrock test problem -- please pull optim fresh if you want to try this.

I tried using adam for the copy task last night; I got some convergence after 50K iterations with max length 20, but it wasn't particularly stable.

I've been looking again at Shawn Tan's code and his blog -- he uses curriculum learning, which was one of the things I didn't understand about his code.

I just tried using Adam for the copy task (with its default params) with max length 2, and it converges after about 10K iterations, with good stability after 20K. I now understand why there is so much screen output in your code -- it makes sense now! So I'm guessing that hyper-parameter tuning for Adam, with your existing choices of rectifiers, might give some performance improvement?

I also got whetlab-client yesterday -- it's a hyperparameter tuner, and it works pretty well if you can construct a sensible training signal. I guess a simple running average of the loss over the last 500 iterations is reasonable. Probably running for 5K iterations and then calling for a suggestion of new hyper-params is the way to go?
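To be concrete, the training signal I have in mind is just something like this (window size and names purely illustrative):

```lua
-- Sketch of a smoothed training signal for the tuner: the mean loss over the
-- last `window` iterations.
local window = 500
local recent = {}

local function smoothed_loss(loss)
  table.insert(recent, loss)
  if #recent > window then table.remove(recent, 1) end
  local sum = 0
  for _, l in ipairs(recent) do sum = sum + l end
  return sum / #recent
end
```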

Update -- I just saw that the dp package in Torch has some hyperparameter optimization capability.

May I ask your advice? Is finding good hyper-parameters for Adam a high priority? I guess things will be sped up if I implement/use a feedforward controller instead of LSTM, but I still haven't got that working yet.

ghost commented 9 years ago

Hi Kai,

A new version of the Adam paper was released on Tuesday. I checked the code and pushed a new commit to the optim package, and it was just merged.

So if you pull that package fresh and run adam.lua with the default hyper-parameters, you should see it give good convergence a bit quicker than RMSprop. I found it works after about 4000-4500 iterations, so it's just a shade quicker.

This is its performance 'straight out of the box', so I expect that with some tuning of the learning rate and the decay lambda it might work a little better?

Perhaps you can give it a try, if you're curious?

Best, Aj.