soumith / convnet-benchmarks

Easy benchmarking of all publicly accessible implementations of convnets
MIT License

[Feature Request] Benchmarking - RNN/LSTM #36

Open AjayTalati opened 9 years ago

AjayTalati commented 9 years ago

Please close this if it is inappropriate

It seems that benchmarking RNN/LSTM implementations is a worthwhile thing to do now, as these models are getting amazing results, many of them computed with novel training algorithms in either Torch or Theano.

It would also demonstrate which frameworks are easiest to create novel architectures with - so basically nngraph versus whatever else is out there.

Another important point for researchers is that we have to get our sums right - whatever tools we use to compute our probabilistic graphs. The same model and training method, implemented in Torch/Theano/PyBrain3/RNNLIB/whatever, should give very close numerical results over a range of hyper-parameters. So this could develop into something more than a one-off speed test on common datasets.

I propose that, as part of the algorithm/system testing procedure, new research models and training algorithms should by default be independently implemented in both Torch and a second 'calculator', to test the accuracy of the claimed results. This step should follow quite soon after the fundamental research has been done and promising results have been achieved with some calculator, and it should come before these new tools are passed upstream to software engineers and data scientists.

Ideally it should be conducted by a team independent of the original researchers. The main skill required is a strong enough mathematical background that research documents/papers can be read, derivations/proofs can be checked, and testing/validation can be done with a minimal amount of code, basically starting from scratch. Since the object of these tests is to validate the accuracy and speed of algorithms, there's no need for vast computational resources or datasets either.

This type of multiple route testing/validation is the standard procedure in quant financial/investment trading systems development, and it can easily be adopted here.

AjayTalati commented 9 years ago

So to make some concrete progress on this, I think we need a really simple textbook problem and test data set, suitable for LSTM.

At the moment the simplest problem I can think of is predicting the next bit in a dynamic binary n-gram, based on a beta-binomial distribution. The advantage of this is that we're using a simple generative model, so we can calculate the optimal estimator, and hence the cost, exactly. Also, this simple Markov chain model is nice because it's a first step towards latent variable generative models.

All the details for this are given in Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective (2012 edition), sections 3.3.4 (p. 77) and 17.2.2 (p. 591), so it should be very easy to code up and test this sequence generator in Octave/MATLAB/R, and then port it over to Lua, Python or C++.
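Here's a minimal sketch of what I mean, in NumPy - the generator plus the Bayes-optimal predictor, assuming the Beta(0.5, 0.5) prior from the NTM paper's dynamic n-gram task (all names and sizes here are illustrative, not from any of the implementations above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                    # context length (bits of history)
a, b = 0.5, 0.5          # Beta prior hyper-parameters

# The generative model: one P(next bit = 1 | context) per binary context,
# each drawn once from Beta(a, b).
p = rng.beta(a, b, size=2 ** n)

def generate(T):
    """Sample a length-T binary sequence from the n-gram model."""
    seq = list(rng.integers(0, 2, size=n))          # random initial context
    for _ in range(T - n):
        ctx = int("".join(map(str, seq[-n:])), 2)   # context -> table index
        seq.append(int(rng.random() < p[ctx]))
    return np.array(seq)

def optimal_pred(seq):
    """Bayes-optimal P(next = 1 | context): the Beta-Bernoulli posterior
    predictive, (N1 + a) / (N0 + N1 + a + b), updated online."""
    counts = np.zeros((2 ** n, 2))
    preds = []
    for t in range(n, len(seq)):
        ctx = int("".join(map(str, seq[t - n:t])), 2)
        n0, n1 = counts[ctx]
        preds.append((n1 + a) / (n0 + n1 + a + b))  # predict before updating
        counts[ctx, seq[t]] += 1
    return np.array(preds)

seq = generate(10_000)
q = optimal_pred(seq)
targets = seq[n:]
nll = -np.mean(targets * np.log(q) + (1 - targets) * np.log(1 - q))
print("optimal per-bit NLL:", nll)   # the bound any LSTM should approach
```

Any trained model's per-bit negative log-likelihood on sequences from this generator can then be compared directly against the optimum.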

The Torch LSTM code is given in the slides for this YouTube video, which explains it a bit.

The point of this exercise: I don't fully understand the above Torch LSTM code (and no one has given me a convincing explanation yet), so I want one very simple, 'clean' example where I understand every line of the code.

Once I've done this in Torch, I can try to solve the same problem using Theano, and hence get clean benchmark implementations using both calculators, with the same optimizers. In particular I want to do this without using any extra 'junk', so the comparison between the two calculators is very clear (and so I can spot if I've made a mistake).

soumith commented 9 years ago

I think you're powering through this faster than I am. The only thing I would suggest is to base the benchmark on what people think matters most, i.e. benchmark RNN/LSTM configurations from published papers. I am happy to help in any way.

soumith commented 9 years ago

*published and hotly discussed papers, papers that matter to the audience now and today.

AjayTalati commented 9 years ago

Here's some new open-source LSTM code in Torch, from the Oxford University Computer Science Masters course.

https://github.com/oxford-cs-ml-2015/practical6

It looks pretty good, and seems to be based a lot on Learning to Execute? I don't understand it yet, so all insights are very warmly welcomed :+1:

soumith commented 9 years ago

It uses nngraph, which I guess you are already familiar with, looking at your comments from the NTM repository.

It is simply constructing the LSTM cell for a given time-step.

I am not sure about that particular repository (oxford-cs-ml-2015), but Learning to Execute and the LSTM code work out of the box AFAIK.
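For reference, this is roughly what one LSTM time-step computes, written gate-by-gate in NumPy (a sketch with made-up shapes and initialisation, not the nngraph code itself):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time-step: x is the input, (h_prev, c_prev) the state."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wg"] @ x + p["Ug"] @ h_prev + p["bg"])   # candidate cell
    c = f * c_prev + i * g      # new cell state
    h = o * np.tanh(c)          # new hidden state
    return h, c

# toy sizes, just to show the call
rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
p = {k: rng.normal(0, 0.1, (n_hid, n_in)) for k in ("Wi", "Wf", "Wo", "Wg")}
p.update({k: rng.normal(0, 0.1, (n_hid, n_hid)) for k in ("Ui", "Uf", "Uo", "Ug")})
p.update({k: np.zeros(n_hid) for k in ("bi", "bf", "bo", "bg")})
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```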

nouiz commented 9 years ago

Just to note that for Theano there is a tutorial with a working LSTM example here:

http://deeplearning.net/tutorial/lstm.html

I won't have time to work on it, but I can answer questions on the code.


AjayTalati commented 9 years ago

Hey Soumith,

I am happy to help in any way.

Thanks man for the show of interest and support - hopefully I'll get some results in the next few days!

It is simply constructing the LSTM cell for a given time-step.

Yes, I think so too. They've chosen a very ambitious problem, and made it very complicated for a tutorial. Not sure a student is going to get much out of just running the code?

Their hearts are in the right place though; it seems they've tried to make the code more like the mathematical/verbal description of an LSTM. I'm sympathetic to these guys because they tried to make some difficult code easier to understand, and more flexible too.

I'm not sure it's a good idea to start with a multilayer LSTM. I'm going to start with a one-layer LSTM - that's the simplest config - best to start simple and get results. I know, I'm shallow.

This is the Oxford code for a single layer, with 256 cells, given in LSTM.lua

[image: lstm_master_cell]

AjayTalati commented 9 years ago

Hi Frédéric,

I won't have time to work on it, but I can answer questions on the code.

That's great that you're offering to support the Theano side!

At the moment I'm not very productive, because everything I do with one calculator (Torch or Theano) I have to do with the other, and I'm quite impatient to get results!

So at the moment, the two single layer LSTM pieces of code I'm working on are,

So to make some quick progress here, may I ask whether you know of any simple tests, using RNG-based sequence generators, that I could apply to check the code of a single-layer LSTM? The only thing I can think of is binary n-grams with n = 3 or 4, as I don't think a single-layer LSTM would work for more complex n-grams. I'd prefer not to use any datasets, and simply to use a probabilistic sequence for which I can calculate the optimal (Bayesian) estimator - it just seems the simplest LSTM configuration and the purest test to start with.
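Concretely, the acceptance test I have in mind looks something like this - model_predict is a hypothetical stand-in for whichever LSTM implementation is under test, and generate / optimal_pred are from my earlier sketch:

```python
import numpy as np

def per_bit_nll(preds, targets):
    return -np.mean(targets * np.log(preds) + (1 - targets) * np.log(1 - preds))

seq = generate(10_000)            # from the beta-binomial sketch above, n = 3
targets = seq[3:]

optimal_nll = per_bit_nll(optimal_pred(seq), targets)
# model_nll = per_bit_nll(model_predict(seq), targets)   # hypothetical
# A correct single-layer LSTM should approach optimal_nll from above;
# a buggy implementation typically plateaus well short of it.
```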

Best regards

Aj.

kyunghyuncho commented 9 years ago

@AjayTalati

I don't fully understand what you mean by "no internal state tracking". If there's no internal state, it is by definition not a recurrent neural network (hence, no LSTM). What did you mean by this?

AjayTalati commented 9 years ago

Hi, kyunghyuncho

Yes, you're absolutely right - my wording doesn't make any sense.

I'm interested in using LSTMs for encoder and decoder networks. For this particular application any RNN will do, but LSTMs are more interesting.

So let RNN^enc be the function the encoder network enacts at a single time-step t: it receives some input vector x_t and the previous encoder hidden vector h^enc_{t-1}, and outputs a hidden vector h^enc_t. For my problem that's all I need for the other calculations in the system.

So I guess to clarify things: if I want to use LSTM networks for the encoder (and decoder), I need to add that LSTM^enc additionally outputs c^enc_t and receives c^enc_{t-1}, but these are not used for any other calculations in the system (except for propagating the LSTM).
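In symbols, just restating the above:

```latex
h^{enc}_t = \mathrm{RNN}^{enc}\left(x_t,\ h^{enc}_{t-1}\right)
\qquad \text{vs.} \qquad
\left(h^{enc}_t,\ c^{enc}_t\right) = \mathrm{LSTM}^{enc}\left(x_t,\ h^{enc}_{t-1},\ c^{enc}_{t-1}\right)
```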

Does that help/make sense?

I'm very happy to work with people in both the Torch and Theano communities on this.

Thanks a lot for your helpful comment,

Cheers,

Aj

kyunghyuncho commented 9 years ago

I think I understand what you meant by 'internal state tracking'. So, essentially, you want to do language modelling (next symbol prediction) with RNN (but not necessarily with LSTM).

Now, let me restate what you want: a small task (dataset) and two different implementations of RNN (implemented using Theano and Torch). Is this right?

AjayTalati commented 9 years ago

Hi, I think you've made the mistake of trying to fit applications to the test problems I suggested.

I think I understand what you meant by 'internal state tracking'. So, essentially, you want to do language modelling (next symbol prediction) with RNN (but not necessarily with LSTM).

I have no real applications in mind for my work. At the moment I'm actually working on just generally building recurrent variational auto-encoders. Variational auto-encoders were introduced by Durk Kingma & Max Welling recently. They originally used an MLP for approximating the posterior, but this was for simple illustration purposes, and their theory is much more general. It's possible to use an RNN (or an LSTM) as the approximator for the posterior, and that gives a new class of variational auto-encoders suitable for generative time series modelling (which as a special case could be based on Gaussian latent variable models). If you are interested, I suggest Justin Bayer's recently updated paper,

LEARNING STOCHASTIC RECURRENT NETWORKS

I think he used Python/Theano for his examples. I'd also suggest Alex Graves' now-classic report,

GENERATING SEQUENCES WITH RECURRENT NEURAL NETWORKS

which I'm sure you will have read.

Now, let me restate what you want: a small task (dataset) and two different implementations of RNN (implemented using Theano and Torch). Is this right?

Yes, that's close to what I would like to do. In order to test my code I'd prefer a Markov chain model rather than a real dataset. Flicking through my textbooks, the one simple sequence generation example I remembered which had a simple form for the optimal estimator was the beta-binomial. All the details for this are given in Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective (2012 edition), sections 3.3.4 (p. 77) and 17.2.2 (p. 591).

This is also task 4.4 in the Neural Turing Machine paper.

nicholas-leonard commented 9 years ago

@nouiz @kyunghyuncho good job on that LSTM tutorial guys. Looks great.

kyunghyuncho commented 9 years ago

@nicholas-leonard Thanks! And, for proper credit assignment, thanks to @carriepl

AjayTalati commented 9 years ago

@nouiz @kyunghyuncho may I ask if you guys are interested in coding an LSTM variational autoencoder using Theano? Is this a project which would be fun for you both, and for other Theano people?

I definitely am committed to doing it, and am actively working on it in Torch. I know @nicholas-leonard wants something like this for DP, and I think @soumith also likes this.

If everyone is happy, then maybe developing an LSTM variational autoencoder would be a good benchmark problem to work on?

AjayTalati commented 9 years ago

So coding the LSTM variational autoencoder is straightforward using nngraph and the tricks we learnt from Oxford University on Wednesday.

I'm guessing that it's going to be fairly straightforward in Theano too? So if @kyunghyuncho and @nouiz don't want to do it, I might give it a go when I have my full system working and tested in Torch.

At the moment this seems to be the most interesting benchmark problem - one I could do on my own anyway.

Here's the computational graph. Inputs from t-1 are in lime, outputs for this time step t are in cyan, and the gold bit is where the Monte Carlo magic happens.

The LSTM encoder is the top group of nodes, the middle bit is where the sampling of Q(z|x) takes place, and the bottom set of nodes is the LSTM decoder.

[image: system_forward_graph]

kyunghyuncho commented 9 years ago

Unfortunately, I don't have time to implement STORN at the moment.


nouiz commented 9 years ago

Same for me. I'll be busy for a few weeks at least. But I can answer questions on the code.


AjayTalati commented 9 years ago

Hi @nouiz and @kyunghyuncho

No problem guys, I can understand that this is not a high priority for anyone.

It's really nice that you both took the time to show your interest - I appreciate that a lot.

The general approach of checking things using a different calculation method is important to me, so if I run into problems with my Torch system - which is likely - I'll definitely fall back to a Theano implementation.

A working bare-bones system is less than 250 lines of Torch code; I expect it would be about the same for Theano. I'm happy to explain it to you, as well as to help implement it. Hopefully we can catch up in a couple of months.

Best regards,

Aj

kyunghyuncho commented 9 years ago

Ajay, feel free to contact me or @nouiz if you run into any issues implementing LSTMs with Theano.


AjayTalati commented 9 years ago

Hey @nicholas-leonard and @soumith, how's it going? I've got a Torch version of the DRAW paper implemented as development code. I'm still testing the attention mechanisms for the full version of the paper, but training this thing takes forever!

I was just wondering if you guys (or anyone else) want to help out with optimizing the code and testing it? I'm a problem solver (research physicist/mathematician), not a pro software engineer or a data scientist. So if you guys can take it to the consumer/pro level, we could get a test going of it versus this Theano/Blocks implementation of the DRAW paper. I've got no idea how Blocks works, don't have the time to read its code, and I'm clueless about how to adapt it to different problems.

Fair play to them though - annealing the Adam learning rate, setting the number of glimpses/iterations/clonings of the LSTM sequence to 10 (I was using 64), and their KLD plot are all really cool, and I'm looking forward to trying this stuff :+1:

To be honest I like my development code, with all its readability, error catching, and printouts :+1: - at least I wrote it all, understand it, and can quickly diagnose/change it and use it for other projects :100: I've got some interesting modifications/applications in the pipeline :+1: It's bare bones - all the modules are written as required (there's no extra junk), and the only required Torch packages are nn and nngraph. So it should be easy to fit it into DP and the new rnn library, which would be a perfect collaboration and environment for testing it and some other cool stuff :+1:

Best,

Aj

[image: sampler_module_backward_graph]

[image: encoder_module_backward_graph]

[image: decoder_module_backward_graph]

soumith commented 9 years ago

@AjayTalati if you give me access today, I can help you optimize it. I have some free time today. This is awesome - you've gotten everything up and working. Let's chat on Gitter.

AjayTalati commented 9 years ago

Sorry guys, I'm getting bad results, AGAIN!

It seems I'm running into instabilities in my encoder cell values after the middle of training - that's c_enc_t in the above. For a few small toy data sets I tried, it worked OK - but for reasonably sized training sets it's screwed.

Durk Kingma just posted some great advice - here. That guy's really smart! So I'm going to follow his advice and try using batch normalization - this might take a while.

To be honest batch norm looks a bit complicated/involved, so I might just try a simple L2 cost penalty on the terminal cell values of the encoder, ||c_enc_T||_2. I like simple things.
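The penalty version would just be something like this (a sketch - lambda_c is a made-up knob, and neg_ll_bound stands in for the variational bound I'm actually training):

```python
import numpy as np

def penalised_loss(neg_ll_bound, c_enc_T, lambda_c=1e-3):
    """Training loss plus an L2 penalty on the encoder's terminal cell
    state, to discourage c_enc_T from blowing up late in training."""
    return neg_ll_bound + lambda_c * np.sum(c_enc_T ** 2)
```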

Bottom line - it ain't working, and isn't ready for you guys to optimize yet - maybe next weekend - apologies.

skaae commented 9 years ago

I didn't read the entire thread, but here are a few suggestions. I did a DRAW implementation in Theano; it's available here: https://github.com/skaae/lasagne-draw. Feel free to ask if you have any questions.

I modified the Oxford LSTM code a little and got a ~40% speedup by concatenating the dot products and slicing: https://gist.github.com/skaae/ea4320e17379d408e693
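The idea, sketched in NumPy rather than the actual Torch code from the gist (shapes are placeholders): compute all four gate pre-activations with one big matrix product, then slice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
# one concatenated weight matrix covering all four gates
W = rng.normal(0, 0.1, (4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)

def lstm_step_fused(x, h_prev, c_prev):
    z = W @ np.concatenate([x, h_prev]) + b      # single GEMM for all gates
    i = sigmoid(z[0 * n_hid:1 * n_hid])          # slice out each gate
    f = sigmoid(z[1 * n_hid:2 * n_hid])
    o = sigmoid(z[2 * n_hid:3 * n_hid])
    g = np.tanh(z[3 * n_hid:4 * n_hid])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

h, c = lstm_step_fused(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid))
```

One large matrix product keeps the hardware busier than four small ones, which is presumably where the ~40% comes from.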

AjayTalati commented 9 years ago

Hi Soren,

thanks a lot for the suggestions - I've got a Torch implementation of DRAW with attention working now.

I'm just trying to train it for two & three digit MNIST at the moment. To be honest, I've found training DRAW with attention for two digits to be much more time-consuming than for one digit MNIST - it's taken me two weeks of trial & error to get it looking right. In the end I got fed up and added dropout, and got better results - still training, but I'll post a video later.

[image: training_draw_with_attention_on_2_digit_mnist]

Thank you very much for the optimized LSTM code - I'll try it later and give you feedback. Hopefully I'll be able to tidy up my code and post it on GitHub soon!

Best regards,

Aj

Edit - I just saw that you've managed to generate digits conditioned on their class - your second plot - that's REALLY COOL :+1: I haven't done that yet. How did you do it?

skaae commented 9 years ago

I concatenated the latent z with y during training. Then you can conditionally generate digits by clamping y during the generation. What hardware are you training the network on? I did it on a K40, which took around 2 days.

AjayTalati commented 9 years ago

I concatenated the latent z with y during training. Then you can conditionally generate digits by clamping y during the generation.

That's really cool, thanks :+1: I'll have to have a think about how to implement it.

Unfortunately it takes forever to train 2 digit MNIST - at least with my implementation. IIRC one digit MNIST with attention, 64 timesteps, and 256 LSTM cells took 2 days as well! Sorry, my wording above was poor - it's taken me 2 weeks of trial and error in hyper-parameter tuning and messing around to get a stable system. In terms of actual training time, it takes about 48 hours of wall clock time and about 25K minibatches (2.5 million training images) to get a decent system for 2 digit MNIST.

I'm not too clued up on hardware, this is what I think is in the box?

Cheers,

Aj

AjayTalati commented 9 years ago

Hey folks :)

here's a video of the best 2 digit MNIST model I've managed to train so far with my Torch implementation of DRAW. It's not great, though.

I was just wondering if anyone with a Theano implementation of DRAW has managed to train anything better on this data set?

It took 3 days to train and got a negLL lower bound of 223 in the end. The hyperparameters are about the same as in section 4.3 of the DRAW paper, except I added dropout:

image size - 60x60
dataset - MNIST 28 binarized
timesteps/glimpses - 32
dropout mask prob - 0.15
ADAM learning rate - 3e-4
minibatch size - 100

To date I still haven't managed to train a good 64 timestep model on the 2 digit MNIST dataset, and it's REALLY BUGGING me :( !!! I'm guessing that a 64 timestep model would get a negLL bound of sub-200 on the 2 digit dataset, but that's conjecture at the moment.

For any implementation (Torch or Theano), I think you need to get a sub-200 negLL on this dataset for your implementation to have any credibility. If you can do that, it's fairly convincing that your system actually works :+1: At the moment I'm guessing this would take around 30K minibatches (or more), so at a rate of 10K minibatches per day, it's going to be a long 3 or 4 days :-1:

soumith commented 9 years ago

What's the one on the left vs the one on the right?

AjayTalati commented 9 years ago

The one on the left is the input image (the data) that's presented to the DRAW reader network. The one on the right is the reconstructed image (the probabilistic model of the data) generated by the DRAW writer network.

So the reader with attention focuses where the green box is on the input image, the one on the left. The output of the reader goes into the encoder, which sends a code, via a noisy channel, to the decoder, which controls the writer. The writer focuses its attention where the red box is, adding/subtracting probability mass to/from the reconstructed image at each timestep. The decoder also controls where the reader focuses its attention on the next timestep.

Apologies mate - it's not that easy to explain!

Edit - I think the paper has a different way of animating/defining the green and red bounding boxes - this is just something I hacked together on the fly.

nicholas-leonard commented 9 years ago

Can it sample double digits from random latent variables instead of an input?


AjayTalati commented 9 years ago

Hi Nick,

Yes, it can do that now. For two digits picked out of the set {0,1,2,...,9}, there are 55 possible unordered combinations with repetition, i.e. 11!/(2! 9!) = 55.
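As a quick sanity check on the counting (the multiset coefficient):

```python
from math import comb
assert comb(10 + 2 - 1, 2) == 55   # C(11, 2): unordered pairs with repetition
```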

If I train the network on the input images and this classification vector of length 55 - as @skaae described above - then for stochastic data generation I can pick any of the 55 classes and generate a probabilistic model for it. So I could generate a fantasy/probabilistic model for, say, the unordered combination 0 and 7, if requested.
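To sketch what I think the conditioning looks like (decoder_step is a hypothetical stand-in for my actual decoder LSTM, and the sizes are illustrative):

```python
import numpy as np

NUM_CLASSES = 55      # unordered digit pairs, counted above
Z_DIM = 100           # illustrative latent size

def onehot(y, n=NUM_CLASSES):
    v = np.zeros(n)
    v[y] = 1.0
    return v

rng = np.random.default_rng(0)

# training: z is sampled from the approximate posterior Q(z|x);
# generation: z is sampled from the prior N(0, I) and y is clamped.
z = rng.normal(size=Z_DIM)
y = 7                                          # clamp to the class you want
decoder_input = np.concatenate([z, onehot(y)])
# h_dec, c_dec = decoder_step(decoder_input, h_dec, c_dec)   # hypothetical
```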

I'm just trying to train a 64 timestep system at the moment, because the quality of the 55 fantasy pairs of digits is not so good for the 32 timestep systems I've trained. Going from 32 to 64 timesteps is turning out to be not so easy - LOTS OF NUMERICAL PROBLEMS, especially towards the end of training, when it starts to overfit!

lim0606 commented 9 years ago

Hi!

I hope you @AjayTalati are still interested in implementing DRAW.

I'm having trouble understanding the selective attention model in DRAW. I'm not sure this is the right place to ask, but it seems to be the only place where people have shown interest in DRAW implementations.

Specifically, I want to understand 1) how Fig. 3 in the DRAW paper is drawn and 2) why the authors did not normalize w_t in eq. 28.

As far as I understood, the differentiable attention mechanisms (especially the ones in the two reference papers mentioned in the DRAW paper) are basically Gaussian mixtures. In that sense, I have no idea how to draw Fig. 3 (the clean squared rectangle). Is the rectangle just drawn with size (N-1)·delta x (N-1)·delta and thickness sigma?

In the same sense, considering the soft attention as a Gaussian mixture, it would be natural to normalize w_t in eq. 28; however, the paper doesn't. I think there should be a reason, but I couldn't find it.

Can anyone help me to understand the paper properly?

AjayTalati commented 9 years ago

Hi @lim0606

The size of the zoomable attention boxes in the video and Fig. 3 is controlled by the stride, eqns. 21 and 24. These are trainable functions of the previous decoder output h^dec_{t-1} for the reader, and the current decoder output h^dec_t for the writer.

w_t in eqn. 28 is an NxN patch of probability mass that the decoder wants to add to the canvas IIRC, not a mixture weight. The writer's filterbank matrices in eqn. 29 control where on the canvas this probability mass is added/subtracted.
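Here's a rough NumPy sketch of the read and write steps as I read eqns. 19-29 of the paper (all the numbers are placeholders):

```python
import numpy as np

def filterbank(g, delta, sigma2, N, A):
    """N x A filterbank: row i is a Gaussian over image positions, centred
    at mu_i = g + (i - N/2 - 0.5) * delta (eqns. 19-20) and normalised so
    each row sums to 1 (eqns. 25-26)."""
    mu = g + (np.arange(1, N + 1) - N / 2 - 0.5) * delta
    a = np.arange(A)
    F = np.exp(-(a[None, :] - mu[:, None]) ** 2 / (2 * sigma2))
    return F / np.maximum(F.sum(axis=1, keepdims=True), 1e-8)

A = B = 28                        # image width / height
N = 5                             # attention grid size
gX = gY = 14.0                    # grid centre (a function of h^dec)
delta, sigma2, gamma = 3.0, 1.0, 1.0

FX = filterbank(gX, delta, sigma2, N, A)      # horizontal filters
FY = filterbank(gY, delta, sigma2, N, B)      # vertical filters

image = np.random.rand(B, A)
glimpse = gamma * FY @ image @ FX.T           # eqn. 27: reader's N x N glimpse

w = np.random.randn(N, N)         # the N x N patch w_t from eqn. 28
canvas = np.zeros((B, A))
canvas += (1.0 / gamma) * FY.T @ w @ FX       # eqn. 29: write the patch
```

Note the filter rows are normalised but the patch w itself is not, which I think is exactly what you were asking about.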

Have a look at Jorg's implementation here - it will be much easier for you to understand than my code!

Hope that helps,

Cheers,

Aj

lim0606 commented 9 years ago

Thanks a lot @AjayTalati

I think I get it :)

Jaehyun Lim