oxford-cs-ml-2015 / practical6

Practical 6: LSTM language models
https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/

How to restart training of a saved model? #3

Closed AjayTalati closed 9 years ago

AjayTalati commented 9 years ago

Hi,

I've been experimenting with using the model_utils.lua file on some of my own concatenations of gModules. I was just wondering if you could give an example of how to use

model_utils.combine_all_parameters

and

model_utils.clone_many_times

to get the params and grad_params of a saved protos which can then be used with some of the appropriate lines of train.lua to restart training?

Just to give some context - what I've tried is instead of saving the full protos, just saving the following table,

table_to_save = { options = opt , saved_params=params, saved_grad_params=grad_params }

Then used basically all of train.lua, with the following,

saved_data = torch.load(saved_filename)

opt = saved_data.options

params:copy( saved_data.saved_params )

grad_params:copy( saved_data.saved_grad_params )

That is, I recreate the system using the same options and clone it in the same way - the main change is simply transferring the saved params and grad_params before starting the optimization.

I was just wondering if this is the right way to do it?

Thanks for your help :+1:

Best regards,

Aj

bshillingford commented 9 years ago

Hi, I recommend saving the cloned sequences of modules instead of the protos, just to make things easier. This would result in larger saved files of course, though, since the activations would be saved as well (but still the same number of weight matrices).

If you want to serialize just the protos: in train.lua, from line 53 onward, the protos have all their params pointing to subtensors of one shared tensor. So you can serialize just the protos periodically in the training loop; I think I did that there. To reload the protos, wrap lines 43-53 in an if statement that checks whether you want to load from a file, then either recreate protos as lines 43-53 already do, or deserialize them from the file.

The clone_many_times must be done after this (see line 54 for an explanation why).
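A rough sketch of that if statement, for anyone following along. It is self-contained, so it uses nn's own `getParameters` and `clone(...)` as stand-ins for `model_utils.combine_all_parameters` and `model_utils.clone_many_times`; `make_protos` and `get_protos` are hypothetical names invented here, and the toy modules are not the ones in train.lua.

```lua
require 'torch'
require 'nn'

-- Hypothetical stand-in for the prototype construction in train.lua.
local function make_protos()
    return {
        hidden  = nn.Linear(4, 4),
        softmax = nn.Sequential():add(nn.Linear(4, 3)):add(nn.LogSoftMax()),
    }
end

-- Hypothetical helper: deserialize protos from `path` if given, else build fresh.
local function get_protos(path)
    if path then
        return torch.load(path)
    end
    return make_protos()
end

-- First run: build the prototypes and save them.
local protos = get_protos(nil)
local path = os.tmpname()
torch.save(path, protos)

-- Restart: load the prototypes instead of recreating them.
local restored = get_protos(path)
os.remove(path)

-- Flatten AFTER loading (the repo does this with combine_all_parameters).
local container = nn.Sequential():add(restored.hidden):add(restored.softmax)
local params, grad_params = container:getParameters()

-- Only now create the per-timestep clones, sharing the flattened parameters.
local seq_length = 3
local clones = {hidden = {}, softmax = {}}
for t = 1, seq_length do
    for name, proto in pairs(restored) do
        clones[name][t] = proto:clone('weight', 'bias', 'gradWeight', 'gradBias')
    end
end
```

The order is the point: load or build, flatten, then clone, so that every clone ends up referring to the one flat tensor the optimizer sees.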

Cheers,

Brendan

AjayTalati commented 9 years ago

Wow thanks for the quick reply :)

I don't really understand how you can get the params of the cloned sequence of modules, and put them into memory as a shared 1D tensor, that the optimizer can use?

I'm sort of confused because clones is a table of modules, not a single module? Is there some trick I'm missing?

bshillingford commented 9 years ago

Here's the sequence of operations:

  1. The prototypes are generated (protos)
  2. Their parameters are flattened by allocating a new tensor that holds all of their weights/biases so that the optimizer can access them easily, and recursively replacing the protos' parameters with new tensors pointing to this one tensor.
  3. Now we create the cloned sequence of modules using the prototypes... remember that no new params are allocated now, and each instance in the clone just has a reference to the same tensors.

I think step 3 is where you're a bit confused here. We create the sequences of clones from the protos after the parameters of the prototypes all point to the shared tensor for optim.

Does that help?
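The three steps above can be tried on a toy module. This is only a minimal sketch: it uses nn's built-in `getParameters` and `clone(...)` in place of the `model_utils` functions, and a single `nn.Linear` as an assumed stand-in for the real prototypes.

```lua
require 'torch'
require 'nn'

-- 1. One prototype module (a stand-in for the entries of protos).
local proto = nn.Linear(4, 3)

-- 2. Flatten: params and grad_params become single 1D tensors, and the
--    module's weight/bias are re-pointed at slices of them.
local params, grad_params = proto:getParameters()

-- 3. Clone AFTER flattening, sharing the parameter tensors with the prototype.
local clones = {}
for t = 1, 3 do
    clones[t] = proto:clone('weight', 'bias', 'gradWeight', 'gradBias')
end

-- A write through the flat tensor is visible in every clone.
params:fill(0.5)
print(clones[1].weight[1][1], clones[3].bias[2])
```

So `clones` is a table of modules, but the optimizer never touches it directly; it only sees `params` and `grad_params`, which every clone refers to.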

AjayTalati commented 9 years ago

Yes - I think so? :+1:

So the variables, params and grad_params are basically addresses to the shared tensor, and should be sort of saved, (for want of a better word), when I save the clones closure? So when I reload the sequence of clones, the variables params and grad_params will be reloaded and still point to the shared tensor (which holds the weights/biases).

That seems a bit magical, but easy to do ? Basically I just reload the closure clones, and start the optimization loop - as before - no need to use either of the functions from model_utils.lua ?

bshillingford commented 9 years ago

Sorry for the confusion, in my first reply I listed two options:

  1. serialize only protos and params and grad_params to save space in the saved model, then recreate clones on load (I explained this one in the " lines 43-53 " paragraph in the first reply: just serialize {params,grad_params,protos} then recreate clones after you deserialize)
  2. just save clones, protos, and params and grad_params all together in a table

If you save clones like you just mentioned, that's correct as long as you serialize params and grad_params at the same time (e.g. put them all in one table {params,grad_params,clones,protos}), which I neglected to mention. No need to use model_utils.lua if you serialize everything together. IIRC protos isn't used past line 50 or so, so you probably don't need to serialize it. I'd serialize it anyway though.

(Torch's serialization system will see that their Tensors point to the same Storage objects as all the params inside the modules, and so they'll point to the same thing on deserialization too.)
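A small way to check this behaviour yourself, using only plain tensors. The names `params` and `weight` here are just illustrative, and the temporary file path comes from `os.tmpname()`.

```lua
require 'torch'

-- One flat parameter tensor plus a view into it, like params and a module's weight.
local params = torch.Tensor(6):fill(1)
local weight = params:narrow(1, 1, 4)   -- shares params' Storage

-- Serialize both in ONE table so Torch can record that they share a Storage.
local path = os.tmpname()
torch.save(path, {params = params, weight = weight})
local loaded = torch.load(path)
os.remove(path)

-- A write through the reloaded flat tensor shows up in the reloaded view.
loaded.params:fill(7)
print(loaded.weight[1])
```

If the two tensors were saved in separate `torch.save` calls instead, each file would get its own copy of the Storage and the link between them would be lost on reload.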

Edit: To comment on your initial post: I missed part because I read it before you edited. Your way with copying parameter values would work as well, but option 1 above is the way I usually do it.

AjayTalati commented 9 years ago

Wow, thanks a lot for the great explanation :+1: - I'm trying it now - it takes a long time for my model to train - so it's not easy to tell if it's working?? I think it is though.

So basically just saving and unpacking to their original names, the 'things' in the following table, i.e

table_to_save = {params,grad_params,clones,protos, opt}

save it, reload it and unpack it

params,grad_params,clones,protos, opt = unpack( table_to_save )

is all you have to do? All the parameter/memory sharing technicalities are magically reloaded. I didn't think it would be that easy?

On a different level though it's still kind of confusing/unsatisfying? This is a bit theoretical, so bear with me, but in terms of information theory and source coding, my system is a variational autoencoder. So if I add up the number of bits it takes for the algorithms of the LuaJIT compiler, the essential modules of Torch I use, my VAE system's .lua files and my trained model's saved parameters, which are just a few million 64-bit numbers, I should have the amount of information that's coded into my algorithmic/generative model, which is basically a probability distribution of my dataset, cluttered MNIST32. So in total that lot is, I guess, 1 GB at the most.

If I clone my trained modules and then save them all, the full amount of data saved comes to about 5 GB each time. So it just seems that in terms of information and data compression, it's more satisfactory just to recreate the system fresh, using model_utils, and then :copy the saved parameter numbers into the shared param tensor of the freshly (re)built system.

What do you think?

I need to do some more experiments just for my own sanity to make sure both methods work :+1:

bshillingford commented 9 years ago

That's the correct way to serialize, yes, and sharing/references are handled correctly. Without going into too much detail, there's a few different levels of a serialization system's complexity regarding pointers/references. In C/C++ notation, if &a.b == &c.b, then when serializing a and c together we'd expect &a.b == &c.b when deserializing too. In the case of parameter sharing, a and c are Tensors, and the b is the shared underlying Storage. Remember there's only one Storage for the parameters in the entire network. More advanced serialization libraries can correctly (de)serialize pointer/reference cycles (torch probably does as well, but I haven't checked, and this situation is probably rare for most torch code anyway).

The amount of space is large because the activations and gradients in each clone in the network are being serialized too (i.e. module.output and module.gradOutput for each module). The values in these are obviously useless. To avoid this, serialize just params,grad_params,protos,opt and recreate clones using clone_many_times when you start, or just use your solution of serializing parameter values and copying them (but remember to do this after calling combine_all_parameters).
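A minimal sketch of the second route (save only the parameter values, rebuild, flatten, then copy). `build` is a hypothetical helper and its toy network is an assumption; in this repo the flattening step would be `model_utils.combine_all_parameters(...)` rather than `getParameters`.

```lua
require 'torch'
require 'nn'

-- Hypothetical builder standing in for the model-construction part of train.lua.
local function build(opt)
    local proto = nn.Sequential()
        :add(nn.Linear(opt.input_size, opt.rnn_size))
        :add(nn.Tanh())
    -- Flatten first, so params is the single tensor the optimizer will use.
    local params, grad_params = proto:getParameters()
    return proto, params, grad_params
end

-- First run: build, (train), then save only the options and parameter values.
local opt = {input_size = 5, rnn_size = 8}
local proto, params, grad_params = build(opt)
params:uniform(-0.08, 0.08)
local path = os.tmpname()
torch.save(path, {options = opt, saved_params = params})

-- Restart: rebuild from the saved options, flatten, THEN copy the values in.
local saved = torch.load(path)
os.remove(path)
local proto2, params2, grad_params2 = build(saved.options)
params2:copy(saved.saved_params)

-- Assumption: the training loop zeroes gradients before each backward pass,
-- so starting from zeroed gradients here is harmless.
grad_params2:zero()
```

The saved file then holds one flat tensor and a small options table, with no activations at all. Copying before flattening would write into tensors that the flattening step later replaces, which is why the order matters.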

AjayTalati commented 9 years ago

Brilliant - thank you very much for the clear explanation :+1:

AjayTalati commented 9 years ago

Hi Brendan,

thanks for all the great help you've given me. Just to share with you a little trick I found.

i) Train a system for say n timesteps/clones of the master modules and save (serialize) the parameter, and grad parameter tensors, and the first clone in the method as you explained above,

ii) then rebuild your system fresh with an extra timestep/clone n+1, using model_utils.clone_many_times, on the first clone

I think this little trick is working, (at least for the variational auto encoder I'm working on :+1: )

AjayTalati commented 9 years ago

Hi, just an update on my suggested trick of restarting training with a rebuilt system with an added clone - after doing more controlled experiments - it does not seem to be working?

Basically I've found that there's no substitute for fixing a number of clones/timesteps and being patient, waiting for the system to start breaking its symmetries.

My suggested trick of using the parameters of a shorter system, as the initial parameters of a system with an extra clone/timestep, seems to restrict the parameter space, and result in a higher final loss. The standard method of simply training using the desired number of timesteps, and being patient, or finding better ways to initialize the system, or other tricks, seems to result in a lower final loss.

Very sorry for the half-baked idea :-1:

AjayTalati commented 9 years ago

Thanks a lot for all your great help Brendan :+1:

I think anyone who reads through this issue will get a few choices of how to save and restart training networks.

Best regards, Aj

mszlazak commented 9 years ago

It would be nice if I could get the code to work.

https://github.com/oxford-cs-ml-2015/practical6/commit/96749c8d9bc93f864c94c048a3c8cd73f59f733b#commitcomment-10954747