Closed AjayTalati closed 9 years ago
Hi, I recommend saving the cloned sequences of modules instead of the `protos`, just to make things easier. This results in larger saved files, of course, since the activations are saved as well (but still the same number of weight matrices).
If you want to serialize just the `protos`: in train.lua, from line 53 onward, the `protos` have all their params pointing to subtensors of one shared tensor. So you can serialize just the `protos` periodically in the training loop; I think I did that there. To reload the `protos`, wrap lines 43-53 in an if statement that checks whether you want to load from a file, then either recreate `protos` as lines 43-53 already do, or deserialize it from a file.
The `clone_many_times` call must be done after this (see line 54 for an explanation why).
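A minimal sketch of that load-or-create pattern, under some loud assumptions: a single `nn.Linear` stands in for the real `protos`, `getParameters()` stands in for `model_utils.combine_all_parameters`, and `clone(...)` with sharing arguments stands in for `model_utils.clone_many_times`. The helper names and the checkpoint layout are hypothetical, not taken from train.lua.

```lua
require 'nn'

-- Hypothetical helper: return protos, params, grad_params either from a
-- checkpoint file or freshly built. nn.Linear is a stand-in prototype.
local function load_or_create(path)
    local f = io.open(path, 'r')
    if f then
        f:close()
        -- Everything was saved in one table, so the loaded params tensor
        -- still shares its Storage with the loaded modules.
        local ckpt = torch.load(path)
        return ckpt.protos, ckpt.params, ckpt.grad_params
    end
    local protos = { linear = nn.Linear(4, 3) }
    -- Stand-in for model_utils.combine_all_parameters: flatten into one tensor.
    local params, grad_params = protos.linear:getParameters()
    params:uniform(-0.08, 0.08)
    return protos, params, grad_params
end

-- Hypothetical helper: call this periodically from the training loop.
local function save_checkpoint(path, protos, params, grad_params)
    torch.save(path, { protos = protos, params = params, grad_params = grad_params })
end

-- Stand-in for model_utils.clone_many_times: clones share the prototype's parameters.
local function make_clones(proto, T)
    local clones = {}
    for t = 1, T do
        clones[t] = proto:clone('weight', 'bias', 'gradWeight', 'gradBias')
    end
    return clones
end

local path = os.tmpname()
os.remove(path)                                   -- no checkpoint yet: first call builds fresh
local protos, params, grad_params = load_or_create(path)
save_checkpoint(path, protos, params, grad_params)
local protos2, params2, grad_params2 = load_or_create(path)   -- simulated restart
os.remove(path)
local clones = make_clones(protos2.linear, 5)     -- cloning happens after load/create
```

The clones are built in the same place whichever branch ran, which is why they never need to be written to disk in this variant.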
Cheers,
Brendan
Wow thanks for the quick reply :)
I don't really understand how you can get the `params` of the cloned sequence of modules and put them into memory as a shared 1D tensor that the optimizer can use.
I'm a bit confused because `clones` is a table of modules, not a single module. Is there some trick I'm missing?
Here's the sequence of operations:
I think step 3 is where you're a bit confused here. We create the sequences of clones from the protos after the parameters of the prototypes all point to the shared tensor for optim.
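A minimal sketch of that ordering, assuming a single `nn.Linear` as a hypothetical prototype, with `getParameters()` in place of `model_utils.combine_all_parameters` and `clone(...)` with sharing arguments in place of `model_utils.clone_many_times`. It only illustrates why cloning after flattening lets one flat tensor drive every clone.

```lua
require 'nn'

-- 1. Build the prototype (hypothetical stand-in for one entry of protos).
local proto = nn.Linear(4, 3)

-- 2. Flatten: afterwards proto.weight and proto.bias are views into params.
local params, grad_params = proto:getParameters()

-- 3. Clone AFTER flattening, sharing the parameter tensors with the prototype.
local clones = {}
for t = 1, 5 do
    clones[t] = proto:clone('weight', 'bias', 'gradWeight', 'gradBias')
end

-- The optimizer only ever touches params, yet every clone sees the update.
params:fill(0.5)
assert(clones[1].weight[1][1] == 0.5)
assert(clones[5].bias[3] == 0.5)
```

So `clones` can be a table of modules while the optimizer still works on a single 1D tensor.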
Does that help?
Yes - I think so? :+1:
So the variables `params` and `grad_params` are basically addresses into the shared tensor, and should be saved (for want of a better word) when I save the `clones` closure? So when I reload the sequence of clones, the variables `params` and `grad_params` will be reloaded and still point to the shared tensor (which holds the weights/biases).
That seems a bit magical, but easy to do? Basically I just reload the closure `clones` and start the optimization loop as before, with no need to use either of the functions from `model_utils.lua`?
Sorry for the confusion, in my first reply I listed two options:
1. Save `protos`, `params`, and `grad_params` to save space in the saved model, then recreate clones on load (I explained this one in the "lines 43-53" paragraph in the first reply: just serialize `{params,grad_params,protos}`, then recreate `clones` after you deserialize).
2. Save `clones`, `protos`, `params`, and `grad_params` all together in a table.
If you save clones like you just mentioned, that's correct as long as you serialize `params` and `grad_params` at the same time (e.g. put them all in one table `{params,grad_params,clones,protos}`), which I neglected to mention. No need to use `model_utils.lua` if you serialize everything together. IIRC `protos` isn't used past line 50 or so, so you probably don't need to serialize it. I'd serialize it anyway though.
(Torch's serialization system will see that their Tensors point to the same Storage objects as all the params inside the modules, and so they'll point to the same thing on deserialization too.)
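A small sketch of that behaviour, using only `torch.save` and `torch.load` on two plain tensors (the names `flat` and `view` are illustrative, not from train.lua). Saved in one table, the two tensors still share a Storage after loading; saved in separate calls, each gets its own copy.

```lua
require 'torch'

local flat = torch.zeros(6)
local view = flat:narrow(1, 1, 3)   -- shares flat's Storage

-- Saved together: sharing is preserved on load.
local path = os.tmpname()
torch.save(path, { flat = flat, view = view })
local loaded = torch.load(path)
os.remove(path)

loaded.flat[1] = 42
assert(loaded.view[1] == 42)        -- the write is visible through the view

-- Saved separately: each file carries its own copy of the Storage.
local path_a, path_b = os.tmpname(), os.tmpname()
torch.save(path_a, flat)
torch.save(path_b, view)
local flat2, view2 = torch.load(path_a), torch.load(path_b)
os.remove(path_a)
os.remove(path_b)

flat2[1] = 7
assert(view2[1] == 0)               -- no longer linked
```

This is why `params` and `grad_params` need to go into the same table as the modules they back.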
Edit: To comment on your initial post: I missed part because I read it before you edited. Your way with copying parameter values would work as well, but option 1 above is the way I usually do it.
Wow, thanks a lot for the great explanation :+1: - I'm trying it now. It takes a long time for my model to train, so it's not easy to tell if it's working, but I think it is.
So basically, just saving the 'things' in the following table and unpacking them to their original names, i.e.
table_to_save = {params,grad_params,clones,protos, opt}
save it, reload it and unpack it
params,grad_params,clones,protos, opt = unpack( table_to_save )
is all you have to do? All the parameter/memory sharing technicalities are magically reloaded. I didn't think it would be that easy.
On a different level, though, it's still kind of confusing/unsatisfying. This is a bit theoretical, so bear with me, but in terms of information theory and source coding, my system is a variational autoencoder. So if I add up the number of bits that it takes for the algorithms of the LuaJIT compiler, the essential modules of Torch I use, my VAE system's .lua files, and my trained model's saved parameters (just a few million 64-bit numbers), I should have the amount of information that's coded into my algorithmic/generative model, which is basically a probability distribution over my dataset, cluttered MNIST32. So in total that lot is, I guess, 1 Gig at the most.
If I `clone` my trained modules and then save them all, the full amount of data saved comes to about 5 Gig each time. So in terms of information and data compression, it seems more satisfactory just to recreate the system fresh using `model_utils`, and then `:copy` the saved parameter values into the shared `params` tensor of the freshly (re)built system.
What do you think?
I need to do some more experiments just for my own sanity to make sure both methods work :+1:
That's the correct way to serialize, yes, and sharing/references are handled correctly. Without going into too much detail, there are a few different levels of complexity in how a serialization system handles pointers/references. In C/C++ notation, if &a.b == &c.b, then when serializing a and c together we'd expect &a.b == &c.b when deserializing too. In the case of parameter sharing, a and c are Tensors, and b is the shared underlying Storage. Remember there's only one Storage for the parameters in the entire network. More advanced serialization libraries can correctly (de)serialize pointer/reference cycles (torch probably does as well, but I haven't checked, and this situation is probably rare for most torch code anyway).
The amount of space is large because the activations and gradients in each clone in the network are being serialized too (i.e. `module.output` and `module.gradOutput` for each module). The values in these are useless for restarting. To avoid this, serialize just `params,grad_params,protos,opt` and recreate clones using `clone_many_times` when you start, or just use your solution of serializing parameter values and copying them (but remember to do the copy after calling `combine_all_parameters`).
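A sketch of the second route (save only the parameter values, rebuild, then copy), with the same caveats as before: `nn.Linear` is a hypothetical stand-in for the real model, `getParameters()` stands in for `model_utils.combine_all_parameters`, and the `build` helper and `saved_params` key are illustrative.

```lua
require 'nn'

-- Hypothetical builder: construct the model and flatten its parameters.
local function build()
    local proto = nn.Linear(4, 3)
    local params, grad_params = proto:getParameters()
    return proto, params, grad_params
end

-- Training run: save only the flat parameter values, no modules or activations.
local proto, params = build()
params:uniform(-0.08, 0.08)
local path = os.tmpname()
torch.save(path, { saved_params = params })

-- Restart: rebuild, flatten FIRST, then copy the saved values in.
local proto2, params2 = build()
local saved = torch.load(path)
os.remove(path)
params2:copy(saved.saved_params)

-- The rebuilt module now carries the trained weights.
assert(proto2.weight:equal(proto.weight))
assert(proto2.bias:equal(proto.bias))
```

Copying before flattening would write into a tensor that the modules stop referencing once the flattening step reallocates their parameters, which is the reason for the ordering.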
Brilliant - thank you very much for the clear explanation :+1:
Hi Brendan,
thanks for all the great help you've given me. Just to share with you a little trick I found.
i) Train a system for, say, `n` timesteps/clones of the master modules and save (serialize) the parameter and grad parameter tensors and the first clone, using the method you explained above,
ii) then rebuild your system fresh with an extra timestep/clone, `n+1`, using `model_utils.clone_many_times` on the first clone.
I think this little trick is working (at least for the variational autoencoder I'm working on :+1:)
Hi, just an update on my suggested trick of restarting training with a rebuilt system with an added clone: after doing more controlled experiments, it does not seem to be working.
Basically I've found that there's no substitute for fixing a number of clones/timesteps and being patient, waiting for the system to start breaking its symmetries.
My suggested trick of using the parameters of a shorter system as the initial parameters of a system with an extra clone/timestep seems to restrict the parameter space and result in a higher final loss. The standard method of simply training with the desired number of timesteps and being patient, or finding better ways to initialize the system, or other tricks, seems to result in a lower final loss.
Very sorry for the half-baked idea :-1:
Thanks a lot for all your great help Brendan :+1:
I think anyone who reads through this issue will get a few choices of how to save and restart training networks.
Best regards, Aj
Nice if i could get the code to work.
Hi,
I've been experimenting with using the model_utils.lua file on some of my own concatenations of gModules. I was just wondering if you could give an example of how to use `model_utils.combine_all_parameters` and `model_utils.clone_many_times` to get the `params` and `grad_params` of a saved `protos`, which can then be used with the appropriate lines of `train.lua` to restart training?
Just to give some context: what I've tried is, instead of saving the full protos, just saving the following table,
table_to_save = { options = opt , saved_params=params, saved_grad_params=grad_params }
Then I used basically all of `train.lua`, with the following,
saved_data = torch.load(saved_filename)
opt = saved_data.options
params:copy( saved_data.saved_params )
grad_params:copy( saved_data.saved_grad_params )
That is, I recreate the system using the same options and clone it in the same way. The main change is simply transferring the saved `params` and `grad_params` before starting the optimization.
I was just wondering if this is the right way to do it?
Thanks for your help :+1:
Best regards,
Aj