oxford-cs-ml-2015 / practical6

Practical 6: LSTM language models
https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/

How to use `model_utils.combine_all_parameters` ? #2

Closed AjayTalati closed 9 years ago

AjayTalati commented 9 years ago

Hi Brendan,

I was wondering if you could give me some advice on how to use model_utils.combine_all_parameters. I tracked down that that's where I was doing something fundamentally wrong, and I've now got my code to work - I just don't fully understand why it works. Or maybe I just got a few lucky runs?

The problem I'm working on is a variational autoencoder - in its simplest form it consists of three gModules: an encoder, a Q-sampler and a decoder. If I put these gModules into another nngraph gModule, say system_module, and use the standard parameter/gradient flattening tool,

params, grad_params = system_module:getParameters()

and then clone system_module, everything's fine and it works perfectly (and the code in feval is much more concise).
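Roughly, the working setup looks like this - just a sketch, where encoder, sampler and decoder stand for gModules I've already built, the sampler's noise input is left out, and seq_length is my sequence length:

```lua
require 'nn'
require 'nngraph'
local model_utils = require 'model_utils'   -- model_utils.lua from the practical

-- Wrap the three gModules in one parent gModule, flatten once, then clone.
local x       = nn.Identity()()
local z_stats = encoder(x)          -- encoder  : gModule producing the Q-distribution stats
local z       = sampler(z_stats)    -- Q-sampler: gModule (noise input omitted in this sketch)
local x_recon = decoder(z)          -- decoder  : gModule reconstructing the input
local system_module = nn.gModule({x}, {x_recon})

-- One flat view of every parameter and gradient in the system.
local params, grad_params = system_module:getParameters()

-- Clone the whole system once per timestep; the clones share the flat storage.
local clones_system = model_utils.clone_many_times(system_module, seq_length)
```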

If instead I didn’t put the encoder, sampler and decoder into a big system_module, and used

params, grad_params = model_utils.combine_all_parameters( system_t.encoder_t , system_t.sampler_t , system_t.decoder_t )

I was running into problems if I did not wrap the forward and backward calls of, say, system_t.sampler_t[t] inside unpack. So it seems you need to unpack the table outputs of the gModules in exactly the same order as you define them; otherwise it seems they do not line up with the parameters/grads stored in place in memory.
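For reference, the pattern I was trying to follow (based on how train.lua uses combine_all_parameters and clone_many_times, as far as I understand it) is roughly:

```lua
-- Flatten the three separate gModules into one params/grad_params pair first,
-- then clone each one per timestep, so every clone's weights point into the
-- same flat storage.
local params, grad_params = model_utils.combine_all_parameters(encoder, sampler, decoder)

local system_t = {}
system_t.encoder_t = model_utils.clone_many_times(encoder, seq_length)
system_t.sampler_t = model_utils.clone_many_times(sampler, seq_length)
system_t.decoder_t = model_utils.clone_many_times(decoder, seq_length)
```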

So, as a concrete example, I wonder if you could clarify the difference between your new method with model_utils.combine_all_parameters, where you unpack the output of the LSTM gModule directly,

```lua
-- backprop through LSTM timestep
dembeddings[t], dlstm_c[t-1], dlstm_h[t-1] = unpack(clones.lstm[t]:backward(
    {embeddings[t], lstm_c[t-1], lstm_h[t-1]},
    {dlstm_c[t], dlstm_h[t]}
))
```

and the old method, where for readability (to help me) I use intermediate tables for both the input and the output (which I guess messes up the memory addresses in the new method),

```lua
local input_of_LSTM_at_t   = {embeddings[t], lstm_c[t-1], lstm_h[t-1]}
local doutput_of_LSTM_at_t = {dlstm_c[t], dlstm_h[t]}

local dinput_of_LSTM_at_t = clones.lstm[t]:backward(input_of_LSTM_at_t, doutput_of_LSTM_at_t)

dembeddings[t] = dinput_of_LSTM_at_t[1]
dlstm_c[t-1]   = dinput_of_LSTM_at_t[2]
dlstm_h[t-1]   = dinput_of_LSTM_at_t[3]
```

I'm still new to Torch/Lua, and don't really understand model_utils.combine_all_parameters, getParameters(), or torch.pointer. Sorry for the long question - any chance of a little explanation?

Best,

Aj

bshillingford commented 9 years ago

Hi Aj,

In the code, combine_all_parameters just creates a contiguous 1D tensor large enough to hold all the parameters, then points all the modules to this one 1D tensor. The existing weight and gradient tensors are discarded.

Suppose you had three modules, M1, M2 and M3, where M1 and M2 shared their parameters (and gradients), but M3 did not. Calling combine_all_parameters on these three will produce a 1D tensor with space for the params of M1/M2 and of M3, with the params of M1 and M2 both referring to the same place in memory. The params tensor for M3 will point to the latter part of the new 1D tensor. The 1D tensor gets returned as the first return value of combine_all_parameters, and the corresponding gradParams tensor (constructed at the same time) is returned too.
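A stripped-down sketch of the idea (this ignores shared parameters and the gradParams side, both of which the real code in model_utils.lua does handle) looks something like:

```lua
require 'torch'
require 'nn'

-- Naive illustration of flattening: gather every weight/bias tensor, allocate
-- one flat tensor big enough for all of them, copy the old values in, then
-- re-point each module tensor so it becomes a view onto a slice of the flat one.
local function naive_flatten(modules)
  local tensors, total = {}, 0
  for _, m in ipairs(modules) do
    local mparams = m:parameters() or {}   -- first return value: table of weight/bias tensors
    for _, t in ipairs(mparams) do
      table.insert(tensors, t)
      total = total + t:nElement()
    end
  end

  local flat = torch.Tensor(total)
  local offset = 1
  for _, t in ipairs(tensors) do
    local n = t:nElement()
    flat:narrow(1, offset, n):copy(t)                      -- keep the existing values
    t:set(flat:storage(), offset, t:size(), t:stride())    -- module now views the flat storage
    offset = offset + n
  end
  return flat
end
```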

This gives us a 1D view of the params in a set of modules, similar to how nn.Container modules (e.g. nn.Sequential, among others) perform this reallocation. As for torch.pointer, it is used to check whether two objects are the same object.
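For example, after calling combine_all_parameters you can check that a given module's weights really do live inside the returned flat tensor, along the lines of:

```lua
-- 'params' below is the flat tensor returned by combine_all_parameters, and
-- 'encoder' is one of the modules you passed in.
local enc_weight = encoder:parameters()[1]   -- first weight tensor of the encoder
print(torch.pointer(enc_weight:storage()) == torch.pointer(params:storage()))  -- should be true
```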

Your LSTM code is fine, splitting it up into multiple lines like that doesn't change anything. The parameter combining isn't related to the inputs/outputs of the modules, just to the parameters/gradParams inside the module objects.
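A quick way to convince yourself of that: however you call forward/backward on the clones, the flat vectors keep aliasing the modules' weights and accumulated gradients, e.g.

```lua
params:uniform(-0.08, 0.08)   -- re-initialises every combined module's weights at once
grad_params:zero()            -- clears every module's accumulated gradients
-- ... run forward/backward in either style ...
print(grad_params:norm())     -- gradient norm over all modules, read from the one flat vector
```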

Hope that helps. Let me know if anything's still unclear.

Cheers,

Brendan


AjayTalati commented 9 years ago

Hi Brendan,

Sorry for not replying earlier and closing this issue sooner. I owe you a big thank you for your clear explanation - it helped a lot :+1:!

I think I've managed to get things working in the end using model_utils.combine_all_parameters and recoding some modules I wrote. I'm pretty sure (99.9%) that your code's fine and all my problems are my own.

It turns out that training this LSTM VAE is a bit subtle - it's taking more time to train than it took to understand and code up. At the moment it seems that to diagnose it there are two aspects you need to keep monitoring.

I'd be very happy to share my code with you when it's working better, if you're interested.

Best regards,

Aj