karpathy / neuraltalk2

Efficient Image Captioning code in Torch, runs on GPU

Error when training started from another model #152

Open wujian752 opened 7 years ago

wujian752 commented 7 years ago

I have tried to initialize the model with a trained model.

However, when I set 'start_from' to the path of the trained model, I got the error shown below.

initializing weights from checkpoint_path/model_id.t7
...jian/torch/install/bin/luajit: torch/install/share/lua/5.1/nn/Module.lua:297: misaligned parameter at 2
stack traceback:
        [C]: in function 'assert'
        ...jian/torch/install/share/lua/5.1/nn/Module.lua:297: in function 'getParameters'
        train.lua:158: in main chunk
        [C]: in function 'dofile'
        ...jian/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

It seems the storageOffsets of the parameters and gradParameters are not the same, probably because of net_utils.sanitize_gradients and net_utils.unsanitize_gradients.
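For context, this is roughly the pattern those two helpers follow (a paraphrased sketch, not the verbatim net_utils.lua code): gradient buffers are dropped before a checkpoint is saved and recreated as fresh tensors on load. Because the recreated gradWeight/gradBias live in brand-new storages, nn.Module:getParameters — which flattens all weights and gradients into two parallel flat storages and asserts that their offsets match — can then fail with the "misaligned parameter" assert, e.g. when some modules share parameters but the recreated gradients no longer share storage the same way.

```lua
-- Hedged sketch of the sanitize/unsanitize pattern (paraphrased from
-- neuraltalk2's net_utils; names of the helpers are from the source,
-- the bodies here are an approximation).
local net_utils = {}

function net_utils.sanitize_gradients(net)
  for _, m in ipairs(net:listModules()) do
    m.gradWeight = nil  -- drop gradient buffers so checkpoints stay small
    m.gradBias = nil
  end
end

function net_utils.unsanitize_gradients(net)
  for _, m in ipairs(net:listModules()) do
    if m.weight and not m.gradWeight then
      m.gradWeight = m.weight:clone():zero()  -- fresh storage, new offset
    end
    if m.bias and not m.gradBias then
      m.gradBias = m.bias:clone():zero()
    end
  end
end
```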

jimie208 commented 7 years ago

I met the same problem and I don't understand why. I think the trained model may have exchanged the parameters or gradParameters in a wrong way during training, so it doesn't work.

jimie208 commented 7 years ago

I use this method to get the parameters of every layer (21 is the number of layers in my model):

    params = {}
    for i = 1, 21 do
      params[i] = model:get(i):getParameters()
    end
    torch.save('params.t7', params)  -- filename assumed; the original call was missing one

and then use these saved parameters to initialize a new model I create.
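The second half of that workaround — copying the saved per-layer parameters into a freshly built model — could be sketched like this (a rough sketch: `params.t7` and `new_model` are assumed names, and the new model must have the same 21-layer architecture):

```lua
-- Hypothetical sketch of restoring the saved per-layer parameters.
local saved = torch.load('params.t7')  -- assumed filename from the save step

for i = 1, 21 do
  -- getParameters returns a flat view of layer i's weights; copying into
  -- it overwrites that layer's parameters with the saved values.
  local p = new_model:get(i):getParameters()
  p:copy(saved[i])
end
```

Note that calling getParameters per layer like this sidesteps the single full-model getParameters call in train.lua that triggers the "misaligned parameter" assert, at the cost of not having one flat parameter vector for the optimizer.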