karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

code for init_from in training has few bugs #131

Open udibr opened 8 years ago

udibr commented 8 years ago

Line https://github.com/karpathy/char-rnn/blob/master/train.lua#L127 contains a Lua syntax mistake.

The checkpoint vocab may be smaller than the input vocab and still pass the test.

The checkpoint may contain a model that is not an lstm.

wrapperband commented 8 years ago

Thanks, this looks like what was causing my "bug" (which was really discouraging), where training loss kept going to infinity, even with new data. I'll retest and see if it helps my case.

udibr commented 8 years ago

If you are using a different input.txt when re-running the training code with an old model (using the -init_from flag), then you need more changes to the code. I have support for this on my develop branch: https://github.com/udibr/char-rnn/tree/develop

Also, if the new input.txt file is in the same directory as the old file, you should delete data.t7 and vocab.t7.
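A minimal sketch of the cleanup described above, assuming the cached files are named data.t7 and vocab.t7 and sit in the same directory as input.txt. The helper name `clear_char_rnn_cache` is hypothetical and not part of char-rnn.

```shell
#!/bin/sh
# Hypothetical helper (not part of char-rnn): remove the cached
# preprocessing files so the next training run rebuilds them from the
# current input.txt.
# Assumption: data.t7 and vocab.t7 live next to input.txt.
clear_char_rnn_cache() {
    data_dir="$1"
    if [ ! -f "$data_dir/input.txt" ]; then
        echo "no input.txt in $data_dir, nothing to do" >&2
        return 1
    fi
    # -f: do not fail if a cache file is already absent
    rm -f "$data_dir/data.t7" "$data_dir/vocab.t7"
}
```

Calling `clear_char_rnn_cache` with the directory passed to `-data_dir` before rerunning `th train.lua` should force preprocessing to run again on the new input.txt.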

wrapperband commented 8 years ago

Thanks for the update. It sounds like the new inputs also caused some of my problems. But I always created new data.t7 and vocab.t7 for updated input, so that wasn't the cause.

wrapperband commented 8 years ago

I'm setting up a new PC with an R9 290 (15.10), which might be causing other problems.
Bearing that in mind, I tested udibr's development version and got the same or similar errors. With either char-rnn version I can't start a new net.

I had 2 other errors, one regarding "?" (in a diamond) characters when creating data.t7 and vocab.t7.
I had trouble capturing those, as the graphics crash and you lose the window top menu bars (Kubuntu).
I have reinstalled torch etc. a couple of times, but will do a complete reinstall next if there are no other ideas. I have already swapped checkpoints between an R9 270 and an HD 6970, so moving to the R9 290 should be OK. I've also done a couple of driver reinstalls.

th train.lua -data_dir ~/programs/char-rnn/data/songster11 -opencl 1 -gpuid 0 -init_from cv/Songster3-0-02.t7 -dropout .5 -seed 97 -eval_val_every 1200 -savefile 'Songster4-1-6.95-286' -max_epochs 1 -train_frac 0.95 -val_frac 0.05

th train.lua -data_dir ~/programs/char-rnn/data/songster11 -opencl 1 -seq_length 180 -rnn_size 700 -num_layers 4 -max_epochs 50 -savefile 'Songster4-0.94' -eval_val_every 2000 -train_frac 0.945 -val_frac 0.05

user@marvin-songster:~/programs/char-rnn$ ./songster.sh
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 1395, val: 74, test: 0
vocab size: 114
loading a model from checkpoint cv/Songster3-0-02.t7
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
checkpoint_vocab_size: 113
/home/user/torch/install/bin/luajit: train.lua:137: error, the character vocabulary for this dataset and the one in the saved checkpoint are not the same. This is trouble.
stack traceback:
    [C]: in function 'assert'
    train.lua:137: in main chunk
    [C]: in function 'dofile'
    ...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d70

user@marvin-songster:~/programs/char-rnn$ ./starter.sh
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 385, val: 20, test: 3
vocab size: 114
creating an lstm with 4 layers
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
setting forget gate biases to 1 in LSTM layer 3
setting forget gate biases to 1 in LSTM layer 4
number of parameters in the model: 14141514
cloning rnn
cloning criterion
/home/user/torch/install/bin/luajit: /home/user/torch/install/share/lua/5.1/nn/CAddTable.lua:21: Error: copyTo failed with -4 at /tmp/luarocks_cltorch-scm-1-458/cltorch/cltorch/src/lib/THClTensorCopy.cpp:162
stack traceback:
    [C]: in function 'copy'
    /home/user/torch/install/share/lua/5.1/nn/CAddTable.lua:21: in function 'updateGradInput'
    /home/user/torch/install/share/lua/5.1/nngraph/gmodule.lua:327: in function 'neteval'
    /home/user/torch/install/share/lua/5.1/nngraph/gmodule.lua:361: in function 'updateGradInput'
    /home/user/torch/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
    train.lua:284: in function 'opfunc'
    /home/user/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
    train.lua:314: in main chunk
    [C]: in function 'dofile'
    ...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d70

wrapperband commented 8 years ago

Here's the error message when restarting from a checkpoint using the -init_from flag with the updated version:

user@marvin-songster:~/programs/char-rnn1$ ./songster.sh
using OpenCL on GPU 0...
loading a model from checkpoint cv/lm_Songster4-1-6.95-286_epoch1.00_1.6287.t7
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
overwriting rnn_size=700, num_layers=4, model=lstm based on the checkpoint.
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file /home/user/programs/char-rnn1/data/songster11/input.txt...
loading text file...
creating vocabulary mapping...
putting data into tensor...
/home/user/torch/install/bin/luajit: ./util/CharSplitLMMinibatchLoader.lua:171: char "� not in dictionary
stack traceback:
    [C]: in function 'assert'
    ./util/CharSplitLMMinibatchLoader.lua:171: in function 'text_to_tensor'
    ./util/CharSplitLMMinibatchLoader.lua:38: in function 'create'
    train.lua:141: in main chunk
    [C]: in function 'dofile'
    ...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d70
user@marvin-songster:~/programs/char-rnn1$
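The "not in dictionary" assertion, together with the earlier log showing a vocab size of 114 against a checkpoint_vocab_size of 113, suggests that the new input.txt contains at least one character the checkpoint has never seen. A rough way to find it is sketched here, under the assumption (not stated in this thread) that the loader treats each byte of input.txt as one character. The helper names are hypothetical and not part of char-rnn.

```shell
#!/bin/sh
# Hypothetical helpers, not part of char-rnn.
# Assumption: every byte of input.txt counts as one vocabulary entry.

# Print each distinct byte value (decimal) found in the file, one per line.
distinct_bytes() {
    od -A n -v -t u1 "$1" | tr -s ' ' '\n' | grep -v '^$' | sort -n -u
}

# Print how many distinct byte values the file contains.
count_distinct_bytes() {
    distinct_bytes "$1" | wc -l | tr -d ' '
}

# Print byte values present in the second file but absent from the first.
new_bytes() {
    old_list=$(mktemp)
    new_list=$(mktemp)
    # comm needs lexically sorted input, so re-sort without -n
    distinct_bytes "$1" | sort > "$old_list"
    distinct_bytes "$2" | sort > "$new_list"
    comm -13 "$old_list" "$new_list"
    rm -f "$old_list" "$new_list"
}
```

Running `count_distinct_bytes` on an input.txt gives a number to compare with the vocab size that train.lua prints, and `new_bytes old_input.txt new_input.txt` lists the byte values that only the new file contains. If the loader does not work per byte, these counts will not match the reported vocabulary size.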

Atcold commented 8 years ago

Retraining also has this bug: https://github.com/karpathy/char-rnn/issues/137