udibr opened this issue 8 years ago
Thanks, this looks like what was causing my "bug" (which was quite discouraging), where the training loss kept going to infinity even with new data. I'll retest and see if it helps my case.
If you are using a different input.txt when re-running the training code with an old model (via the -init_from flag), then you need more changes to the code. I have support for this on my develop branch: https://github.com/udibr/char-rnn/tree/develop
Also, if the new input.txt is in the same directory as the old file, you should delete data.t7 and vocab.t7.
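For example, assuming the dataset lives in data/my_dataset and the old checkpoint is cv/some_checkpoint.t7 (both names illustrative), that means something like:

rm data/my_dataset/data.t7 data/my_dataset/vocab.t7
th train.lua -data_dir data/my_dataset -init_from cv/some_checkpoint.t7

Deleting the two .t7 files forces the preprocessing step to rebuild the tensor and vocabulary from the new input.txt.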
Thanks for the update; it sounds like the new inputs also caused some of my problems. But I always created a new data.t7 and vocab.t7 for updated input, so that wasn't it.
I'm setting up a new PC with an R9 290, which might be causing other problems (Kubuntu 15.10).
Bearing that in mind, I tested udibr's development version and got the same or similar errors. With either char-rnn version I can't start a new net.
I had two other errors, one regarding "?"-in-a-diamond characters (i.e. �, the Unicode replacement character) when creating data.t7 and vocab.t7.
I had trouble capturing those, as the graphics crash and you lose the window's top menu bars (Kubuntu).
I have re-installed Torch etc. a couple of times, but will do a complete reinstall next if no other ideas surface. I have already swapped checkpoints between an R9 270 and an HD 6970, so moving to the R9 290 should be OK. I've done a couple of driver reinstalls.
th train.lua -data_dir ~/programs/char-rnn/data/songster11 -opencl 1 -gpuid 0 -init_from cv/Songster3-0-02.t7 -dropout .5 -seed 97 -eval_val_every 1200 -savefile 'Songster4-1-6.95-286' -max_epochs 1 -train_frac 0.95 -val_frac 0.05
th train.lua -data_dir ~/programs/char-rnn/data/songster11 -opencl 1 -seq_length 180 -rnn_size 700 -num_layers 4 -max_epochs 50 -savefile 'Songster4-0.94' -eval_val_every 2000 -train_frac 0.945 -val_frac 0.05
user@marvin-songster:~/programs/char-rnn$ ./songster.sh
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 1395, val: 74, test: 0
vocab size: 114
loading a model from checkpoint cv/Songster3-0-02.t7
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
checkpoint_vocab_size: 113
/home/user/torch/install/bin/luajit: train.lua:137: error, the character vocabulary for this dataset and the one in the saved checkpoint are not the same. This is trouble.
stack traceback:
[C]: in function 'assert'
train.lua:137: in main chunk
[C]: in function 'dofile'
...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d70
user@marvin-songster:~/programs/char-rnn$ ./starter.sh
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 385, val: 20, test: 3
vocab size: 114
creating an lstm with 4 layers
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
setting forget gate biases to 1 in LSTM layer 3
setting forget gate biases to 1 in LSTM layer 4
number of parameters in the model: 14141514
cloning rnn
cloning criterion
/home/user/torch/install/bin/luajit: /home/user/torch/install/share/lua/5.1/nn/CAddTable.lua:21: Error: copyTo failed with -4 at /tmp/luarocks_cltorch-scm-1-458/cltorch/cltorch/src/lib/THClTensorCopy.cpp:162
stack traceback:
[C]: in function 'copy'
/home/user/torch/install/share/lua/5.1/nn/CAddTable.lua:21: in function 'updateGradInput'
/home/user/torch/install/share/lua/5.1/nngraph/gmodule.lua:327: in function 'neteval'
/home/user/torch/install/share/lua/5.1/nngraph/gmodule.lua:361: in function 'updateGradInput'
/home/user/torch/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
train.lua:284: in function 'opfunc'
/home/user/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
train.lua:314: in main chunk
[C]: in function 'dofile'
...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d70
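For what it's worth, OpenCL error -4 is CL_MEM_OBJECT_ALLOCATION_FAILURE: the device could not allocate a buffer. With seq_length 180, rnn_size 700 and 4 layers, the cloned net may simply not fit on the card; if that is the cause, a smaller net or batch (the flags below all exist in train.lua; the savefile name is just a placeholder) should get past it, e.g.:

th train.lua -data_dir ~/programs/char-rnn/data/songster11 -opencl 1 -seq_length 180 -rnn_size 512 -num_layers 3 -batch_size 25 -max_epochs 50 -savefile 'Songster4-small' -eval_val_every 2000 -train_frac 0.945 -val_frac 0.05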
Here's the error message when restarting from a checkpoint with the -init_from flag and the updated version:
user@marvin-songster:~/programs/char-rnn1$ ./songster.sh
using OpenCL on GPU 0...
loading a model from checkpoint cv/lm_Songster4-1-6.95-286_epoch1.00_1.6287.t7
Using Advanced Micro Devices, Inc. , OpenCL platform: AMD Accelerated Parallel Processing
Using OpenCL device: Hawaii
overwriting rnn_size=700, num_layers=4, model=lstm based on the checkpoint.
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file /home/user/programs/char-rnn1/data/songster11/input.txt...
loading text file...
creating vocabulary mapping...
putting data into tensor...
/home/user/torch/install/bin/luajit: ./util/CharSplitLMMinibatchLoader.lua:171: char "� not in dictionary
stack traceback:
[C]: in function 'assert'
./util/CharSplitLMMinibatchLoader.lua:171: in function 'text_to_tensor'
./util/CharSplitLMMinibatchLoader.lua:38: in function 'create'
train.lua:141: in main chunk
[C]: in function 'dofile'
...user/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d70
user@marvin-songster:~/programs/char-rnn1$
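For the "char not in dictionary" failure, here is a rough sketch of how to list which bytes in the new input.txt are missing from the saved vocabulary. It assumes vocab.t7 holds the char-to-index table that char-rnn's preprocessing writes, and it scans byte-wise, so a multi-byte UTF-8 character shows up as several entries:

require 'torch'
-- paths match the run above; adjust to your data_dir
local vocab = torch.load('data/songster11/vocab.t7')  -- char -> index
local f = assert(io.open('data/songster11/input.txt', 'r'))
local reported = {}
for line in f:lines() do
    for j = 1, #line do
        local c = line:sub(j, j)
        if vocab[c] == nil and not reported[c] then
            reported[c] = true
            print(string.format('missing char %q (byte %d)', c, string.byte(c)))
        end
    end
end
f:close()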
Retraining also hits this bug -> https://github.com/karpathy/char-rnn/issues/137
At https://github.com/karpathy/char-rnn/blob/master/train.lua#L127 there is a Lua syntax mistake. Related problems:
- the checkpoint vocab may be smaller than the input vocab and still pass the test (see the sketch after this list)
- the checkpoint may contain a model which is not an LSTM
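If the line in question reads "if not vocab[c] == i then", Lua's operator precedence parses it as "(not vocab[c]) == i", a boolean compared to a number, which is never true, so a mismatched character never gets flagged. A minimal corrected sketch (variable names follow train.lua, but this is not the repo's exact code) that also catches the subset case from the first bullet:

local vocab_compatible = true
local checkpoint_vocab_size = 0
for c, i in pairs(checkpoint.vocab) do
    -- "vocab[c] ~= i" avoids the "(not vocab[c]) == i" precedence trap
    if vocab[c] ~= i then
        vocab_compatible = false
    end
    checkpoint_vocab_size = checkpoint_vocab_size + 1
end
-- a checkpoint vocab that is only a subset of the input vocab must also fail
if checkpoint_vocab_size ~= #loader.idx_to_vocab then
    vocab_compatible = false
end
assert(vocab_compatible, 'dataset and checkpoint vocabularies differ')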