karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

Training and Validation not handled correctly #181

Open wrapperband opened 7 years ago

wrapperband commented 7 years ago

I have returned to testing char-rnn, hoping earlier issues might have been cleared up, particularly around starting from checkpoints, adding additional training data, and finding optimum settings. I am still having problems.

There seems to be an inconsistency in how the train and validation splits are set. If I start a new run without specifying a validation fraction, this is what is created:

data load done. Number of data batches in train: 1446, val: 23, test: 0

Note: this should say test: 23.

If I set the validation fraction explicitly, then the split is correct/proportional.

e.g.

```
th train.lua -data_dir ~/char-rnn/data/songster11 -opencl 1 -gpuid 0 -init_from cv/lm_Songster11-1_epoch61.00_1.1215.t7 -dropout .05 -eval_val_every 1446 -learning_rate_decay_after 5 -savefile 'Songster11-2' -max_epochs 20 -train_frac 0.98444 -val_frac 0.0002
```

(where the validation fraction is very small) gives:

data load done. Number of data batches in train: 1446, val: 0, test: 23
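To check my understanding of the arithmetic, here is a minimal sketch of what I believe the loader computes for this run, assuming nbatches = 1469 (inferred from 1446 + 0 + 23):

```lua
-- minimal sketch of the split arithmetic, assuming nbatches = 1469
-- (inferred from 1446 + 0 + 23); fractions as passed on the command line
local nbatches = 1469
local split_fractions = {0.98444, 0.0002, 1 - 0.98444 - 0.0002}
local ntrain = math.floor(nbatches * split_fractions[1]) -- floor(1446.14) = 1446
local nval   = math.floor(nbatches * split_fractions[2]) -- floor(0.29)    = 0
local ntest  = nbatches - nval - ntrain                  -- remainder      = 23
print(ntrain, nval, ntest) -- 1446  0  23
```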

Just so I understand, what is the difference between training and validation?

wrapperband commented 7 years ago

The problem seems to be in util/CharSplitLMMinibatchLoader.lua

On these lines :

```lua
-- perform safety checks on split_fractions
assert(split_fractions[1] >= 0 and split_fractions[1] <= 1, 'bad split fraction ' .. split_fractions[1] .. ' for train, not between 0 and 1')
assert(split_fractions[2] >= 0 and split_fractions[2] <= 1, 'bad split fraction ' .. split_fractions[2] .. ' for val, not between 0 and 1')
assert(split_fractions[3] >= 0 and split_fractions[3] <= 1, 'bad split fraction ' .. split_fractions[3] .. ' for test, not between 0 and 1')
if split_fractions[3] == 0 then
    -- catch a common special case where the user might not want a test set
    self.ntrain = math.floor(self.nbatches * split_fractions[1])
    self.nval = self.nbatches - self.ntrain
    self.ntest = 0
else
    -- divide data to train/val and allocate rest to test
    self.ntrain = math.floor(self.nbatches * split_fractions[1])
    self.nval = math.floor(self.nbatches * split_fractions[2])
    self.ntest = self.nbatches - self.nval - self.ntrain -- the rest goes to test (to ensure this adds up exactly)
end
```
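If I am reading this right, the first run (presumably with -train_frac 0.98444 but no -val_frac, so train.lua's default val_frac of 0.05 applies and test_frac is clamped to 0) lands in the special-case branch. A sketch, assuming nbatches = 1469:

```lua
-- sketch of how the first run lands in the special case, assuming
-- nbatches = 1469, -train_frac 0.98444, and train.lua's default val_frac 0.05
local nbatches = 1469
local train_frac, val_frac = 0.98444, 0.05
local test_frac = math.max(0, 1 - (train_frac + val_frac)) -- clamped to 0
-- test_frac == 0, so the special-case branch runs:
local ntrain = math.floor(nbatches * train_frac) -- 1446
local nval   = nbatches - ntrain                 -- 23, the leftover batches
local ntest  = 0
print(ntrain, nval, ntest) -- 1446  23  0
```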

Possible cause of the error:

I'm currently assuming this logic only produces a test split when a nonzero test fraction is passed through; otherwise the test count is zero and all leftover batches go to validation.
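If leftover batches are meant to go to test rather than validation, a possible fix (just a sketch, untested) would be to compute nval from split_fractions[2] in both cases:

```lua
-- sketch of a possible fix: always compute nval from the requested fraction,
-- so leftover batches fall into test instead of silently inflating val
self.ntrain = math.floor(self.nbatches * split_fractions[1])
self.nval   = math.floor(self.nbatches * split_fractions[2])
self.ntest  = self.nbatches - self.nval - self.ntrain -- remainder goes to test
```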