karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

Training the net locks up the operating system #90

Closed SevenBlocks closed 9 years ago

SevenBlocks commented 9 years ago

Update: I deleted and completely reinstalled the latest char-rnn and torch, optim, and nngraph code from github and that solved the problem. Either there was a bug fix in one of the repositories or I had screwed up my first installation somehow.

When I run th ./train.lua, the net gets through about 1500 training iterations before my entire system locks up and I have to restart the operating system. This happens every time I try to train the net. Is there any way to troubleshoot the cause of this issue, or any error logs I can examine?

My setup is:
Operating system: Ubuntu 14.04
torch7
Lua 5.1
CPU mode (gpuid = -1)
All other options are the defaults in the train.lua file

I tried lowering the priority of the process, but that didn't help. I scanned through some of the Linux logs in /var/log, but I'm not sure what to look for. Any help is appreciated. Thanks.
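
A full-system lockup that deep into CPU training often points to memory exhaustion and heavy swapping rather than a crash in the code itself; if the kernel's OOM killer fired, it would usually leave a trace in dmesg or /var/log/kern.log. One way to narrow it down is to log the process's resident memory alongside the loss. The sketch below is not part of char-rnn; the log_rss helper and its call site are assumptions:

-- Minimal sketch (not part of char-rnn): read this process's resident set
-- size from /proc/self/status on Linux and print it with the iteration count.
local function log_rss(iteration)
    local f = io.open('/proc/self/status', 'r')
    if not f then return end
    for line in f:lines() do
        local rss_kb = line:match('^VmRSS:%s+(%d+)%s+kB')
        if rss_kb then
            print(string.format('iteration %d: VmRSS = %.1f MB',
                                iteration, tonumber(rss_kb) / 1024))
            break
        end
    end
    f:close()
end
-- Hypothetical call site: inside train.lua's training loop, next to the
-- existing loss printout, e.g. log_rss(i)

If the reported number climbs steadily toward the machine's total RAM, the lockup is most likely swap thrash rather than a bug in Torch.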

raidancampbell commented 9 years ago

What parameters were you using to train the model? (num_layers, depth, etc.)

SevenBlocks commented 9 years ago

I left everything pretty much as it was in the original GitHub file except for gpuid=-1, batch_size=55, and init_from, since it got far enough to create a checkpoint file. Here are the values for all the parameters (an equivalent command-line invocation is sketched after the listing):

cmd = torch.CmdLine()
cmd:text()
cmd:text('Train a character-level language model')
cmd:text()
cmd:text('Options')
-- data
cmd:option('-data_dir','data/tinyshakespeare','data directory. Should contain the file input.txt with input data')
-- model params
cmd:option('-rnn_size', 128, 'size of LSTM internal state')
cmd:option('-num_layers', 2, 'number of layers in the LSTM')
cmd:option('-model', 'lstm', 'lstm,gru or rnn')
-- optimization
cmd:option('-learning_rate',2e-3,'learning rate')
cmd:option('-learning_rate_decay',0.97,'learning rate decay')
cmd:option('-learning_rate_decay_after',10,'in number of epochs, when to start decaying the learning rate')
cmd:option('-decay_rate',0.95,'decay rate for rmsprop')
cmd:option('-dropout',0,'dropout for regularization, used after each RNN hidden layer. 0 = no dropout')
cmd:option('-seq_length',50,'number of timesteps to unroll for')
cmd:option('-batch_size',55,'number of sequences to train on in parallel')
cmd:option('-max_epochs',50,'number of full passes through the training data')
cmd:option('-grad_clip',5,'clip gradients at this value')
cmd:option('-train_frac',0.95,'fraction of data that goes into train set')
cmd:option('-val_frac',0.05,'fraction of data that goes into validation set')
            -- test_frac will be computed as (1 - train_frac - val_frac)
cmd:option('-init_from',
           "/media/myusrname/torch7_tutorials/karpathy_rnn_code/cv/lm_lstm_epoch2.60_1.7004.t7",
           'initialize network parameters from checkpoint at this path')
-- bookkeeping
cmd:option('-seed',123,'torch manual random number generator seed')
cmd:option('-print_every',1,'how many steps/minibatches between printing out the loss')
cmd:option('-eval_val_every',1000,'every how many iterations should we evaluate on validation data?')
cmd:option('-checkpoint_dir', 'cv', 'output directory where checkpoints get written')
cmd:option('-savefile','lstm','filename to autosave the checkpoint to. Will be inside checkpoint_dir/')
-- GPU/CPU
cmd:option('-gpuid',-1,'which gpu to use. -1 = use CPU')
cmd:option('-opencl',0,'use OpenCL (instead of CUDA)')
cmd:text()
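
For reference, train.lua parses these options into an opt table with torch.CmdLine, so the same overrides can also be passed on the command line instead of editing the file. A minimal sketch of the parse step, with the equivalent invocation shown as a comment (paths as in the listing above):

opt = cmd:parse(arg)
-- Equivalent command-line overrides instead of editing train.lua:
--   th train.lua -gpuid -1 -batch_size 55 \
--      -init_from /media/myusrname/torch7_tutorials/karpathy_rnn_code/cv/lm_lstm_epoch2.60_1.7004.t7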

SevenBlocks commented 9 years ago

I just reinstalled the latest torch and char-rnn repositories and that fixed the problem.