karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

Cannot Train: attempt to call field 'ClassNLLCriterion_updateOutput' (a nil value) #139

Open eternaldensity opened 8 years ago

eternaldensity commented 8 years ago

Training used to work on my computer, though I hadn't tried it in several months. I've occasionally done some sampling, but no training since July (largely because my CPU was overheating, and CUDA is not a feasible option sadly).

Any attempt at training invariably gives me the following error in the nn library. Reinstalling Torch and the nngraph/optim/nn packages has not helped. I redownloaded the latest version of char-rnn into a new folder and that didn't help either. Any ideas on what I need to do to get this working?

It looks like the nn package isn't being initialised correctly in torch, but I don't know why/how.

The only thing Google has turned up so far is https://github.com/torch/nn/issues/122 and that isn't at all applicable. I can't even install cutorch/cunn on my system, as my hardware cannot support it. Maybe I should just get a new computer :P

ed@ed:~/char-rnn-master$ th train.lua -gpuid -1 -data_dir data/fora -rnn_size 512 -num_layers 2 -dropout 0.5
loading data files...   
cutting off end of data so that the batches/sequences divide evenly 
reshaping tensor... 
data load done. Number of data batches in train: 1414, val: 75, test: 0 
vocab size: 187 
creating an lstm with 2 layers  
setting forget gate biases to 1 in LSTM layer 1 
setting forget gate biases to 1 in LSTM layer 2 
number of parameters in the model: 3632827  
cloning rnn 
cloning criterion   
/home/ed/torch/install/bin/luajit: .../ed/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:44: attempt to call field 'ClassNLLCriterion_updateOutput' (a nil value)
stack traceback:
    .../ed/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:44: in function 'forward'
    train.lua:274: in function 'opfunc'
    /home/ed/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
    train.lua:314: in main chunk
    [C]: in function 'dofile'
    ...e/ed/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk

mrob27 commented 8 years ago

I see you're not using cutorch/cunn. When it was working, were you using OpenCL (cltorch/clnn) or CPU (torch/nn generic)?

torch/nn generic seems to define ClassNLLCriterion_updateOutput here: https://github.com/torch/nn/blob/master/generic/ClassNLLCriterion.c and cunn defines it here: https://github.com/torch/cunn/blob/master/ClassNLLCriterion.cu but I can't find where clnn defines it. (It ought to be in https://github.com/hughperkins/clnn somewhere: https://github.com/hughperkins/clnn/search?q=ClassNLLCriterion_updateOutput )

If that were the only issue, a simple way to test it would be to train a new model by invoking train.lua with -opencl 0 and -gpuid -1 (telling it to use the CPU). I know your CPU overheats, but the test would just be to quickly find out if it gets past this error.

eternaldensity commented 8 years ago

I had it working with CPU (-gpuid -1) before.

FYI, using the Python gist doesn't (quite) overheat because it only uses one core.

mrob27 commented 8 years ago

Does it still work with -gpuid -1 now? (That's the quick test I was asking you to perform.)

If yes, that implies you're trying to use OpenCL, but can't. Then I think the problem should be submitted at github.com/hughperkins/clnn, specifically asking why there is no "ClassNLLCriterion_updateOutput" anywhere in that repository.

If no, that implies you'd be happy with anything, even CPU, but that's broken too. Then it's an issue for the torch/nn project, over at github.com/torch/nn and the question to ask is "why does generic/ClassNLLCriterion.c define ClassNLLCriterion_updateOutput but ClassNLLCriterion.lua does not see the definition?"

(And yes, I like the Python/numpy solution for similar reasons. I have a big desktop machine where I can have a few models training at once, and the computer is still extremely usable, stays cool, and the fan doesn't even run much. You can get my version at mrob.com/pub/comp/min-char-rnn.py.txt and read the block-comment for setup instructions. Same license as the original. )

eternaldensity commented 8 years ago

I don't quite follow. In my original post, it shows that I used -gpuid -1. What do you want me to change?

mrob27 commented 8 years ago

Oh, I see now: I thought it was just a bunch of error messages. Yes, "th train.lua -gpuid -1" is right there. Sorry! :neutral_face:

Okay, that means it's an issue for the torch/nn project.
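
Before filing that issue, it may help to check the question raised above: why ClassNLLCriterion.lua does not see a definition of ClassNLLCriterion_updateOutput. The sketch below is one possible check, not a confirmed diagnosis. It assumes the cause could be a mismatch between the installed Lua files and the compiled nn library, which nothing in this thread confirms. The helper name check_symbol is hypothetical, and the library path in the usage comment is an assumption; only the Lua path is taken from the stack trace.

```shell
#!/bin/sh
# Hypothetical helper: report whether a name appears in a Lua wrapper file
# and in a compiled library file.
# Usage: check_symbol NAME LUA_FILE LIB_FILE
check_symbol() {
    name=$1
    lua_file=$2
    lib_file=$3

    # -F: match a fixed string; -q: quiet; -s: stay silent if the file is missing.
    if grep -Fqs -- "$name" "$lua_file"; then
        echo "lua: references $name"
    else
        echo "lua: no reference to $name"
    fi

    # -a makes grep treat the binary as text so the name can be found inside it.
    if grep -aFqs -- "$name" "$lib_file"; then
        echo "lib: contains $name"
    else
        echo "lib: does not contain $name"
    fi
}

# Example call. The libnn.so location is an assumption; adjust both paths
# to the local install.
# check_symbol ClassNLLCriterion_updateOutput \
#     "$HOME/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua" \
#     "$HOME/torch/install/lib/lua/5.1/libnn.so"
```

If the Lua file references the name but the library does not contain it, that would be consistent with the two coming from different versions of nn, and would be worth mentioning in the torch/nn report. If both contain it, the cause lies elsewhere.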