Closed tigerneil closed 8 years ago
In theory, this was fixed yesterday, in https://github.com/hughperkins/cltorch/commit/2df6d9928d16613ef1cbc51e1324ae929db95979 . Do you mind installing the latest cltorch, via luarocks install cltorch, and confirming whether the problem persists?
bash-3.2$ th train.lua --dataset 1000 --hiddenSize 100
libthclnn_searchpath /Users/zhuxiaohu/torch/install/lib/lua/5.1/libTHCLNN.so
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...
Dataset stats:
Vocabulary size: 2536
Examples: 1569
Using Apple , OpenCL platform: Apple
Using OpenCL device: HD Graphics 4000
-- Epoch 1 / 50
/Users/zhuxiaohu/torch/install/bin/luajit: ...u/torch/install/share/lua/5.1/clnn/ClassNLLCriterion.lua:23: Input to clnn.ClassNLLCriterion should be 2-d tensor
stack traceback:
[C]: in function 'error'
...u/torch/install/share/lua/5.1/clnn/ClassNLLCriterion.lua:23: in function 'forward'
...u/torch/install/share/lua/5.1/rnn/SequencerCriterion.lua:27: in function 'forward'
./seq2seq.lua:80: in function 'train'
train.lua:81: in main chunk
[C]: in function 'dofile'
...aohu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:143: in main chunk
[C]: at 0x0107e5fbc0
The above problem is fixed; however, a new error occurs. Any comments?
ubuntu:~/git/neuralconvo-cl$ th train.lua --dataset 1000 --hiddenSize 100
libthclnn_searchpath /home/ubuntu/torch/install/lib/lua/5.1/libTHCLNN.so
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
/home/ubuntu/torch/install/bin/luajit: ./cornell_movie_dialogs.lua:6: data/cornell_movie_dialogs/movie_lines.txt: No such file or directory
I get this output: an error, but a different one from yours:
ubuntu:~/git/neuralconvo-cl$ th train.lua --dataset 1000 --hiddenSize 100
libthclnn_searchpath /home/ubuntu/torch/install/lib/lua/5.1/libTHCLNN.so
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...
Dataset stats:
Vocabulary size: 2536
Examples: 1569
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
-- Epoch 1 / 50
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/nn/LookupTable.lua:85: attempt to call field 'LookupTable_accGradParameters' (a nil value)
stack traceback:
To diagnose the error I see, I opened ~/torch/install/share/lua/5.1/clnn/LookupTable.lua and added the following at the start of accGradParameters:
print('torch.type(input)', torch.type(input))
Result when running:
torch.type(input) torch.IntTensor
I think the input should in fact be a ClTensor?
Seems like you're missing conversion of input and target to cl at lines 77-80 of train.lua:
if options.cuda then
input = input:cuda()
target = target:cuda()
end
Training seems to work ok for me now:
clnn lookuptable.accGradParameters()...... 90/1569 A: 2m8s | Step: 86ms
torch.type(input) torch.ClTensor
torch.type(gradOutput) torch.ClTensor
clnn lookuptable.accGradParameters()
torch.type(input) torch.ClTensor
torch.type(gradOutput) torch.ClTensor
clnn lookuptable.accGradParameters()...... 91/1569 A: 2m7s | Step: 86ms
torch.type(input) torch.ClTensor
torch.type(gradOutput) torch.ClTensor
clnn lookuptable.accGradParameters()
torch.type(input) torch.ClTensor
torch.type(gradOutput) torch.ClTensor
clnn lookuptable.accGradParameters()...... 92/1569 A: 2m7s | Step: 86ms
torch.type(input) torch.ClTensor
torch.type(gradOutput) torch.ClTensor
clnn lookuptable.accGradParameters()
torch.type(input) torch.ClTensor
(Edit: basically I added the following at line 81 of train.lua:
input = input:cl()
target = target:cl()
)
After I updated my Torch packages, it works well:
luarocks install torch
luarocks install nn
luarocks install nngraph
luarocks install cltorch
luarocks install clnn
@hughperkins thanks for your patient help.
Cool :-)
https://github.com/macournoyer/neuralconvo/issues/14 @hughperkins a new problem occurs.
Evaluation fails with the following error:
./seq2seq.lua:124: attempt to call method 'sort' (a nil value)
stack traceback:
./seq2seq.lua:124: in function 'eval'
eval.lua:70: in function 'say'
[string "_RESULT={say "hello"}"]:1: in main chunk
[C]: in function 'xpcall'
/Users/zhuxiaohu/torch/install/share/lua/5.1/trepl/init.lua:650: in function 'repl'
...aohu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:197: in main chunk
[C]: at 0x0104929bc0
-9.7471
-9.4275
-9.8941
-9.7114
-9.6610
-9.8075
-9.9348
-9.7718
-9.7425
-9.7685
-9.7888
-9.7118
-9.8315
-9.7752
-9.5964
-9.7598
-9.8950
-9.5297
-9.7994
-9.8127
-9.6132
[torch.ClTensor of size 2536]
I tried adding a line before the error occurs to print out the value of prediction; it seems normal compared with the cutorch version.
Hmmm... I'm not sure I implemented/ported sort. I'm also not sure how much of a performance bottleneck the sort step is in your algorithm? One thing you could try initially is simply converting that tensor into main memory and sorting on the CPU, something like:
local predictionfloat = prediction:float()
local prob, wordIds = predictionfloat:sort(1, true)
If this doesn't affect performance too much, maybe this is acceptable? If the sort step is relatively slow compared to the rest of the program, then we probably need to rethink. Note that implementing sort on the GPU is non-trivial.
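To make the CPU-fallback idea concrete, here is a minimal pure-Lua sketch of what `sort(1, true)` produces: the scores in descending order plus the original 1-based indices. A plain Lua table stands in for the prediction tensor here, so this is an illustration of the semantics, not Torch code.

```lua
-- Sort a table of scores in descending order, also returning the
-- original indices, mimicking Torch's tensor:sort(1, true).
local function sortDescendingWithIndices(t)
  local indices = {}
  for i = 1, #t do indices[i] = i end
  -- Sort indices by the score they point at, largest first.
  table.sort(indices, function(a, b) return t[a] > t[b] end)
  local sorted = {}
  for rank, idx in ipairs(indices) do sorted[rank] = t[idx] end
  return sorted, indices
end

local prediction = { -9.7, -9.4, -9.9 }
local prob, wordIds = sortDescendingWithIndices(prediction)
-- wordIds[1] is the index of the most likely word (here 2)
```

For a vocabulary of a few thousand entries, a single CPU sort per decoding step like this is usually cheap relative to the network's forward pass.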
(Hmmm, it seems like you don't actually need to sort. You just need the max?:
-- First one is the most likely.
output = wordIds[1]
?
If this is the case, you could simply use max instead, something like (might not be quite correct):
local max, index = prediction:max(1)
)
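To spell out why max suffices when only the top word is needed, here is a pure-Lua sketch of the single-pass argmax idea; a plain table again stands in for the prediction tensor, and the `argmax` helper is just for illustration.

```lua
-- Single pass over the scores: find the best value and its index,
-- which is all the decoder needs if it never uses the full ranking.
local function argmax(t)
  local bestVal, bestIdx = t[1], 1
  for i = 2, #t do
    if t[i] > bestVal then bestVal, bestIdx = t[i], i end
  end
  return bestVal, bestIdx
end

local prediction = { -9.7, -9.4, -9.9 }
local max, index = argmax(prediction)
-- index (here 2) plays the role of wordIds[1] from the sort version
```

This is O(n) per step instead of O(n log n), and it sidesteps the missing GPU sort entirely.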
Hi, in this evaluation stage, the memory we need corresponds to the vocabulary size, so your first suggestion is OK. Evaluation runs correctly, and I can talk with the model now.
I will try a larger hiddenSize and run complete training epochs to see what happens then. It should be OK.
Maybe it would be interesting to see how to implement sort on the GPU. :)
Thanks for your help and suggestions.
Cool. Do let me know if the sort starts becoming a bottleneck. I'm not saying I'll be able to fix it in such a case, but I will at least know it's a pain point that would be good to deal with.
Hey, I am trying to add cltorch support to neuralconvo, and my code is here: cltorch support for neural conversation model.
It all seems fine until the following error occurs:
Seems like I should implement some methods in clnn's LogSoftMax.lua.