hughperkins / clnn

OpenCL backend for Torch nn neural networks library
BSD 2-Clause "Simplified" License

Issue with using an older card or something in pre-reqs? #29

Closed by bturetzky 8 years ago

bturetzky commented 8 years ago

Hello,

I've been trying to run https://github.com/karpathy/char-rnn on a D2700 Atom computer with a GeForce 8400 GS, and I'm having problems with clnn specifically. Incidentally, I have been able to run the char-rnn project on the CPU, just not the GPU.

cltorch.test() passes, EasyCL tests pass. nn.test() passes. I've included the clnn test output.

From the output it seemed that maybe I had some sort of mismatch in the chain of dependent software, but I'm new to torch/lua/opencl so I'm pretty lost there. Alternatively, I thought maybe this card might just be too old (OpenCL 1.1?) for some of the things that clnn is doing? Any help would be appreciated. I'm running Nvidia's 340.96 driver, as that's the one I could get working, on a clean install of Ubuntu 15.10.

clnntest.txt

bturetzky commented 8 years ago

Finally got back to this. Adding ClassNLLCriterionMultipleTarget back to the test suite reproduces all the errors for the SpatialConvolutionMM tests.

luajit -l clnn -e "clnn.test{'SpatialConvolutionMM_forward_single','SpatialConvolutionMM_forward_single_vgglayer13','SpatialConvolutionMM_forward_single_padded','SpatialConvolutionMM_forward_batch','SpatialConvolutionMM_backward_single','SpatialConvolutionMM_backward_batch','ClassNLLCriterionMultipleTarget'}"
Completed 0 asserts in 7 tests with 7 errors

bturetzky commented 8 years ago

The others also appear to be false positives:

$ luajit -l clnn -e "clnn.test{'Sqrt_zero'}"
libthclnn_searchpath    /home/bowen/torch/install/lib/lua/5.1/libTHCLNN.so
Running 1 tests
|  ==> Sqrt_zeroUsing NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 8400GS
_  ==> Done

Completed 2 asserts in 1 tests with 0 errors

--------------------------------------------------------------------------------
$ luajit -l clnn -e "clnn.test{'Tanh_transposed'}"
libthclnn_searchpath    /home/bowen/torch/install/lib/lua/5.1/libTHCLNN.so
Running 1 tests
|  ==> Tanh_transposedUsing NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 8400GS
_  ==> Done

Completed 2 asserts in 1 tests with 0 errors

--------------------------------------------------------------------------------
$ luajit -l clnn -e "clnn.test{'Threshold_transposed'}"
libthclnn_searchpath    /home/bowen/torch/install/lib/lua/5.1/libTHCLNN.so
Running 1 tests
|  ==> Threshold_transposedUsing NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 8400GS
_  ==> Done

Completed 2 asserts in 1 tests with 0 errors

--------------------------------------------------------------------------------
hughperkins commented 8 years ago

Ah, good info! Thanks! That's the test that is running out of memory, right?

bturetzky commented 8 years ago

Correct, well... "Memory object allocation failure, code -4" anyway. Not sure if it's necessarily OOM.
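For reference, code -4 in the Khronos OpenCL headers is CL_MEM_OBJECT_ALLOCATION_FAILURE: the runtime could not allocate memory for a buffer object. It is a separate code from -5 (out of resources) and -6 (out of host memory). A minimal sketch of a lookup for those neighbouring codes follows; the function name cl_error_name is made up for illustration and is not part of clnn, cltorch, or EasyCL:

```shell
#!/bin/sh
# Hypothetical helper (NOT part of clnn/EasyCL): map a few OpenCL error
# codes to the constant names defined in the Khronos cl.h header.
cl_error_name() {
    case "$1" in
        0)  echo "CL_SUCCESS" ;;
        -4) echo "CL_MEM_OBJECT_ALLOCATION_FAILURE" ;;
        -5) echo "CL_OUT_OF_RESOURCES" ;;
        -6) echo "CL_OUT_OF_HOST_MEMORY" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

# Look up the code reported by the failing test.
cl_error_name -4
```

Only a handful of codes are listed here; anything else falls through to the "unknown" branch.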

hughperkins commented 8 years ago

Ok. Seems like that test allocates a huugggee chunk of memory, see line 744 of test.lua. I don't remember why I thought that was a good idea. Don't suppose... if you get a moment, do you mind tweaking the numbers down a bit, until you find a pair of numbers which don't give the error? You'll need to check out the repo first, and build from the repo each time.

To check out the repo:

git clone https://github.com/hughperkins/clnn.git
cd clnn

To build from the repo:

luarocks rocks/clnn-scm-1.rockspec

hughperkins commented 8 years ago

Actually... I don't like the random bit. Let's replace line 744 with:

   local size = 3000

... and then just reduce this number, until the test passes for you.
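The "reduce until it passes" step could be scripted roughly as below. This is only a sketch: find_passing_size and fake_test are hypothetical names, and fake_test is a stand-in with an arbitrary threshold. A real runner would have to edit line 744 of test.lua to the candidate size and then run the clnn test, which is not shown here.

```shell
#!/bin/sh
# find_passing_size START STEP CMD...
# Runs CMD with the candidate size appended as its last argument, starting
# at START and decreasing by STEP until CMD exits 0. Prints the first size
# that passes; returns 1 if no positive size passes.
find_passing_size() {
    size=$1
    step=$2
    shift 2
    while [ "$size" -gt 0 ]; do
        if "$@" "$size" >/dev/null 2>&1; then
            echo "$size"
            return 0
        fi
        size=$((size - step))
    done
    return 1
}

# Stand-in for "set line 744 to this size, then run the test".
# ASSUMPTION: the 2500 threshold is arbitrary, chosen only for the demo.
fake_test() { [ "$1" -le 2500 ]; }

find_passing_size 3000 250 fake_test   # tries 3000, 2750, 2500; prints 2500
```

A fixed step keeps the loop easy to follow. A bisection between a known-failing and a known-passing size would need fewer test runs if each run is slow.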

bturetzky commented 8 years ago

Hmm, not sure about that build command, didn't work for me, but editing the test.lua file in torch/install/share/lua/5.1/clnn gave me success:

luajit -l clnn -e "clnn.test{'ClassNLLCriterionMultipleTarget'}"
libthclnn_searchpath    /home/bowen/torch/install/lib/lua/5.1/libTHCLNN.so
Running 1 tests
|  ==> ClassNLLCriterionMultipleTargetUsing NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 8400GS
_  ==> Done

Completed 2 asserts in 1 tests with 0 errors

--------------------------------------------------------------------------------

Just to check I reran all the tests and I get Completed 98 asserts in 64 tests with 0 errors now. Seems my old card can't handle your tests :-P

hughperkins commented 8 years ago

> Hmm, not sure about that build command, didn't work for me

Ah whoops, the build command should be luarocks make rocks/clnn-scm-1.rockspec. I left out the make by accident.

> editing the test.lua file in torch/install/share/lua/5.1/clnn gave me success

Ok. What did you put at line 744, so that it passes? local size = 3000?

bturetzky commented 8 years ago

Yep, 3000.
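For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes the test builds a square size x size tensor of 4-byte floats, which is a guess: the thread does not show the code around line 744. The 5000 value is likewise an arbitrary larger size for comparison, and tensor_bytes is a made-up helper name.

```shell
#!/bin/sh
# Bytes needed for a square size x size tensor of 4-byte floats.
# ASSUMPTION: the shape and element type are guesses for illustration only.
tensor_bytes() {
    echo $(( $1 * $1 * 4 ))
}

for size in 3000 5000; do   # 5000 is an arbitrary larger value for comparison
    bytes=$(tensor_bytes "$size")
    echo "size=$size -> $bytes bytes (~$((bytes / 1000000)) MB)"
done
```

Under that assumption, size 3000 needs about 36 MB for the one tensor. OpenCL devices also report a per-allocation limit (CL_DEVICE_MAX_MEM_ALLOC_SIZE) that can be smaller than total device memory, so comparing the estimate against that limit may explain why a larger size fails on an older card.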

hughperkins commented 8 years ago

Cool. Will update.

hughperkins commented 8 years ago

ok. Updated in 673ed18

bturetzky commented 8 years ago

Awesome, thanks for all the help.

hughperkins commented 8 years ago

Cool. Glad it worked out :-)