naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

caffe time utility causes a crash with ViennaCL: FATAL ERROR: CL_INVALID_PROGRAM_EXECUTABLE #23

Closed by jainanshul 8 years ago

jainanshul commented 8 years ago

Using caffe/opencl tree with SHA 945f20bd0452893704239b29a8697e7cfc4378bf

OS: OS X, MacBook Pro with Intel Iris Pro GPU

Flags used to compile caffe opencl: -DUSE_CUDA=OFF (left the rest of the flags as default)

STR:

./build/tools/caffe time -model examples/mnist/lenet_train_test.prototxt -gpu 0

Error:

I0315 11:52:33.326529 1972355072 net.cpp:53] Initializing net from parameters: 
I0315 11:52:33.326954 1972355072 layer_factory.hpp:78] Creating layer mnist
I0315 11:52:33.331735 1972355072 net.cpp:101] Creating Layer mnist
I0315 11:52:33.331760 1972355072 net.cpp:431] mnist -> data
I0315 11:52:33.331787 1972355072 net.cpp:431] mnist -> label
I0315 11:52:33.334386 2215936 db_lmdb.cpp:38] Opened lmdb examples/mnist/mnist_train_lmdb
I0315 11:52:33.336191 1972355072 data_layer.cpp:41] output data size: 64,1,28,28
I0315 11:52:33.337107 1972355072 net.cpp:152] Setting up mnist
I0315 11:52:33.337129 1972355072 net.cpp:160] Top shape: 64 1 28 28 (50176)
I0315 11:52:33.337149 1972355072 net.cpp:160] Top shape: 64 (64)
I0315 11:52:33.337170 1972355072 layer_factory.hpp:78] Creating layer conv1
I0315 11:52:33.337187 1972355072 net.cpp:101] Creating Layer conv1
I0315 11:52:33.337193 1972355072 net.cpp:462] conv1 <- data
I0315 11:52:33.337216 1972355072 net.cpp:431] conv1 -> conv1
I0315 11:52:33.337373 1972355072 net.cpp:152] Setting up conv1
I0315 11:52:33.337389 1972355072 net.cpp:160] Top shape: 64 20 24 24 (737280)
I0315 11:52:33.337414 1972355072 layer_factory.hpp:78] Creating layer pool1
I0315 11:52:33.337425 1972355072 net.cpp:101] Creating Layer pool1
I0315 11:52:33.337430 1972355072 net.cpp:462] pool1 <- conv1
I0315 11:52:33.337436 1972355072 net.cpp:431] pool1 -> pool1
I0315 11:52:33.337515 1972355072 net.cpp:152] Setting up pool1
I0315 11:52:33.337527 1972355072 net.cpp:160] Top shape: 64 20 12 12 (184320)
I0315 11:52:33.337541 1972355072 layer_factory.hpp:78] Creating layer conv2
I0315 11:52:33.337549 1972355072 net.cpp:101] Creating Layer conv2
I0315 11:52:33.337554 1972355072 net.cpp:462] conv2 <- pool1
I0315 11:52:33.337560 1972355072 net.cpp:431] conv2 -> conv2
Build Status = -2 ( Err = -11 )
Log: No kernels or only kernel prototypes found when build executable.
Sources: __kernel void null() {
}
libc++abi.dylib: terminating with uncaught exception of type viennacl::ocl::invalid_program_executable: ViennaCL: FATAL ERROR: CL_INVALID_PROGRAM_EXECUTABLE.
If you think that this is a bug in ViennaCL, please report it at viennacl-support@lists.sourceforge.net and supply at least the following information:
 * Operating System
 * Which OpenCL implementation (AMD, NVIDIA, etc.)
 * ViennaCL version
Many thanks in advance!
*** Aborted at 1458067953 (unix time) try "date -d @1458067953" if you are using GNU date ***
PC: @     0x7fff975e3002 __pthread_kill
*** SIGABRT (@0x7fff975e3002) received by PID 7316 (TID 0x70000052f000) stack trace: ***
    @     0x7fff8b294eaa _sigtramp
    @         0x8d20cbcf (unknown)
    @     0x7fff97eeb6e7 abort
    @     0x7fff8d15cf81 abort_message
    @     0x7fff8d182a2f default_terminate_handler()
    @     0x7fff9b28a6c3 _objc_terminate()
    @     0x7fff8d18019e std::__terminate()
    @     0x7fff8d17fc12 __cxa_throw
I0315 11:52:33.338068 1972355072 net.cpp:152] Setting up conv2

naibaf7 commented 8 years ago

@jainanshul That means your OpenCL compiler was unable to compile one (or more) of the OpenCL kernels. Unfortunately, if the compiler does not throw an error by itself, it's currently hard to find out which kernels did not compile.

I'll try to figure out a fix for this.
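
For readers decoding the numbers in the log above ("Build Status = -2 ( Err = -11 )"), here is a minimal sketch that maps them to their symbolic names. The numeric values are the ones defined in the standard OpenCL header (CL/cl.h); the helper names `BuildStatusName` and `ErrorName` are hypothetical and are not part of Caffe or ViennaCL.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps a cl_build_status value to its name.
// Values follow the constants defined in CL/cl.h.
std::string BuildStatusName(int status) {
  switch (status) {
    case 0:  return "CL_BUILD_SUCCESS";
    case -1: return "CL_BUILD_NONE";
    case -2: return "CL_BUILD_ERROR";
    case -3: return "CL_BUILD_IN_PROGRESS";
    default: return "unknown build status";
  }
}

// Hypothetical helper: maps the OpenCL error codes relevant to this
// report to their names. Only a small subset of codes is covered.
std::string ErrorName(int err) {
  switch (err) {
    case 0:   return "CL_SUCCESS";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
    default:  return "other OpenCL error";
  }
}
```

Read this way, the log says the program build failed (CL_BUILD_ERROR, CL_BUILD_PROGRAM_FAILURE). ViennaCL's later CL_INVALID_PROGRAM_EXECUTABLE exception is consistent with that: no executable was built successfully.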

jainanshul commented 8 years ago

For your information, result of caffe device_query

I0315 15:08:22.113821 1972355072 common.cpp:371] Total devices: 2
I0315 15:08:22.114439 1972355072 common.cpp:372] CUDA devices: 0
I0315 15:08:22.114446 1972355072 common.cpp:373] OpenCL devices: 2
I0315 15:08:22.114450 1972355072 common.cpp:397] Device id:                     0
I0315 15:08:22.114454 1972355072 common.cpp:399] Device backend:                OpenCL
I0315 15:08:22.114459 1972355072 common.cpp:401] Backend details:               Apple: OpenCL 1.2 (Dec  8 2015 17:02:20)
I0315 15:08:22.114480 1972355072 common.cpp:403] Device vendor:                 Intel
I0315 15:08:22.114486 1972355072 common.cpp:405] Name:                          Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
I0315 15:08:22.114491 1972355072 common.cpp:407] Total global memory:           17179869184
I0315 15:08:22.114496 1972355072 common.cpp:397] Device id:                     1
I0315 15:08:22.114500 1972355072 common.cpp:399] Device backend:                OpenCL
I0315 15:08:22.114503 1972355072 common.cpp:401] Backend details:               Apple: OpenCL 1.2 (Dec  8 2015 17:02:20)
I0315 15:08:22.114508 1972355072 common.cpp:403] Device vendor:                 Intel
I0315 15:08:22.114513 1972355072 common.cpp:405] Name:                          Iris Pro
I0315 15:08:22.114516 1972355072 common.cpp:407] Total global memory:           1610612736

naibaf7 commented 8 years ago

@jainanshul Does the error happen on both device 0 and 1, or only on the Iris Pro GPU? You could try the following: in cl_kernels.cpp, remove all the lines from

ss << "#ifdef DOUBLE_SUPPORT_AVAILABLE" << "\n\n"; // NOLINT

to

ss << tile_double << "\n\n"; // NOLINT
ss << "#endif" << "\n\n";

then recompile and try to run again. This basically disables double support manually.
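
As a rough illustration of the suggestion above, the sketch below assembles a kernel source string with a stringstream and makes the double-precision section optional. It is not the actual cl_kernels.cpp: the function `AssembleKernelSource`, its `include_double` parameter, and the two placeholder kernel strings are assumptions made for this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Placeholder kernel strings (assumed for illustration only; the real
// file holds many generated kernel strings).
static const std::string kTileFloat = "__kernel void tile_float() {}";
static const std::string kTileDouble = "__kernel void tile_double() {}";

// Hypothetical assembler: concatenates kernel strings into one program
// source. When include_double is false, the whole guarded block is left
// out, which is the effect of deleting those lines by hand.
std::string AssembleKernelSource(bool include_double) {
  std::stringstream ss;
  ss << kTileFloat << "\n\n";
  if (include_double) {
    ss << "#ifdef DOUBLE_SUPPORT_AVAILABLE" << "\n\n";
    ss << kTileDouble << "\n\n";
    ss << "#endif" << "\n\n";
  }
  return ss.str();
}
```

Calling `AssembleKernelSource(false)` yields a source with no double-precision kernels at all, so a compiler that mishandles that section never sees it.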

jainanshul commented 8 years ago

It happens on both device 0 and 1. If I omit the -gpu flag and run on the CPU, I don't see an exception. Let me try your suggestion.

jainanshul commented 8 years ago

There was only one line in cl_kernels.cpp, which I replaced with

ss << tile_double << "\n\n"; // NOLINT
ss << "#endif" << "\n\n";

Still the same crash.

naibaf7 commented 8 years ago

I'm not sure you did what I meant... I proposed to remove all double kernels, which means to remove lines 107 to 139 from https://github.com/BVLC/caffe/blob/opencl/src/caffe/greentea/cl_kernels.cpp

However, I'll come up with a way to identify the failing kernels individually by the end of the week, for proper debugging.
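
One possible way to identify failing kernels individually is to compile each kernel source on its own and collect the ones that fail along with their build logs. This is a sketch under stated assumptions, not the fix that was later committed: `NamedKernel`, `CompileFn`, and `FindFailingKernels` are hypothetical names. In real use the compile callback would wrap the OpenCL calls `clBuildProgram` and `clGetProgramBuildInfo` (with `CL_PROGRAM_BUILD_LOG`); here it is left abstract so the example does not depend on an OpenCL installation.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pairing of a kernel's name with its source text.
struct NamedKernel {
  std::string name;
  std::string source;
};

// Hypothetical compile callback: returns true on success and writes the
// compiler's build log into *log otherwise.
using CompileFn =
    std::function<bool(const std::string& source, std::string* log)>;

// Compiles every kernel separately and returns (name, build log) pairs
// for the kernels that failed, so a single bad kernel can be pinpointed.
std::vector<std::pair<std::string, std::string>> FindFailingKernels(
    const std::vector<NamedKernel>& kernels, const CompileFn& compile) {
  std::vector<std::pair<std::string, std::string>> failures;
  for (const NamedKernel& kernel : kernels) {
    std::string log;
    if (!compile(kernel.source, &log)) {
      failures.emplace_back(kernel.name, log);
    }
  }
  return failures;
}
```

Compiling one kernel per program is slower than building everything at once, so this is meant as a debugging aid rather than a default build path.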

jainanshul commented 8 years ago

@naibaf7 Sorry, I misunderstood your intent. Deleting these lines and recompiling doesn't fix the crash. However, I was able to fix the crash by applying https://gist.github.com/jainanshul/93e932cc9f31e96adf3d. This of course means the timings are all 0s, but it shows the root cause of the crash.

naibaf7 commented 8 years ago

@jainanshul Right. Ironically, this was a change proposed by an Intel PR. Funny that it breaks for Iris Pro on Mac. I made some changes to the benchmark code. Please check whether this fixes it.