naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

1.TestSimpleConvolution_Spatial tests fail on ARM platforms #31

Closed psyhtest closed 8 years ago

psyhtest commented 8 years ago

3 out of 9 tests in ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial* fail on at least two ARM platforms (Samsung Chromebook 2 and Odroid XU3) with e.g.

[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial
unknown file: Failure
C++ exception with description "ViennaCL: FATAL ERROR: CL_INVALID_VALUE.
If you think that this is a bug in ViennaCL, please report it at viennacl-support@lists.sourceforge.net
and supply at least the following information:
 * Operating System
 * Which OpenCL implementation (AMD, NVIDIA, etc.)
 * ViennaCL version
Many thanks in advance!" thrown in the test body.
...
[  FAILED  ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x2_caffenet_Conv5, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5, where TypeParam = caffe::GPUDevice<float>

It's clear that the problem is in ViennaCL (1.7.1). However, it's worth inspecting the Caffe tests to ensure that nothing obvious makes them fail.

LD_LIBRARY_PATH=/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-naibaf7/build/test/test_all.testbin \
--gtest_filter=ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial* \
> caffe-naibaf7.6c0fbdc.chromebook-2.Convolution_Spatial.log 2>&1

caffe-naibaf7.6c0fbdc.odroid-xu3.Convolution_Spatial.txt caffe-naibaf7.6c0fbdc.chromebook-2.Convolution_Spatial.txt

naibaf7 commented 8 years ago

CL_INVALID_VALUE actually can mean two things:

Since "Convolution_Spatial" is a convolution implementation by Intel, I will pass this on to them. @gongzg would you like to have a look at this?

gongzg commented 8 years ago

@naibaf7 I would like to check it. @psyhtest As I don't have an environment to reproduce this issue, could you please provide more information? Could you rebuild a debug version, run the test case under gdb, and share the backtrace with me? That will make things clearer. Thanks.

psyhtest commented 8 years ago

@gongzg With pleasure.

As the failures appear identical, I'm debugging:

LD_LIBRARY_PATH=/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-naibaf7-issue#31/build/test/test_all.testbin \
--gtest_filter=ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial

as follows:

$ gdb
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
...
(gdb) file /data/caffe-naibaf7-issue#31/build/test/test_all.testbin
(gdb) set args --gtest_filter=ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial
(gdb) set env LD_LIBRARY_PATH=/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) r
Starting program: /data/caffe-naibaf7-issue#31/.build_debug/test/test_all.testbin --gtest_filter=ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial
...
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ConvolutionLayerTest_Spatial/1, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial
Catchpoint 1 (exception thrown), 0xb689b292 in __cxa_throw () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
(gdb) bt
#0  0xb689b292 in __cxa_throw () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#1  0x00143de4 in viennacl::ocl::error_checker<void>::raise_exception (err=-30)
    at /data/install/lib-viennacl-release-1.7.1/viennacl/ocl/error.hpp:610
#2  0x00141e8e in viennacl::ocl::error_checker<void>::checkError (err=-30) at /data/install/lib-viennacl-release-1.7.1/viennacl/ocl/error.hpp:675
#3  0xb5345a32 in viennacl::ocl::context::create_memory_without_smart_handle (this=0x8c5720, flags=40, size=4, ptr=0xbeffeb60)
    at /data/install/lib-viennacl-release-1.7.1/viennacl/ocl/context.hpp:205
#4  0xb53d0330 in viennacl::ocl::context::create_memory (this=0x8c5720, flags=8, size=4, ptr=0xbeffeb60)
    at /data/install/lib-viennacl-release-1.7.1/viennacl/ocl/context.hpp:218
#5  0xb53cd5f0 in caffe::ConvolutionLayerSpatial<float>::verify_result (this=0x10526e0, bottom=std::vector of length 2, capacity 2 = {...}, 
    top=std::vector of length 2, capacity 2 = {...}, index=0, numImages=2, config=0x18685c8) at src/caffe/layers/conv_layer_spatial.cu:688
#6  0xb53cedd0 in caffe::ConvolutionLayerSpatial<float>::setup_convolution (this=0x10526e0, bottom=std::vector of length 2, capacity 2 = {...}, 
    top=std::vector of length 2, capacity 2 = {...}) at src/caffe/layers/conv_layer_spatial.cu:1004
#7  0xb53cf732 in caffe::ConvolutionLayerSpatial<float>::Forward_gpu (this=0x10526e0, bottom=std::vector of length 2, capacity 2 = {...}, 
    top=std::vector of length 2, capacity 2 = {...}) at src/caffe/layers/conv_layer_spatial.cu:1109
#8  0x00169aee in caffe::Layer<float>::Forward (this=0x10526e0, bottom=std::vector of length 2, capacity 2 = {...}, 
    top=std::vector of length 2, capacity 2 = {...}) at ./include/caffe/layer.hpp:522
#9  0x0046eda0 in caffe::ConvolutionLayerTest_Spatial_TestSimpleConvolution_Spatial_Test<caffe::GPUDevice<float> >::TestBody_Impl (this=0x12eda88)
    at src/caffe/test/test_convolution_layer_spatial.cpp:218
#10 0x0046ec5e in caffe::ConvolutionLayerTest_Spatial_TestSimpleConvolution_Spatial_Test<caffe::GPUDevice<float> >::TestBody (this=0x12eda88)
    at src/caffe/test/test_convolution_layer_spatial.cpp:202
#11 0x004e0272 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x12eda88, 
    method=&virtual testing::Test::TestBody(), location=0x6b4e3c "the test body") at src/gtest/gtest-all.cpp:3393
#12 0x004dcdc0 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x12eda88, 
    method=&virtual testing::Test::TestBody(), location=0x6b4e3c "the test body") at src/gtest/gtest-all.cpp:3429
#13 0x004cdf2a in testing::Test::Run (this=0x12eda88) at src/gtest/gtest-all.cpp:3465
#14 0x004ce494 in testing::TestInfo::Run (this=0x88c828) at src/gtest/gtest-all.cpp:3641
#15 0x004ce8e4 in testing::TestCase::Run (this=0x88c608) at src/gtest/gtest-all.cpp:3748
#16 0x004d2346 in testing::internal::UnitTestImpl::RunAllTests (this=0x80d9f8) at src/gtest/gtest-all.cpp:5540
#17 0x004e0bae in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x80d9f8, 
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x4d2185 <testing::internal::UnitTestImpl::RunAllTests()>, location=0x6b5908 "auxiliary test code (environments or event listeners)") at src/gtest/gtest-all.cpp:3393
#18 0x004dd684 in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x80d9f8, 
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x4d2185 <testing::internal::UnitTestImpl::RunAllTests()>, location=0x6b5908 "auxiliary test code (environments or event listeners)") at src/gtest/gtest-all.cpp:3429
#19 0x004d177a in testing::UnitTest::Run (this=0x78e584 <_ZZN7testing8UnitTest11GetInstanceEvE8instance>) at src/gtest/gtest-all.cpp:5177
#20 0x0014165a in main (argc=1, argv=0xbefff464) at src/caffe/test/test_caffe_main.cpp:96
(gdb) c
unknown file: Failure
C++ exception with description "ViennaCL: FATAL ERROR: CL_INVALID_VALUE.
If you think that this is a bug in ViennaCL, please report it at viennacl-support@lists.sourceforge.net and supply at least the following information:
 * Operating System
 * Which OpenCL implementation (AMD, NVIDIA, etc.)
 * ViennaCL version
Many thanks in advance!" thrown in the test body.
[  FAILED  ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial, where TypeParam = caffe::GPUDevice<float> (78621 ms)
[----------] 1 test from ConvolutionLayerTest_Spatial/1 (78621 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (78622 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial, where TypeParam = caffe::GPUDevice<float>

 1 FAILED TEST

Any ideas?
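
One detail in the backtrace above may be worth decoding: frame #4 (create_memory) receives flags=8, while frame #3 (create_memory_without_smart_handle) receives flags=40, and the error code is err=-30 (CL_INVALID_VALUE). The sketch below decodes those two values. The bit assignments are an assumption taken from the standard OpenCL header (CL/cl.h), not from this thread, and the helper names are hypothetical. If the assumption holds, 40 combines CL_MEM_USE_HOST_PTR with CL_MEM_COPY_HOST_PTR, which the OpenCL specification describes as mutually exclusive for clCreateBuffer. That would be consistent with the CL_INVALID_VALUE seen here, though the thread itself does not confirm it as the cause.

```python
# Bit values assumed from the standard OpenCL header (CL/cl.h);
# they are not stated anywhere in this thread.
CL_MEM_FLAGS = {
    1 << 0: "CL_MEM_READ_WRITE",
    1 << 1: "CL_MEM_WRITE_ONLY",
    1 << 2: "CL_MEM_READ_ONLY",
    1 << 3: "CL_MEM_USE_HOST_PTR",
    1 << 4: "CL_MEM_ALLOC_HOST_PTR",
    1 << 5: "CL_MEM_COPY_HOST_PTR",
}


def decode_mem_flags(flags):
    """Return the names of the known cl_mem_flags bits set in `flags`, lowest bit first."""
    return [name for bit, name in sorted(CL_MEM_FLAGS.items()) if flags & bit]


def has_conflicting_host_ptr_flags(flags):
    """True if both USE_HOST_PTR and COPY_HOST_PTR are set (hypothetical helper)."""
    names = decode_mem_flags(flags)
    return "CL_MEM_USE_HOST_PTR" in names and "CL_MEM_COPY_HOST_PTR" in names


if __name__ == "__main__":
    # The flags values below are copied from frames #4 and #3 of the backtrace.
    for frame, flags in (("#4 create_memory", 8),
                         ("#3 create_memory_without_smart_handle", 40)):
        print(frame, flags, decode_mem_flags(flags),
              "conflict" if has_conflicting_host_ptr_flags(flags) else "ok")
```

Under these assumptions, the value passed by Caffe (8) is valid on its own, and the conflicting bit only appears one frame deeper, inside ViennaCL.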

gongzg commented 8 years ago

@psyhtest It looks like an issue in ViennaCL that I fixed a few weeks ago. Could you check your ViennaCL version? I put a pointer in the README, as below:

"Please use the latest git master viennacl which has the patch: https://github.com/viennacl/viennacl-dev/pull/181"

psyhtest commented 8 years ago

@gongzg Unfortunately, using the latest ViennaCL master doesn't seem to have helped on the Odroid XU3. I'll double check on the Chromebook 2.

gongzg commented 8 years ago

@psyhtest Did you rebuild Caffe after upgrading ViennaCL? Also, could you check the backtrace with the latest ViennaCL to see if there is any difference? Thanks.

psyhtest commented 8 years ago

@gongzg I thought I had made a clean build, but I rebuilt everything from scratch and it worked both on the Chromebook 2 and Odroid XU3. I saw a couple of weird messages "Verification of kernel was not successful,fallback to basic kernel" on the Chromebook on the first execution but not any more:

$ LD_LIBRARY_PATH=/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-naibaf7/build/test/test_all.testbin \
--gtest_filter=ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial*
Setting to use device 0
Note: Google Test filter = ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial*
[==========] Running 9 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 9 tests from ConvolutionLayerTest_Spatial/1, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial (488 ms)
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3
Verification of kernel was not successful,fallback to basic kernel
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3 (8233 ms)
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3xPad1
Verification of kernel was not successful,fallback to basic kernel
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3xPad1 (2278 ms)
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial11x11x1x2_caffenet_Conv1
Verification of kernel was not successful,fallback to basic kernel
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial11x11x1x2_caffenet_Conv1 (2462 ms)
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5x1x2_caffenet_Conv2
Verification of kernel was not successful,fallback to basic kernel
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5x1x2_caffenet_Conv2 (10895 ms)
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv3
Verification of kernel was not successful,fallback to basic kernel
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv3 (7701 ms)
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv4
Verification of kernel was not successful,fallback to basic kernel
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv4 (5803 ms)
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x2_caffenet_Conv5
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x2_caffenet_Conv5 (6683 ms)
[ RUN      ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5
[       OK ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5 (21090 ms)
[----------] 9 tests from ConvolutionLayerTest_Spatial/1 (65636 ms total)

[----------] Global test environment tear-down
[==========] 9 tests from 1 test case ran. (65637 ms total)
[  PASSED  ] 9 tests.

Many thanks for your help!
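
In the passing run above, the message "Verification of kernel was not successful,fallback to basic kernel" appears for six of the nine tests, and the three tests without it are the same three that failed in the original report. For anyone who wants to track this across runs, the following is a minimal sketch that summarizes a saved Google Test log. The function name and the regular expressions are hypothetical and are based only on the log lines shown in this thread.

```python
import re

# Patterns modelled on the gtest lines shown in this thread (an assumption,
# not an official gtest log grammar).
RUN = re.compile(r"^\[ RUN\s+\] ([\w/]+\.\w+)")
DONE = re.compile(r"^\[\s*(OK|FAILED)\s*\] ([\w/]+\.\w+)")
FALLBACK = "fallback to basic kernel"


def summarize(log_text):
    """Map each test name to its final status and whether the fallback message was seen."""
    results = {}
    current = None  # test whose output is currently being read
    for line in log_text.splitlines():
        run = RUN.match(line)
        done = DONE.match(line)
        if run:
            current = run.group(1)
            results[current] = {"status": None, "fallback": False}
        elif done:
            entry = results.setdefault(done.group(2),
                                       {"status": None, "fallback": False})
            entry["status"] = done.group(1)
            current = None
        elif FALLBACK in line and current is not None:
            results[current]["fallback"] = True
    return results


if __name__ == "__main__":
    demo = "\n".join([
        "[ RUN      ] Suite/1.A",
        "Verification of kernel was not successful,fallback to basic kernel",
        "[       OK ] Suite/1.A (8233 ms)",
    ])
    for name, info in summarize(demo).items():
        print(name, info["status"], "fallback" if info["fallback"] else "no fallback")
```

Summary lines such as "[  FAILED  ] 1 test, listed below:" are ignored because they do not contain a Suite.Test style name.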