Open psyhtest opened 8 years ago
I'd need to know if there is a pattern on the data index:
(top_id, top_data_id, blob_id, feat_id)=0,0,0,0;
Can you find that out? Or just post some more index + values of the failures.
Since I don't have one of those GPUs myself, it is a bit difficult to track this problem down. The runtests are fine on Intel, AMD and NVIDIA chips otherwise.
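One way to look for such a pattern is to pull the index tuple out of every "debug:" line in a test log and tally the failures per index. The following is only a rough sketch: it assumes the debug lines have the same shape as the ones posted in this thread, and FailureIndex, ParseDebugLine and CountFailuresByBlob are hypothetical helper names, not part of Caffe.

```cpp
#include <cstdio>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical container for the four indices printed by the gradient checker.
struct FailureIndex {
  int top_id = 0;
  int top_data_id = 0;
  int blob_id = 0;
  int feat_id = 0;
};

// Parses a line such as
//   debug: (top_id, top_data_id, blob_id, feat_id)=0,1,1,0; feat = ...
// Returns false if the line does not contain the index tuple.
bool ParseDebugLine(const std::string& line, FailureIndex* out) {
  const std::string key = "feat_id)=";
  const std::string::size_type pos = line.find(key);
  if (pos == std::string::npos) return false;
  const char* values = line.c_str() + pos + key.size();
  return std::sscanf(values, "%d,%d,%d,%d", &out->top_id, &out->top_data_id,
                     &out->blob_id, &out->feat_id) == 4;
}

// Counts failures per blob_id; lines without an index tuple are skipped.
std::map<int, int> CountFailuresByBlob(std::istream& input) {
  std::map<int, int> counts;
  std::string line;
  while (std::getline(input, line)) {
    FailureIndex idx;
    if (ParseDebugLine(line, &idx)) ++counts[idx.blob_id];
  }
  return counts;
}
```

Feeding the full log through CountFailuresByBlob (or a variant keyed on feat_id or top_data_id) would show whether the failures cluster on particular blobs or are spread evenly.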
@psyhtest I'm going to run Caffe on ARM, using OpenCL or CUDA. I have tried another OpenCL version of Caffe, but in the end I failed because of the complex cross-compilation. Could you tell me whether you have successfully used Caffe on ARM (OpenCL or CUDA)? It would be very kind of you to share some details. My mail is zhenght5@gmail.com
Best regards.
@psyhtest The example you provide shows the test passes on the GPU for floats and the same test fails on the GPU for doubles. Are all the failed tests only double precision and on the GPU? If so, this suggests to me perhaps the MALI GPU does not support double, which is optional under the spec. Search for CL_DEVICE_DOUBLE_FP_CONFIG and clGetDeviceInfo, which is how the GPU indicates if it supports double.
@naibaf7 I'll attach a full log showing the Gradient related failures shortly.
@zhenghuitian Yes, I gave up on AMD's port of Caffe too because it used OpenCL 1.2 and C++ templates in kernels. But I did manage to run Caffe with a couple of patches to clBLAS v2.4. I'm also working on support for CLBlast. Our vision is to create an open framework for optimising CNNs on embedded platforms, which is outlined in our IWOCL abstract. All comments and contributions are welcome!
@jyegerlehner I strongly suspect that Mali does support double precision, as I was managing the OpenCL compiler team at ARM when it was implemented :). But perhaps I wasn't doing my job properly, and this omission somehow wasn't detected by conformance testing?.. :)
@psyhtest Hah hah OK I guess that rules that out. I thought it was a rather parsimonious theory though.
@psyhtest Thank you for your answer. It does help me a lot. I am preparing to use CLBlast instead of clBLAS because my ARM board does not have an AMD GPU. I am reading your IWOCL abstract. Thank you again.
@naibaf7
Please see the full (compressed) log from running the following command:
LD_LIBRARY_PATH=/data/install/lib-openblas-v0.2.18/lib:$LD_LIBRARY_PATH \
/data/caffe-naibaf7/build/test/test_all.testbin --gtest_filter=*Gradient* \
> /chronos_downloads/caffe-naibaf7.6c0fbdc.Gradient.log 2>&1
...
[==========] 494 tests from 138 test cases ran. (39528323 ms total)
[ PASSED ] 384 tests.
[ FAILED ] 110 tests, listed below:
...
Also attached is my Makefile.config.
I also observed a similar failure on Odroid-XU3 (similar chip to Chromebook 2 but with the Mali driver v4.0, rather than v6.0):
[ RUN ] DeconvolutionLayerTest/2.TestGradient
./include/caffe/test/test_gradient_check_util.hpp:184: Failure
The difference between computed_gradient and estimated_gradient is 2, which exceeds threshold_ * scale, where
computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0020000000949949026.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,1,0; feat = 0.97146224975585938; objective+ = -1.53898024559021; objective- = -1.53898024559021
./include/caffe/test/test_gradient_check_util.hpp:184: Failure
The difference between computed_gradient and estimated_gradient is 2, which exceeds threshold_ * scale, where
computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0020000000949949026.
debug: (top_id, top_data_id, blob_id, feat_id)=0,1,1,0; feat = 0.97146224975585938; objective+ = -1.1870282888412476; objective- = -1.1870282888412476
It is, however, much more intermittent. (I could not reproduce it since.)
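In both failures above, objective+ and objective- are identical, which is why estimated_gradient comes out as 0. A minimal sketch of the comparison follows, assuming the checker uses a central difference and scales the threshold by the larger of the two gradient magnitudes (with a floor of 1). EstimateGradient and GradientsAgree are illustrative names, and the step size is an assumed value; the threshold of 1e-3 is only inferred from the logged threshold_ * scale of about 0.002 with a gradient magnitude of 2.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>

// Central-difference estimate of d(objective)/d(x) at x.
// If the perturbed feature never reaches the objective (for example because
// the forward pass does not see the change), both evaluations are equal and
// the estimate is exactly 0.
double EstimateGradient(const std::function<double(double)>& objective,
                        double x, double stepsize) {
  const double objective_plus = objective(x + stepsize);
  const double objective_minus = objective(x - stepsize);
  return (objective_plus - objective_minus) / stepsize / 2.0;
}

// Compares the analytic and numeric gradients with a relative tolerance.
// The scale is the larger of the two magnitudes, floored at 1.
bool GradientsAgree(double computed, double estimated, double threshold) {
  const double scale =
      std::max(std::max(std::fabs(computed), std::fabs(estimated)), 1.0);
  return std::fabs(computed - estimated) <= threshold * scale;
}
```

With computed = 2 and estimated = 0, the difference of 2 is far above the tolerance of roughly 0.002, matching the reported failure. An objective that ignores its input reproduces the estimated = 0 case, which would be consistent with the perturbed value not taking effect in the forward pass on the device.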
@psyhtest Yes. I am currently checking whether there are obvious parts of the code/kernels that could be problematic on these devices. After that I would like to run actual tests on the hardware.
@naibaf7
In configuration USE_GREENTEA := 1, I see lots of Caffe test failures on Samsung Chromebook 2 (ARM Cortex-A15 CPU, ARM Mali-T628 GPU) with this fork (latest commit 04503ee). What they all seem to have in common is the word "Gradient" in their name. For example:
The suspicious line is:
but sometimes I see the reverse of this situation, when computed_gradient evaluates to 0 but estimated_gradient evaluates to a non-zero value. This happens both for float and double tests. Any ideas?