zif520 opened this issue 8 years ago
@naibaf7 @bhack @sh1r0 clBLAS 2.4 is useful on phones: AlexNet takes only 300 ms with clBLAS versus 800 ms without it.
@zif520 I don't know whether it is appropriate to ask here, but could you share the install instructions for clBLAS? I cannot install it on the Mali-T628. Although I searched the Internet, I didn't find any useful information. When running cmake, I encounter an error.
I plan to run Caffe on an ARM Mali GPU as you did. If you can share some information with me, that would be great. Thanks.
@zif520 @sambookhon @sh1r0 Please use the following branch for OpenCL from now on: https://github.com/BVLC/caffe/tree/opencl
@sambookhon 1. Delete the option '-m32'; Android doesn't support it. 2. I didn't encounter that error; could you share more information?
I'm going home now for Chinese New Year; I'll post my cmake details when I'm back. :)
@naibaf7 It is great to hear that! :+1: I am learning about OpenCL from books such as "OpenCL in Action", so perhaps I can also help with that branch later :)
@zif520 Thanks for sharing. Should I post my information here (I am afraid of cluttering this thread) or send it by email (my email is "fishfrank23" "@" "gmail.com")? Happy Chinese New Year.
/cc @krikru
@zif520 Sorry to bother you again. Could you provide your cmake details? Thanks.
This thread is very interesting. I've been trying to get Caffe to work on Android. The results seem surprising: Caffe running on the Mali GPU seems to be 2-3x slower than the CPU, but about 4-5x more energy efficient. The test was run on a Galaxy S6 (Mali T760, Peak Performance 200 GFlops).
Since GEMM is the core of convolution in Caffe, I decided to profile its performance on Android. It seems that ViennaCL is not as efficient as some simple kernels. Now I am able to get the GPU to run as fast as the CPU for large matrices (2k x 2k). This is still counter-intuitive, since normally we expect GPUs to be much faster.
See: https://github.com/strin/mocha-profile
The kernel implementations can be found here:
OpenCL kernels for GEMM: https://github.com/strin/gemm-android
Any thoughts?
@sh1r0 did you get a chance to integrate the code from https://github.com/BVLC/caffe/tree/opencl into your opencl_dev branch? Also, I saw some references to the OpenCL port being slow and am wondering whether it is slower than CPU-only, or only slower compared to CUDA?
@jainanshul I'm working on my own implementation of convolutions for OpenCL to make it faster while also reducing memory usage. I think this should also help on ARM/Android devices.
@naibaf7 would it be able to use the existing caffe models?
@jainanshul Yes. https://github.com/BVLC/caffe/tree/opencl is fully Caffe compatible with all existing models.
Ah, you were talking about https://github.com/BVLC/caffe/tree/opencl, but at the moment it requires CUDA to be installed to use OpenCL. My Android device doesn't have an Nvidia GPU, so no CUDA is available. Is there any way to try OpenCL without requiring CUDA?
@jainanshul It should be possible to disable CUDA and cuDNN in the Makefile.config (or in the CMake configuration). If that is not the case, please raise an issue so that I can fix the offending code.
@naibaf7 Do you know if there is any equivalent of cuDNN for OpenCL? cuDNN is basically a library of primitives for DNNs that utilizes CUDA, and as such can be used from any deep learning framework. If there were an equivalent for OpenCL, we wouldn't need so many different implementations of basically the same functionality (one per framework), but could just use the same implementation for all frameworks. I believe this would also make the implementation faster, as it would unite people from different projects to work on the same OpenCL implementation.
We are looking at something like @krikru described for this year's OpenCV GSoC (if the OpenCV organization is accepted again). @naibaf7 Are you still eligible for GSoC and interested?
@bhack I unfortunately don't have time to do this. However, I am working on a fast, flexible cuDNN replacement for OpenCL at the moment. Forwarding is implemented; now I'm writing the autotuning and the backward function.
@naibaf7 would the above work be part of your OpenCL branch, and what timeline are you looking at?
@naibaf7 Nice! If it is generic enough and BSD-license compatible, we could consider putting a student on integrating it.
@bhack I think it could be :) it also supports grouping, dilation, stride, padding, N-dimensional. There's still quite some work left to do before I dare to release it to the public though :)
Interesting thread. I've learned a lot.
I am interested in providing a BLAS implementation for mobile GPUs. (For the CPU, I suggest OpenBLAS, haha.)
As @bhack mentioned, Google released ScriptIntrinsicBLAS for RenderScript. Is it a good idea to use OpenCL for Android?
@xianyi It would be possible to use RenderScript if someone is willing to write a whole backend for it. OpenCL could be just as fast, but sometimes the implementations provided by the vendors are lacking. @karlrupp has tested this and knows a lot about it. He mentioned that even simple synthetic OpenCL kernels do not reach the peak performance of those mobile chips.
@naibaf7 Nice! I believe that would be really valuable. The most important thing to get right before you release it is the interface, because that will be hard to change afterwards; the rest can be improved later. I'm looking forward to seeing it released.
@naibaf7 @xianyi We need to consider that Vulkan has now been released, so SPIR-V is also a strategic target for Android in the very near future. SYCL proposes single-source programming, and I think it should be taken into serious consideration.
@sambookhon Sorry for the wait :) Use clBLAS 2.4; it supports OpenCL 1.1.
1. First build it on Ubuntu with "cmake ..; make; make install". Set BUILD_TEST, BUILD_PERFORMANCE, BUILD_SAMPLE, BUILD_CLIENT and BUILD_KTEST to OFF so those targets aren't built, and set(Boost_USE_MULTITHREADED OFF); set all Boost options OFF.
2. Then build it with the NDK; delete -m${TARGET_PLATFORM}.
3. Delete set(TIME_LIBRARY "rt") in \clblas\library\tools\tune\CMakeLists.txt, as Android doesn't support -lrt.
That's all; you should be able to build it :) Good luck.
@strin
"(Mali T760, Peak Performance 200 GFlops)."
I measured only 74 GFlops with OpenCL.
We tested a Huawei Mate 8 with a T880MP4 at only 72 GFlops, but half float will be faster.
AlexNet forward takes 300 ms, but VGG forward takes 3 s, using clBLAS 2.4.
The S7 will ship with a T880 MP12; it is powerful!
@zif520, thank you for the data. Is there room for improvement in BLAS or DNN libraries on mobile GPUs?
@xianyi AlexNet:
(1) ViennaCL takes 800 ms.
(2) clBLAS 2.4 takes 300 ms. I also tested clBLAS 2.6 (with the OpenCL 1.2 functions deleted; GEMM doesn't use them) and it also takes 300 ms. clBLAS 2.10 adds an approach called AutoGemm (https://github.com/clMathLibraries/clBLAS/wiki/AutoGemm), but it uses Python and I can't run it on mobile.
(3) Half float (16-bit float) is useful.
But I am not familiar with BLAS and OpenCL :)
PS: we tested OpenBLAS on a mobile CPU; it is faster than Eigen. The CPU with 4 cores takes 250 ms, faster than the GPU for now.
@zif520, I think AutoGemm is tuned for AMD GPUs and may not be suitable for mobile GPUs.
I think the OpenBLAS ARMV7 kernels are not fully optimized for your testbed. We just released OpenBLAS Cortex-A57 kernels for AArch64. Meanwhile, I want to introduce the OpenVML project: https://github.com/xianyi/OpenVML. We implement vectorized powx and exp using ARM NEON instructions.
@xianyi Does it support Cortex-A57 only? Can I try it on an A72?
You can try it on an A72 with make TARGET=CORTEXA57.
@zif520 I think the Galaxy S6 comes with a Mali T760 MP8. According to http://kyokojap.myweb.hinet.net/gpu_gflops/, the peak is 200 GFlops. I also ran a benchmark and got something close to ~74 GFlops.
@strin Perhaps the benchmark only uses 4 cores; http://kyokojap.myweb.hinet.net/gpu_gflops/ is right for the Kirin 950.
@xianyi I tested CaffeNet; it takes 300 ms with 4 cores, TARGET=CORTEXA57 and OpenMP, but Eigen takes only 200 ms with 4 cores.
@zif520, thank you for the testing.
@zif520 is the above result with OpenCL on ARM?
@bhack I'm interested in the gsoc thing
@edgarriba Try asking @naibaf7 whether you can contribute something in the meantime.
@bhack @naibaf7 OK! But what you are suggesting is an update of Caffe with OpenCL, right? I'm not sure how that would fit into OpenCV, since as I understand it, OpenCV has its own implementation of Caffe. I'm also interested in the training part, at the same abstraction level as Keras, Lasagne and others. Not sure if it will work. If you want, we can discuss that in the forum.
@edgarriba I've added @naibaf7 to the group. Please ask there.
@jainanshul does ARM support OpenCL? I had seen that ARM will support it in the future; my test is based on NEON.
@zif520 some ARM vendors do provide OpenCL implementations.
@jainanshul Can you give some examples for that?
@bhack @naibaf7 nice! just posted there
@zif520 newer Qualcomm Snapdragon chips support OpenCL on ARM.
@naibaf7
"I'm working on my own implementation of convolutions for OpenCL to make it faster while also reducing memory usage. I think this should also help on ARM/Android devices."
Is there any update on this? From what I have seen in this thread so far, it seems the OpenCL GPU runs slower than the CPU. I will be experimenting with OpenCL Caffe on an Android device and will post the results in this thread. In the meantime, if you've made any progress toward optimizing performance for mobile devices, please let me know.
@jainanshul I have a forward kernel that could be optimized / tested on mobile if you are interested. The backward kernel is a bit more complicated and I'm still working on that, with a planned initial release within 2-3 weeks. Let me know if you feel like experimenting with it, then I can send you a pre-release of the forward kernel code and verification tests.
@jainanshul Is it the 820? The 820's DSP is strong, with 1024-bit SIMD, while the Kirin 950's NEON is 128-bit. We don't have a phone with an 820; could you share your results with us?
I have also heard that a forward pass on the 820 in CPU mode takes only 50 ms with AlexNet.
@naibaf7 could you share the forward kernel for us to test? I haven't made much progress on OpenCL.
@zif520 the chip I am using has an Adreno 510 GPU. I will share the results within a few days. @naibaf7 please share your experimental code when you can and I would be happy to try it.
Hi @sh1r0, I am very interested in your project. Are there plans to support GPUs? For example, ARM Mali OpenCL 1.1 GPUs.