sh1r0 / caffe-android-lib

Porting caffe to android platform

Will it support GPU, e.g. an ARM Mali GPU? #23

Open zif520 opened 8 years ago

zif520 commented 8 years ago

Hi @sh1r0, I am very interested in your project. Are there plans to support GPU, for example ARM Mali GPUs with OpenCL 1.1?

zif520 commented 8 years ago

@naibaf7 @bhack @sh1r0 clBLAS 2.4 is useful on phones: AlexNet forward costs only 300 ms with it, versus 800 ms without clBLAS.
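
For context, this is roughly what a single-precision GEMM call looks like through the clBLAS C API (a minimal sketch; the helper name and buffer setup are illustrative, not Caffe's actual code):

    #include <clBLAS.h>

    /* Illustrative helper: C = A * B for row-major MxK and KxN float
     * buffers already resident on the device. Error handling trimmed. */
    cl_int run_sgemm(cl_command_queue queue,
                     cl_mem bufA, cl_mem bufB, cl_mem bufC,
                     size_t M, size_t N, size_t K)
    {
        cl_event event;

        if (clblasSetup() != clblasSuccess)   /* one-time library init */
            return -1;

        clblasStatus st = clblasSgemm(
            clblasRowMajor, clblasNoTrans, clblasNoTrans,
            M, N, K,
            1.0f, bufA, 0, K,                 /* lda = K (row-major A) */
                  bufB, 0, N,                 /* ldb = N */
            0.0f, bufC, 0, N,                 /* ldc = N */
            1, &queue, 0, NULL, &event);

        if (st == clblasSuccess)
            clWaitForEvents(1, &event);       /* block until GEMM finishes */

        clblasTeardown();
        return (cl_int)st;
    }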

sambookhon commented 8 years ago

@zif520 I don't know whether it is good to ask here, but could you share the install instructions for clBLAS? I cannot install it on a Mali-T628, and although I searched the Internet, I didn't find useful information. When running cmake, I encounter:

  1. error: unrecognized command line option '-m32'
  2. CMakeFiles/Makefile2:109: recipe for target 'library/CMakeFiles/clBLAS.dir/all' failed

I plan to run caffe on an ARM Mali GPU as you did. If you can share some information with me, that would be great. Thanks

naibaf7 commented 8 years ago

@zif520 @sambookhon @sh1r0 Please use the following branch for OpenCL from now on: https://github.com/BVLC/caffe/tree/opencl
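
Checking out that branch is the usual git flow (a quick sketch; see the branch's own README for build instructions):

    git clone https://github.com/BVLC/caffe.git
    cd caffe
    git checkout opencl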

zif520 commented 8 years ago

@sambookhon 1. Delete the option '-m32'; Android doesn't support it. 2. I did not encounter that error; could you share more information?

I am going home now for Chinese New Year; I will send you my cmake details when I'm back. :)

zif520 commented 8 years ago

@naibaf7 That is great to hear! :+1: I am learning about OpenCL from books such as "OpenCL in Action"; perhaps I can also help with that branch later :)

sambookhon commented 8 years ago

@zif520 Thanks for sharing. Should I post my information here (I am afraid of cluttering this thread), or send it by email (my email is "fishfrank23" "@" "gmail.com")? Happy Chinese New Year.

bhack commented 8 years ago

/cc @krikru

sambookhon commented 8 years ago

@zif520 Sorry to bother you again. Could you provide your cmake details? Thanks.

strin commented 8 years ago

This thread is very interesting. I've been trying to get caffe to work on Android. The results seem surprising: caffe running on the Mali GPU seems to be 2-3x slower than the CPU, but about 4-5x more energy efficient. The test was run on a Galaxy S6 (Mali T760, peak performance 200 GFLOPS).

Since GEMM is the core of convolution in caffe, I decided to profile its performance on Android. It seems that ViennaCL is not as efficient as some simple hand-written kernels. Now I am able to get the GPU to run as fast as the CPU for large matrices (2k x 2k). This is still counter-intuitive, since normally we expect GPUs to be much faster.

See: https://github.com/strin/mocha-profile

The kernel implementations can be found here:

OpenCL kernels for GEMM: https://github.com/strin/gemm-android
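
For reference, a naive OpenCL GEMM kernel of the kind being benchmarked here looks roughly like this (a minimal sketch, one work-item per output element; tuned kernels add tiling and local memory on top):

    // Naive single-precision GEMM: C = A (MxK) * B (KxN).
    // One work-item computes one element of C; global size = (N, M).
    __kernel void sgemm_naive(const int M, const int N, const int K,
                              __global const float *A,
                              __global const float *B,
                              __global float *C)
    {
        const int col = get_global_id(0);  // 0..N-1
        const int row = get_global_id(1);  // 0..M-1
        if (row >= M || col >= N) return;

        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }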

Any thoughts?

jainanshul commented 8 years ago

@sh1r0 did you get a chance to integrate the code from https://github.com/BVLC/caffe/tree/opencl into your opencl_dev branch? Also, I saw some references to the OpenCL port being slow; is it slower than CPU-only, or slower compared to CUDA?

naibaf7 commented 8 years ago

@jainanshul I'm working on my own implementation of convolutions for OpenCL to make it faster while also reducing memory usage. I think this should also help on ARM/Android devices.

jainanshul commented 8 years ago

@naibaf7 would it be able to use the existing caffe models?

naibaf7 commented 8 years ago

@jainanshul Yes. https://github.com/BVLC/caffe/tree/opencl is fully Caffe compatible with all existing models.

jainanshul commented 8 years ago

Ah, you were talking about https://github.com/BVLC/caffe/tree/opencl, but at this moment it requires CUDA to be installed in order to use OpenCL. My Android device doesn't have an Nvidia GPU, so no CUDA is available. Is there any way to try OpenCL without requiring CUDA?

naibaf7 commented 8 years ago

@jainanshul It should be possible to disable CUDA and cuDNN in Makefile.config (or in the CMake configuration). If that is not the case, please raise an issue so that I can fix the offending code.
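
For illustration, the relevant switches look roughly like this (an assumption on my part: flag names vary between revisions of the OpenCL branch, so check the shipped Makefile.config.example rather than trusting these):

    # Makefile.config sketch for the OpenCL branch (names illustrative)
    USE_GREENTEA := 1   # enable the OpenCL backend
    USE_CUDA := 0       # build without CUDA
    USE_CUDNN := 0      # build without cuDNN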

krikru commented 8 years ago

@naibaf7 Do you know if there is any equivalent of cuDNN for OpenCL? cuDNN is basically a library of DNN primitives that uses CUDA, and as such it can be used from any deep learning framework. If there were an equivalent for OpenCL, we wouldn't need so many different implementations of basically the same functionality (one per framework), but could use the same implementation everywhere. I believe this would also make the implementation faster, since it would unite people from different projects behind the same OpenCL code.

bhack commented 8 years ago

We are looking at something like what @krikru described for the OpenCV GSoC this year (if the OpenCV organization is accepted again). @naibaf7 Are you still eligible for GSoC, and interested?

naibaf7 commented 8 years ago

@bhack I unfortunately don't have time to do this. However, I am working on a fast, flexible cuDNN replacement for OpenCL at the moment. The forward pass is implemented; now I'm writing the autotuning and the backward pass.

jainanshul commented 8 years ago

@naibaf7 Would the above work be part of your opencl branch, and what timeline are you looking at?

bhack commented 8 years ago

@naibaf7 Nice! If it is generic enough and BSD-license compatible, we could consider putting a student on integrating it.

naibaf7 commented 8 years ago

@bhack I think it could be :) It also supports grouping, dilation, stride, padding, and N-dimensional convolutions. There's still quite some work left before I dare release it to the public, though :)

xianyi commented 8 years ago

Interesting thread. I learned a lot.

I am interested in providing a BLAS implementation for mobile GPUs. (For the CPU, I suggest OpenBLAS, haha.)

As @bhack mentioned, Google released ScriptIntrinsicBLAS for RenderScript. Is it a good idea to use OpenCL for Android?

naibaf7 commented 8 years ago

@xianyi It would be possible to use RenderScript if someone is willing to write a whole backend for it. OpenCL could be just as fast, but sometimes the implementations provided by vendors are lacking. @karlrupp tested this and knows a lot about it; he mentioned that even simple synthetic OpenCL kernels do not reach the peak performance of those mobile chips.

krikru commented 8 years ago

@naibaf7 Nice! I believe that would be really valuable. The most important thing to get right before you release it is the interface, because that will be hard to change afterwards; the rest can be improved later. I'm looking forward to seeing it released.

bhack commented 8 years ago

@naibaf7 @xianyi We need to keep in mind that Vulkan has now been released, so SPIR-V is a strategic target for Android in the very near future as well. SYCL proposes single-source programming, and I think it should be taken into serious consideration.

zif520 commented 8 years ago

@sambookhon Sorry for the wait :) Use clBLAS 2.4; it supports OpenCL 1.1.

  1. First build it on Ubuntu with "cmake ..; make; make install". Set BUILD_TEST, BUILD_PERFORMANCE, BUILD_SAMPLE, BUILD_CLIENT, and BUILD_KTEST to OFF so those parts are not built, and turn all the Boost options off, e.g. set(Boost_USE_MULTITHREADED OFF).
  2. Then build it with the NDK; delete -m${TARGET_PLATFORM}.
  3. Delete set(TIME_LIBRARY "rt") in clblas/library/tools/tune/CMakeLists.txt, as Android doesn't support -lrt.

That's all; you should be able to build it :) Good luck!
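
In other words, the host-side configure step boils down to something like this (a sketch; option names as above, paths illustrative):

    cd clBLAS/src && mkdir build && cd build
    cmake .. -DBUILD_TEST=OFF -DBUILD_PERFORMANCE=OFF -DBUILD_SAMPLE=OFF \
             -DBUILD_CLIENT=OFF -DBUILD_KTEST=OFF
    make && make install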

zif520 commented 8 years ago

@strin Regarding "(Mali T760, Peak Performance 200 GFlops)": I measured only 74 GFLOPS with OpenCL. We tested a Huawei Mate 8 with a T880MP4 and got only 72 GFLOPS, though half float will be faster. AlexNet forward costs 300 ms, but VGG forward costs 3 s, using clBLAS 2.4. The S7 will ship with a T880MP14; it is powerful!

xianyi commented 8 years ago

@zif520, thank you for the data. Is there room for improvement in a BLAS or DNN library on mobile GPUs?

zif520 commented 8 years ago

@xianyi For AlexNet: (1) ViennaCL costs 800 ms. (2) clBLAS 2.4 costs 300 ms; I also tested clBLAS 2.6 (with the OpenCL 1.2 functions deleted; GEMM does not use them) and it also costs 300 ms. clBLAS 2.10 offers a kernel generator called AutoGemm (https://github.com/clMathLibraries/clBLAS/wiki/AutoGemm), but it uses Python, so I can't run it on mobile. (3) Half float (16-bit) is useful.

But I am not familiar with BLAS and OpenCL :)

PS: we tested OpenBLAS on the mobile CPU; it is faster than Eigen. The CPU with 4 cores costs 250 ms, faster than the GPU for now.
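
On the half-float point: in OpenCL, 16-bit float arithmetic is gated behind the cl_khr_fp16 extension, so a kernel must enable it explicitly (a minimal sketch; the device has to actually report the extension):

    // Requires device support for cl_khr_fp16.
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    // Scales a buffer of half-precision values in place.
    __kernel void scale_fp16(__global half *x, const float alpha, const int n)
    {
        const int i = get_global_id(0);
        if (i < n)
            x[i] = (half)((float)x[i] * alpha);
    }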

xianyi commented 8 years ago

@zif520, I think AutoGemm is useful for AMD GPUs and may not be suitable for mobile GPUs.

I think the OpenBLAS ARMV7 kernel is not fully optimized for your testbed. We just released OpenBLAS Cortex-A57 kernels for AArch64. Meanwhile, I want to introduce the OpenVML project: https://github.com/xianyi/OpenVML. We implement powx, exp, etc. on vectors using ARM NEON instructions.
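
To give a flavor of what "implemented with NEON instructions" means, an elementwise operation written with NEON intrinsics in C looks roughly like this (a toy multiply for illustration, not OpenVML's actual powx/exp code):

    #include <arm_neon.h>

    /* Toy elementwise multiply y[i] = a[i] * b[i], processing 4 floats
     * per 128-bit NEON register; assumes n is a multiple of 4. */
    void vmul_f32(const float *a, const float *b, float *y, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);
            float32x4_t vb = vld1q_f32(b + i);
            vst1q_f32(y + i, vmulq_f32(va, vb));
        }
    }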

zif520 commented 8 years ago

@xianyi Does it support the Cortex-A57 only? Can I try it on an A72?

xianyi commented 8 years ago

You can try it on an A72 with make TARGET=CORTEXA57.
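
For a cross-compile from a desktop machine, OpenBLAS takes the toolchain through make variables, along the lines of (a sketch; the compiler prefix depends on your toolchain):

    make TARGET=CORTEXA57 HOSTCC=gcc CC=aarch64-linux-gnu-gcc NO_SHARED=1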

strin commented 8 years ago

@zif520 I think the Galaxy S6 comes with a Mali T760 MP8. According to http://kyokojap.myweb.hinet.net/gpu_gflops/, the peak is 200 GFLOPS. I also ran a benchmark and got something close to ~74 GFLOPS.

zif520 commented 8 years ago

@strin Perhaps the benchmark only used 4 cores; http://kyokojap.myweb.hinet.net/gpu_gflops/ is right for the Kirin 950.

zif520 commented 8 years ago

@xianyi I tested CaffeNet; it costs 300 ms with 4 cores, TARGET=CORTEXA57, and OpenMP, but Eigen only costs 200 ms with 4 cores.

xianyi commented 8 years ago

@zif520, thank you for the testing.

jainanshul commented 8 years ago

@zif520 is the above result with OpenCL on ARM?

edgarriba commented 8 years ago

@bhack I'm interested in the GSoC thing.

bhack commented 8 years ago

@edgarriba Try asking @naibaf7 whether you can contribute something in the meantime.

edgarriba commented 8 years ago

@bhack @naibaf7 Okay! But what you are suggesting is an update to Caffe with OpenCL, right? I'm not sure how that will fit into OpenCV, since as I understand it, OpenCV has its own implementation of Caffe. I'm also interested in the training part, at the same abstraction level as Keras, Lasagne, and others. Not sure if it will work. If you want, we can discuss this in the forum.

bhack commented 8 years ago

@edgarriba I've added @naibaf7 to the group. Please ask there.

zif520 commented 8 years ago

@jainanshul Does ARM support OpenCL? I had seen that ARM will support it in the future; my tests are based on NEON.

jainanshul commented 8 years ago

@zif520 Some ARM vendors do provide OpenCL implementations.

zif520 commented 8 years ago

@jainanshul Can you give some examples of that?

edgarriba commented 8 years ago

@bhack @naibaf7 nice! just posted there

jainanshul commented 8 years ago

@zif520 Newer Qualcomm Snapdragons support OpenCL on ARM.

jainanshul commented 8 years ago

@naibaf7

"I'm working on my own implementation of convolutions for OpenCL to make it faster while also reducing memory usage. I think this should also help on ARM/Android devices."

Is there any update on this? From what I have seen in this thread, the OpenCL GPU path seems to run slower than the CPU. I will be experimenting with OpenCL caffe on an Android device and will post the results in this thread. In the meantime, if you have made any progress on optimizing performance for mobile devices, please let me know.

naibaf7 commented 8 years ago

@jainanshul I have a forward kernel that could be optimized / tested on mobile if you are interested. The backward kernel is a bit more complicated and I'm still working on that, with a planned initial release within 2-3 weeks. Let me know if you feel like experimenting with it, then I can send you a pre-release of the forward kernel code and verification tests.

zif520 commented 8 years ago

@jainanshul Is it the 820? The 820's DSP is strong, with 1024-bit SIMD, while the Kirin 950's NEON is 128-bit. We don't have a phone with an 820; could you share your results with us?

And I had heard that forward on the 820 in CPU mode only costs 50 ms with AlexNet.

@naibaf7 Could you share the forward kernel with us to test? I have not made much progress on OpenCL.

jainanshul commented 8 years ago

@zif520 The chip I am using has an Adreno 510 GPU. I will share the results within a few days. @naibaf7 Please share your experimental code when you can, and I would be happy to try it.