hi,it will support gpu ,for example ARM mali gpu,

zif520 commented 8 years ago

hi sh1r0, I am very interested in your project,Are there plan to supply gpu? for example ARM mali opencl 1.1 gpu

sh1r0 commented 8 years ago

I would say that it's possible, but I'm not sure when. Currently, if you are interested in caffe w/ OpenCL support, you can refer to https://github.com/BVLC/caffe/pull/2610.

naibaf7 commented 8 years ago

@sh1r0 It should be possible to get BVLC/caffe#2610 to work on android. It can probably be done by replacing the Caffe used in this project by the https://github.com/naibaf7/caffe branch and some minor adaptions/fixes.

sh1r0 commented 8 years ago

@naibaf7 :+1: But I took a look at your branch, and I found that the commits are too much to make the branch like a patch to be easily applied to the upstream master branch. Would you like to rebase your branch?

naibaf7 commented 8 years ago

@sh1r0 Yes and I guess you would need to use 32 bit indexing (pointers) instead of 64 bit indexing for Android devices. What requirements would you have to be able to integrate this?

sh1r0 commented 8 years ago

@naibaf7 Yes, I guess so. :stuck_out_tongue: I think a branch which is rebased to the latest master branch (https://github.com/BVLC/caffe/commit/03a84bf464dd47bcec9ac943f0229a758c627f05) should be enough for me to have some trials. Thanks.

zif520 commented 8 years ago

@naibaf7 @sh1r0
hi,I've run it by compared naibaf7/caffe to caffe-android-lib ,it work well on cpu used EIGEN, but it will failed when run greentea_memset() on GPU mode(mali T880 opencl 1.1,use 32 bit indexing ). it failed when run viennacl::ocl::enqueue(),I am not familiar with opencl,so learn some about it to fix the problem later. could you give some suggestions to me? thanks!

naibaf7 commented 8 years ago

@zif520 Did you change int_tp and int_tpc both to 32bit types for both the OpenCL and C++ part of the code?

https://github.com/naibaf7/caffe/blob/master/include/caffe/definitions.hpp and https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/cl_headers/header.cl

however it might break if you just change it, so I'll verify and fix that.

I have a phone with an Adreno 330 GPU that should also be OpenCL ready... might try to fix it up myself :)... the great news is that OpenCL-Z (from PlayStore) reports a valid libOpenCL.so version 1.2 on that one!

zif520 commented 8 years ago

@naibaf7 there is still some troubles in it ,i will spent some days to fix it , and then https://github.com/amd/OpenCL-caffe/issues/17 said clblas 2.4 support opencl 1.1 ,i also will try it . OpenCL-Z reports my telephone only supply opencl1.1

naibaf7 commented 8 years ago

@zif520 I am currently making my branch ready for 32 bit indexing again, so that both 64 bit and 32 bit work. Then it should be able to compile and run on Android 1.1 devices.

It is not necessary to compile and use clBLAS with my branch, ViennaCL comes with a built-in BLAS that should work on mobile devices with OpenCL 1.1

Can you share what you have done so far? (adaptions, code, ...) that would speed up the process.

zif520 commented 8 years ago

@naibaf7 yes! i will share it when it is completed,it is popular to use caffe(mxnet and so on) on telephone ,many people wang to do that .

naibaf7 commented 8 years ago

@sh1r0 I currently don't have the time for a complete rebase - this has to wait a bit.

@zif520 What's the progress? Is it working with my latest updates?

zif520 commented 8 years ago

@naibaf7 i am sorry,i go home just beause of new year,i will come back 2016,01,04,

sh1r0 commented 8 years ago

@naibaf7 OK, that's fine. I tried to merge my branch onto yours for some trials in the early stage. To see my progress, you can take a look at opencl_dev.

And there are some issues I found according to my tests:

CPU does not work as a OpenCL device (runtime error)
Running on GPU is about 5 times slower than in pure CPU mode (CPU_ONLY with OpenBLAS)

Note: my test device is with Qualcomm Snapdragon 801 (Qualcomm Krait 400 and Qualcomm Adreno 330) and the support of OpenCL 1.2.

I'm not quite sure if I miss anything I need to take care of, as I'm not familiar with OpenCL. :p

Thanks.

bhack commented 8 years ago

@sh1r0 I don't know how amd clblas or viennacl backends are optimized for this kind of devices. Qualcomm has its own Snapdragon optimized BLAS implementation but it is still CPU only.

naibaf7 commented 8 years ago

@sh1r0 Ok cool, at least you got it working!

Now, what is the runtime error that you get with using the CPU on OpenCL? I use a CPU BLAS with CPU devices instead of ViennaCL-BLAS or clBLAS, so that might make issues here.

As for performance, it should definitely not be that slow. But to identify where the culprit is, I'd need to have some layer-wise timings to see what exactly runs slow. Maybe something I can also have a look at, as I have an Adreno 330 as well. Do you know how to do that quickly?

When you enumerate OpenCL devices, is the order the same as in OpenCL-Z?

sh1r0 commented 8 years ago

@naibaf7 Yes, it's really cool to have OpenCL works.

Sorry, I'm not sure what the problem might be, as I just got a segmentation fault when specifying CPU as the target device.

To get layer-wise timings, I think tools/caffe time is a good choice. However, with OpenCL build, I failed to make any executable run successfully on Android. I got ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program '' for classification (cpp_classification) ~~and CANNOT LINK EXECUTABLE: cannot locate symbol "_ZN5caffe3NetIfEC1ERKSsNS_5PhaseEPKS1_" referenced by "./caffe"... for caffe~~. That's weird. EDIT: For caffe, got Segmentation fault.

Yes, the order are consistent to that in OpenCL-Z.

naibaf7 commented 8 years ago

@sh1r0 Ok thanks, I'll try to work out what's going wrong.

Might it be that the binaries do not call set_mode and SetDevice properly? ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program '' would imply the OpenCL kernels weren't compiled.

zif520 commented 8 years ago

@sh1r0 you will refer @naibaf7 ' s code in caffe.cpp:test(),you add setdevices() will fix it,at first ,you will init device

naibaf7 commented 8 years ago

@zif520 Yes, here, it is also important to mention that the device must be set before any solver or network is loaded. Knowledge of which device should be used is ultimately required to compile kernels, allocate memory and dispatch network/solver initialization.

It is even possible to have multiple devices work on multiple networks in parallel, but then the rules are as follows:

Caffe must be initialized with SetDevices on the main thread, providing a complete list of the devices to be used.
SelectDevice can be used to switch the device. When initializing networks on the main thread, select the device before creating a network or solver on that device.
The networks can be trained in parallel by using multiple host threads. In every thread, SelectDevice can switch to a different device. This selection will be thread local.
This threading feature should also work when being used in Android AsyncTasks, Java Threads or in Python Multithreading (without getting into GIL locks), making it very convenient to use.

sh1r0 commented 8 years ago

@zif520 Thanks, I've got CPU working as a OpenCL device. (I used SetDevice only before.) But there might have other issues in tools/caffe such that it still does not work.

@naibaf7 I got some benchmark results, please refer to the link. time.cpp is basically caffe time. The number of iterations is 10 for cpu mode and 1 for gpu mode (as it takes ~6 minutes for a single iteration). I found that there are little difference between using cpu and gpu as the OpenCL device. And as for forward timings, gpu mode (OpenCL) is ~70x slower than cpu mode.

naibaf7 commented 8 years ago

@sh1r0 I think now you benchmarked the OpenCL GPU twice:

    Caffe::SetDevices(gpus);
    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevice(gpus[0]);

should be either:

    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevice(gpus[0]);

or:

    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevices(gpus);
    Caffe::SelectDevice(gpus[0], false);

Besides, I think the ViennaCL GEMM for convolution seems really unsuitable for the Adreno GPU then. I don't know of any BLAS that is optimized for mobile GPUs. Probably a better performance can even be reached by implementing a simple direct convolution instead of using an explicit GEMM at all. Maybe @karlrupp has an idea on this.

bhack commented 8 years ago

@naibaf7 A tuning issue on Adreno was opened at https://github.com/clMathLibraries/clBLAS/issues/136

naibaf7 commented 8 years ago

@bhack Thanks, good to know. However ViennaCL-BLAS seems to have optimization/tuning issues on this as well (which is what we are currently using in this Android-OpenCL experiment). It is a bit unfortunate, since nVidia has well optimized cuBLAS for most devices, while other vendors have basically nothing to offer (yet).

bhack commented 8 years ago

@naibaf7 Have you experimented with https://github.com/ptillet/isaac? Probably could be an alternative path if clBLAS continue to not attract contributions by other vendors. /cc @ptillet

bhack commented 8 years ago

Also Google https://github.com/halide/Halide/tree/master/apps/linear_algebra could be benchmarked on android.

naibaf7 commented 8 years ago

@bhack @zif520 @sh1r0 Added ISAAC compile support to CMake and GNU Makefiles on my branch, if anyone fancies to try. It did not speed up on my GT650 or Intel HD4000. Maybe it can work on mobile.

bhack commented 8 years ago

@ptillet What is the status?

karlrupp commented 8 years ago

@naibaf7 We had a detailed look at mobile GPUs over the summer. Our findings were fairly disappointing: Even with extensive code generation and autotuning we could not get anywhere close to peak (exception: NVIDIA Tegra). Even synthetic FLOP-intensive code did not work well, indicating that OpenCL compiler support still needs much more love.

naibaf7 commented 8 years ago

@karlrupp Thanks for clarification, even though that is not good news, it indicates that the fault lies with the compiler / OpenCL libraries of the vendors and not with our code.

It is very disappointing indeed, I thought by now there would be more efforts and interest by hardware vendors to have solutions ready to compete against nVidia.

sh1r0 commented 8 years ago

@naibaf7 Oh, I thought Net instance is running on the device which is specified in the constructor, so I did in this way:

Net<float> caffe_net(FLAGS_model, caffe::TRAIN, Caffe::GetDevice(device_id, true));

I would like to benchmark cpu (device 1) as a opencl device, and I tried both

Caffe::set_mode(Caffe::GPU);
Caffe::SetDevice(gpus[1]);

and

Caffe::set_mode(Caffe::GPU);
Caffe::SetDevices(gpus);
Caffe::SelectDevice(gpus[1], false);

with

Net<float> caffe_net(FLAGS_model, caffe::TRAIN, Caffe::GetDefaultDevice());

I got segmentation fault during runtime on both cases. :(

To all: It's really disappointing to know that.

zif520 commented 8 years ago

@naibaf7
I had tested your newest code, the problem of 64bit is fixed ,my telephone is HUAWEI mate8 with mali T880 It spend 1120ms with GPU mode but 500ms with CPU mode (with openblas),then i will test it with ISSA later

and i found that it is very slow at first forward,

naibaf7 commented 8 years ago

@zif520 @sh1r0

yes the first pass will be extra slow due to memory allocation and kernel compilation.

bhack commented 8 years ago

In Isaac there is a subdirectory tune/android but in the ini seems to me that actually coner only Intel on android

zif520 commented 8 years ago

@naibaf7 @sh1r0 I had tested in ubuntu with clblas 2.4, GPU is nvidia gtx780,run 1000 iters on mnist only ViennaCL:108s use clblas:86s

clblas 2.8 only support opencl 1.2 ,so i test it with clblas 2.4 first. and then do something as @hughperkins suggested https://github.com/amd/OpenCL-caffe/issues/17

I will move it to android later

mkaskov commented 8 years ago

May be try to use RenderScript? why opencl?

bhack commented 8 years ago

@mkaskov Have you seen halide blas?

naibaf7 commented 8 years ago

@mkaskov Well that would require to write a new backend. Also: http://stackoverflow.com/questions/14385843/why-did-google-choose-renderscript-instead-of-opencl

In theory, they should be able to perform equally well, given there is a reasonably optimized BLAS, and a good OpenCL compiler on that platform.

zif520 commented 8 years ago

hi @naibaf7 you had said "yes the first pass will be extra slow due to memory allocation and kernel compilation." we get a software of caffe ,it is used in my telephone ,it cost 3.4s at first pass, and 1.7s normal,there use vggnet.

but our code spend 8s at first pass ,800ms normal,are there any optimization to reduce the first pass time? I am not familiar with opencl : )

naibaf7 commented 8 years ago

@zif520 I am abit confused what you mean... in general, no. Mem allocation and compile can not be done faster.

In test mode though, the memory footprint can be reduced, which will also speed it up.

zif520 commented 8 years ago

@naibaf7 i dont konw whether it can build offline, just as it: https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/online-offline-compilation/

bhack commented 8 years ago

Seems that official Rendescript optimized version of blas was added in Android API level 23

naibaf7 commented 8 years ago

We could probably add a Renderscript backend to Caffe once I've completed the device/backend abstraction... replicating the math_functions "middle ware" should be fairly quick, then all that would be left to do is adding the custom kernels and kernel launch support.

If anyone wants to do that approach, just contact me.

bhack commented 8 years ago

Investing time on caffe core it is at high risk actually

naibaf7 commented 8 years ago

@bhack Would you elaborate on this?

bhack commented 8 years ago

@naibaf7 I don't know if you have direct contact with BVLC members but seems to me that they are totally out of resource to handling the project and community. What are the prospective? Code fragmentation with hundred of forks? I think we have waited enough conference deadlines to see a clear roadmap https://github.com/BVLC/caffe/issues/2313

naibaf7 commented 8 years ago

@bhack I've tried to contact them a few times. It is currently indeed very hard to get everyone together who should be collaborating. Tomorrow I'll have a conference call with Intel's Beignet maintainer/coordinator, I'm also in contact with AMD's Junli Gu about proceeding on OpenCL. Evan Shelhamer didn't get back at me after his last comments on https://github.com/BVLC/caffe/pull/2610.

However, I'm still not discouraged. If hardware developers want to collaborate and optimize for the devices, I'll collaborate happily. New backends are also welcome.

bhack commented 8 years ago

@naibaf7 Also absence of a comment to https://github.com/Yangqing/caffe2/issues/22 give me more fog for the framework future prespective and more disincentive to invest in caffe.

bhack commented 8 years ago

As I supposed https://github.com/tensorflow/tensorflow/issues/663#issuecomment-171500553

zif520 commented 8 years ago

it is great to hear that ，we will try to optimize opencl at first : )

bhack commented 8 years ago

@naibaf7 http://www.embedded-vision.com/industry-analysis/technical-articles/caffe-deep-learning-framework-interview-core-developers

sh1r0 / caffe-android-lib

hi,it will support gpu ,for example ARM mali gpu, #23