solrex / caffe-mobile

Optimized (for size and speed) Caffe lib for iOS and Android with out-of-the-box demo APP.

How can we add GPU support with Metal? #1

Closed weixingsun closed 7 years ago

weixingsun commented 7 years ago

Hi @solrex

Thanks for creating this repo. It works for iOS and looks great and promising. The example code takes 5 seconds to recognize a 28x28 digit picture on my iPhone 6s. I really want to see how I can help use the GPU to speed it up. There is an old repo, DeepLearningKit, that integrates the Metal framework, but it uses Swift, and I am afraid rewriting it in Objective-C/C++ may take a pretty long time.

Let me know if you are working on it or how I can help.

solrex commented 7 years ago

> The example code takes 5 seconds to recognize a 28x28 digit picture on my iPhone 6s.

That is unexpected; there must be something wrong with your phone or build environment. On my iPhone 5S, the first run of the demo app CaffeSimple outputs:

```
Loading caffe model...129.169006ms
Caffe infering...194.81996ms
```

and the second run outputs:

```
Caffe infering...74.63800ms
```

> I really want to see how I can help use the GPU to speed it up. There is an old repo, DeepLearningKit, that integrates the Metal framework, but it uses Swift, and I am afraid rewriting it in Objective-C/C++ may take a pretty long time.

Although there is some work on utilizing the power of mobile GPUs via the Metal framework, RenderScript, and OpenCL, none of it can be directly integrated into Caffe. DeepLearningKit can import Caffe models and weights, but it is a separate deep learning framework, not Caffe.

If you want the Metal framework in Caffe, you must rewrite the forward_gpu() function for EVERY layer used, in Objective-C, and add a corresponding *_layer.mm source file for each (see the sketch below). Similar work has been done for CUDA in the *_layer.cu source files.
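
To give a feel for the kind of per-layer work that implies, here is a minimal, hypothetical sketch of dispatching a Metal compute kernel for a ReLU forward pass from Objective-C++. It is not from caffe-mobile; the file and function names are invented, and a real *_layer.mm would read from the bottom blob, write to the top blob, and keep data resident on the GPU between layers:

```objc
// metal_relu.mm -- hypothetical sketch; not part of caffe-mobile.
// Build with -framework Metal -framework Foundation (ARC) on iOS/macOS.
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

// Metal Shading Language source for a guarded in-place ReLU. A real layer
// would compile this from a .metal file into the app's default library.
static NSString *const kReluSrc =
    @"#include <metal_stdlib>\n"
     "using namespace metal;\n"
     "kernel void relu(device float *data   [[buffer(0)]],\n"
     "                 constant uint &count [[buffer(1)]],\n"
     "                 uint id [[thread_position_in_grid]]) {\n"
     "  if (id < count) data[id] = max(data[id], 0.0f);\n"
     "}\n";

// Runs ReLU over `count` floats on the GPU. A Caffe forward_gpu() would
// instead operate on the layer's bottom/top blobs.
static bool MetalRelu(float *data, size_t count) {
  if (count == 0) return true;
  id<MTLDevice> device = MTLCreateSystemDefaultDevice();
  if (!device) return false;  // No Metal-capable GPU (e.g. the simulator).

  NSError *err = nil;
  id<MTLLibrary> lib = [device newLibraryWithSource:kReluSrc options:nil error:&err];
  id<MTLFunction> fn = [lib newFunctionWithName:@"relu"];
  if (!fn) return false;
  id<MTLComputePipelineState> pso =
      [device newComputePipelineStateWithFunction:fn error:&err];
  if (!pso) return false;

  // Round-trip copy through a shared buffer; a real port would keep blobs
  // resident on the GPU between layers to avoid exactly this overhead.
  id<MTLBuffer> buf = [device newBufferWithBytes:data
                                          length:count * sizeof(float)
                                         options:MTLResourceStorageModeShared];
  uint32_t n = (uint32_t)count;

  id<MTLCommandQueue> queue = [device newCommandQueue];
  id<MTLCommandBuffer> cmd = [queue commandBuffer];
  id<MTLComputeCommandEncoder> enc = [cmd computeCommandEncoder];
  [enc setComputePipelineState:pso];
  [enc setBuffer:buf offset:0 atIndex:0];
  [enc setBytes:&n length:sizeof(n) atIndex:1];
  NSUInteger tg = MIN(pso.maxTotalThreadsPerThreadgroup, (NSUInteger)count);
  [enc dispatchThreadgroups:MTLSizeMake((count + tg - 1) / tg, 1, 1)
      threadsPerThreadgroup:MTLSizeMake(tg, 1, 1)];
  [enc endEncoding];
  [cmd commit];
  [cmd waitUntilCompleted];

  memcpy(data, buf.contents, count * sizeof(float));
  return true;
}
```

Even this toy example shows the overhead a real port has to manage: copying blob data to and from a shared MTLBuffer on every layer would eat most of the GPU win, which is part of why a per-layer rewrite is such a large job.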

> Let me know if you are working on it or how I can help.

The GPUs in mobile devices are not as powerful as the Nvidia family in desktop/server computers, so the GPU acceleration is much smaller than expected. More importantly, there is no CUDA-like solution for the mobile GPU families (PowerVR, Mali, Adreno). Currently I'm watching the trends but not working on it.

weixingsun commented 7 years ago

Thanks for your reply; I'm surprised to hear you get ~200ms from a cold start. I built my lib using the same script as you described, except "make -j 4" -> "make -j 2" (which shouldn't be a problem). Could you please check whether my models look the same as yours, or share your prototxt/models?

model.zip log.txt

I'll check this first and then head to the GPU. Thanks.

solrex commented 7 years ago

A quick glance at your net.prototxt shows that you chose an improper batch size for the testing phase:

```diff
2,7c2,6
< input: "data"
< input_shape {
<   dim: 10000
<   dim: 1
<   dim: 28
<   dim: 28
---
> layer {
>   name: "data"
>   type: "Input"
>   top: "data"
>   input_param { shape: { dim: 64 dim: 1 dim: 28 dim: 28 } }
```

The first dim (the batch size) should be 1 instead of 10000 for single-image inference. The 64 in my/net.prototxt is also improper. A test shows that the second-run time drops to 6.047ms with the first dim set to 1. Thank you for making me aware of it; I'll add a batch-size checking step to README.md.
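
For reference, the corrected test-phase input layer (the layer from the diff above, with only the batch dimension changed to 1) would read:

```
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 1 dim: 1 dim: 28 dim: 28 } }
}
```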

weixingsun commented 7 years ago

@solrex Yes, you are right; I copied it from my server. After changing it to 1, the job is done in 14ms, and down to 5ms the second time (that's what we should get from Caffe). Thanks for pointing it out. As for the GPU, I saw that much of the code is CUDA-related and not very flexible to migrate. Let me think about how to make a Metal patch for iOS, and other approaches for Android.