hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0

Which one is the best and top priority for perf work: clTorch, deepCL, clNN, clCaffe? #69

Closed · fsword73 closed this 8 years ago

fsword73 commented 8 years ago

Hi Hugh, if I want to spend 6 months or more doing performance profiling/tuning, can you give a priority order for the following?

* clTorch
* clCaffe
* deepCL
* clNN
* other frameworks for deep convolutional neural networks

Can you give some basic background here?

I spent 3 weeks on the deepCL kernels. I believe there are chances to improve some kernels a lot. I did not go into clTorch, clCaffe, or clNN.

For example: the current BackpropWeightsScratch kernel loops over BatchSize inside the kernel, so it is latency-bound; a BatchSize of 128 means 127x longer latency for small [inputPlane][outputPlane]. A small change would use [inputPlane][outputPlane] * 64 as the number of global threads, so that 1 workgroup of 64 threads can handle a BatchSize of 128 to 512. At the end of the kernel, 1 reduction sums the 128-512 per-batch results into the final result. The latency would be reduced to 1/64 in theory. The reduction should be very fast, since it uses only 4 local memory accesses and 4 adds.
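A minimal sketch of that layout in OpenCL C, assuming a simple 'valid' convolution with stride 1, and with illustrative buffer layouts and parameter names (this is not DeepCL's actual BackpropWeightsScratch code; the reduction shown is a standard log2(64) = 6-step tree rather than the 4-step scheme described above):

```c
// Sketch only: buffer layouts and names are assumptions, not DeepCL's API.
// One workgroup of 64 work-items computes one weight gradient; the batch is
// split across the 64 threads, then a local-memory tree reduction combines
// the 64 partial sums.
#define WORKGROUP_SIZE 64

kernel void backprop_weights_reduced(
        const int batchSize,
        const int numInputPlanes, const int numOutputPlanes,
        const int inputSize,            // input images are inputSize x inputSize
        const int outputSize,           // output images are outputSize x outputSize
        const int filterSize,           // filters are filterSize x filterSize
        global const float *gradOutput, // [n][outPlane][outRow][outCol]
        global const float *input,      // [n][inPlane][inRow][inCol]
        global float *gradWeights,      // [outPlane][inPlane][fRow][fCol]
        local float *partials           // WORKGROUP_SIZE floats of scratch
) {
    const int localId = get_local_id(0);
    const int weightId = get_group_id(0); // one workgroup per weight element

    // decode weightId -> (outPlane, inPlane, fRow, fCol)
    const int fCol = weightId % filterSize;
    const int fRow = (weightId / filterSize) % filterSize;
    const int inPlane = (weightId / (filterSize * filterSize)) % numInputPlanes;
    const int outPlane = weightId / (filterSize * filterSize * numInputPlanes);

    // each thread accumulates a strided slice of the batch
    float sum = 0.0f;
    for (int n = localId; n < batchSize; n += WORKGROUP_SIZE) {
        global const float *gradOut = gradOutput
            + (n * numOutputPlanes + outPlane) * outputSize * outputSize;
        global const float *in = input
            + (n * numInputPlanes + inPlane) * inputSize * inputSize;
        for (int outRow = 0; outRow < outputSize; outRow++) {
            for (int outCol = 0; outCol < outputSize; outCol++) {
                // 'valid' convolution, stride 1, no padding assumed
                sum += gradOut[outRow * outputSize + outCol]
                     * in[(outRow + fRow) * inputSize + (outCol + fCol)];
            }
        }
    }
    partials[localId] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    // tree reduction over the 64 partials: log2(64) = 6 halving steps
    for (int offset = WORKGROUP_SIZE / 2; offset > 0; offset >>= 1) {
        if (localId < offset) {
            partials[localId] += partials[localId + offset];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (localId == 0) {
        gradWeights[weightId] = partials[0];
    }
}
```

The host would enqueue this with a global size of numOutputPlanes * numInputPlanes * filterSize * filterSize * 64 and a local size of 64, so that each workgroup owns exactly one weight gradient.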
Thanks in advance for your time.
hughperkins commented 8 years ago

Well... that's a good question :-) So, how about we focus on the convolutional layer? That is the main bottleneck. And rather than optimizing the conv layer in each library one by one, how about we switch all the libraries to use @naibaf7's greentea convolutional layer?

Are you coming mostly from a 'for fun' angle? Or for a university/research project? Or as part of your job? If it is for academic research, or part of your job, you might be able to get a very quick win by doing one of the following:

It will be easier to migrate cltorch to use greentea convolutions than deepcl, since deepcl needs to run on Windows and be compilable using MSVC 2008 and MSVC 2010, so that it can work with Python 2.7 and Python 3.4.

Alternatively, you could just look at making very fast convolutions in greentea, and I can handle migrating cltorch, and possibly also deepcl, onto the greentea convolutional layer.

I think we might as well all work on optimizing the same convolutional library. Though, I don't know, it depends somewhat on your own goals; how do you see it? I suppose another option is that you could create an entirely new convolutional implementation. But I guess that would be a lot of work. Maybe it is better to, e.g., pick one particular piece of hardware (AMD R9 Fury?) and one particular geometry (3x3 kernels with roughly the appropriate dimensions for residual learning?), make those go super fast, and fit that into greentea somehow?

What are your own goals? How do you see your own work fitting into the existing libraries and frameworks?

fsword73 commented 8 years ago

I plan to have 1-2 interns from a university. I hope to get approval from my boss's boss in the next few weeks.

From the convnet benchmarks, greentea's convolutional layer has a gap compared with cuDNN. Your suggestion is great. The base hardware can be the AMD R9 Fury. I have done some basic testing with mnist. The major issue is backprop.

It is a good point to optimize only one convolutional library. It will save a lot of time.

Which OS version do you recommend for running these benchmarks?

I have profiled deepcl with mnist on Windows 7.

fsword73 commented 8 years ago

My initial plan is to rewrite filters from 3x3 up to 21x21. After profiling mnist, backprop weights is the major performance loss. We can begin from the top 5 kernels; after that, we can rewrite the top 10. I agree that the convolutional layer is the most important one.

hughperkins commented 8 years ago

> I plan to have 1-2 interns from a university. I hope to get approval from my boss's boss in the next few weeks.

Ok, sounds good :-)

> From the convnet benchmarks, greentea's convolutional layer has a gap compared with cuDNN. Your suggestion is great.

Ok :-)

> The base hardware can be the AMD R9 Fury.

Sounds good to me. Furies are about as fast as W9100, but significantly cheaper. The only other option is R9-390X, which has slightly more memory, but slower flops. It will be easier to get informal support from AMD for optimizing on Fury than optimizing on R9-390X.

> It is a good point to optimize only one convolutional library. It will save a lot of time.

Ok

> Which OS version do you recommend for running these benchmarks?

Personally I use Ubuntu 14.04 and 16.04. I believe that Torch users are mostly using Ubuntu 14.04. Windows is good too though :-)

> I have profiled deepcl with mnist on Windows 7.

mnist is kind of a toy really. A recent state-of-the-art network is Microsoft's residual network. Ideally you want to target imagenet. You can use cifar as a playground, e.g. cifar-torch.

Or, if you want to target mnist-sized things, I guess you could target alphago.

Just some ideas :-)

hughperkins commented 8 years ago

(Well... you might consider targeting soumith's convnet-benchmarks. Note that these certainly run on Ubuntu though.)

fsword73 commented 8 years ago

Soumith's convnet-benchmarks is small and much less time-consuming, so it is a good place to start performance tuning.

Imagenet, or Microsoft's residual networks, sounds like a big target. Now I almost have a clear enough goal to apply for interns.

Thanks for your time!