hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0
867 stars 199 forks

Try using unroll+clblas GEMM #16

Closed hughperkins closed 9 years ago

hughperkins commented 9 years ago

Following this article, http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/ (discussion: http://www.reddit.com/r/MachineLearning/comments/338lfs/why_gemm_is_at_the_heart_of_deep_learning/ ), I decided I should try this, in case it gives an easy way to speed up DeepCL for large image sizes.
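For context, the "unroll + GEMM" approach from the article lowers convolution to a single matrix multiply: each input patch is unrolled into a column (im2col), then the filters, flattened into rows, are multiplied against it. Here is a minimal NumPy sketch of the idea — hypothetical helper names, valid (non-padded) convolution with stride 1 assumed; it illustrates the technique, not DeepCL's or clBLAS's actual API:

```python
import numpy as np

def im2col(images, filter_size):
    """Unroll [N, C, H, W] images into per-image matrices so that
    convolution becomes one GEMM. Assumes valid convolution, stride 1."""
    n, c, h, w = images.shape
    out = h - filter_size + 1
    cols = np.empty((n, c * filter_size * filter_size, out * out))
    for i in range(out):
        for j in range(out):
            patch = images[:, :, i:i + filter_size, j:j + filter_size]
            cols[:, :, i * out + j] = patch.reshape(n, -1)
    return cols

def conv_gemm(images, filters):
    """filters: [numFilters, C, fH, fW] -> output [N, numFilters, out, out]."""
    num_filters, c, fh, fw = filters.shape
    cols = im2col(images, fh)             # [N, C*fh*fw, out*out]
    w = filters.reshape(num_filters, -1)  # [numFilters, C*fh*fw]
    out = int(np.sqrt(cols.shape[2]))
    result = w @ cols                     # the GEMM (batched over N)
    return result.reshape(images.shape[0], num_filters, out, out)
```

The appeal is that the GEMM can then be handed to a tuned BLAS (clBLAS here); the cost is that the unrolled matrix is filterSize² times larger than the input, which is where the memory trouble below comes from.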

My verdict? Not useful :-(

I tried this on my laptop and on a K520, and the results were:

For batchsize=128, inputplanes=32, inputsize=128, numfilters=32, filtersize=5, on a K520 got:

The matrices are apparently a bit too big for unroll + clBLAS, so I tried a smaller batch size: batchsize=16, inputplanes=32, inputsize=128, numfilters=32, filtersize=5:
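A rough back-of-envelope calculation suggests why the larger batch size blows up — assuming float32 and valid (non-padded) convolution with stride 1, which may not match DeepCL's exact layout:

```python
# Unrolled-matrix size for the failing configuration above.
batch, planes, insize, fsize = 128, 32, 128, 5
outsize = insize - fsize + 1          # 124 output positions per axis
rows = batch * outsize * outsize      # one unrolled column per output location
cols = planes * fsize * fsize         # one entry per filter weight
bytes_needed = rows * cols * 4        # float32
print(bytes_needed / 2**30)           # ~5.9 GiB -- far beyond the K520's memory
```

Dropping to batchsize=16 cuts this by 8x, to under 1 GiB, which is consistent with the smaller run fitting.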

Note that propagate1 is DeepCL's most generic, least-optimized kernel. It doesn't use local memory (which is why it is generic and works on pretty much anything, unless it runs out of GPU global memory). Kernels that use local memory are around 3-10 times faster than propagate1.

Overall, my current conclusion: unroll + clBLAS GEMM doesn't seem promising.

=> closing issue.