hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0
867 stars 199 forks

Try using unroll+clblas GEMM #16

Closed hughperkins closed 9 years ago

hughperkins commented 9 years ago

Following this article, http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/ (discussion: http://www.reddit.com/r/MachineLearning/comments/338lfs/why_gemm_is_at_the_heart_of_deep_learning/ ), I decided I should try this, in case it gives an easy way to speed up DeepCL for large image sizes.
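For context, the "unroll + GEMM" approach from the article lowers convolution to a single matrix multiply: each input patch is unrolled into a column (im2col), then the filters, flattened into rows, are multiplied against it. Here is a minimal NumPy sketch of the idea — hypothetical helper names, valid (non-padded) convolution with stride 1 assumed; it illustrates the technique, not DeepCL's or clBLAS's actual API:

```python
import numpy as np

def im2col(images, filter_size):
    """Unroll [N, C, H, W] images into per-image matrices so that
    convolution becomes one GEMM. Assumes valid convolution, stride 1."""
    n, c, h, w = images.shape
    out = h - filter_size + 1
    cols = np.empty((n, c * filter_size * filter_size, out * out))
    for i in range(out):
        for j in range(out):
            patch = images[:, :, i:i + filter_size, j:j + filter_size]
            cols[:, :, i * out + j] = patch.reshape(n, -1)
    return cols

def conv_gemm(images, filters):
    """filters: [numFilters, C, fH, fW] -> output [N, numFilters, out, out]."""
    num_filters, c, fh, fw = filters.shape
    cols = im2col(images, fh)             # [N, C*fh*fw, out*out]
    w = filters.reshape(num_filters, -1)  # [numFilters, C*fh*fw]
    out = int(np.sqrt(cols.shape[2]))
    result = w @ cols                     # the GEMM (batched over N)
    return result.reshape(images.shape[0], num_filters, out, out)
```

The appeal is that the GEMM can then be handed to a tuned BLAS (clBLAS here); the cost is that the unrolled matrix is filterSize² times larger than the input, which is where the memory trouble below comes from.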

My verdict? Not useful :-(

I tried this on my laptop and on a K520, and the results were:

For batchsize=128, inputplanes=32, inputsize=128, numfilters=32, filtersize=5, on a K520 got:

The matrices are apparently a bit too big for unroll + clBLAS, so I tried a smaller batch size: batchsize=16, inputplanes=32, inputsize=128, numfilters=32, filtersize=5:
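A rough back-of-envelope calculation suggests why the larger batch size blows up — assuming float32 and valid (non-padded) convolution with stride 1, which may not match DeepCL's exact layout:

```python
# Unrolled-matrix size for the failing configuration above.
batch, planes, insize, fsize = 128, 32, 128, 5
outsize = insize - fsize + 1          # 124 output positions per axis
rows = batch * outsize * outsize      # one unrolled column per output location
cols = planes * fsize * fsize         # one entry per filter weight
bytes_needed = rows * cols * 4        # float32
print(bytes_needed / 2**30)           # ~5.9 GiB -- far beyond the K520's memory
```

Dropping to batchsize=16 cuts this by 8x, to under 1 GiB, which is consistent with the smaller run fitting.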

Note that propagate1 is DeepCL's most generic, least-optimized kernel. It doesn't use local memory (which is why it is generic and works on pretty much anything, unless it runs out of GPU global memory). Kernels that use local memory are around 3-10 times faster than propagate1.

Overall, my current conclusion: unroll + clBLAS GEMM doesn't seem promising.

=> closing issue.