naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

CLBlast support #32

Closed psyhtest closed 8 years ago

psyhtest commented 8 years ago

@naibaf7

I've implemented support for Cedric Nugteren's CLBlast library. The 0.6.0 version had a few issues, but the most recent 0.7.0 release seems to have addressed them. In addition, 0.7.0 added support for xASUM, which helped keep the integration clean.
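For context, a dispatch to the new xASUM routine looks roughly like the sketch below; the wrapper name, buffer arguments and error handling are illustrative rather than the exact code in the branch.

    // Hypothetical sketch (not the exact code from this pull request):
    // dispatching an absolute-sum reduction to CLBlast's xASUM (>= 0.7.0).
    #include <CL/cl.h>       // OpenCL types (cl_mem, cl_command_queue)
    #include <clblast.h>     // CLBlast C++ API

    // Computes sum(|X[i]|) on the device; the scalar result is written into
    // the device buffer Y at offset offY and must be read back by the caller.
    void gpu_asum_clblast(const size_t N,
                          const cl_mem X, const size_t offX,
                          cl_mem Y, const size_t offY,
                          cl_command_queue queue) {
      const size_t incX = 1;  // contiguous vector
      clblast::StatusCode status = clblast::Asum<float>(
          N,
          Y, offY,
          X, offX, incX,
          &queue);
      if (status != clblast::StatusCode::kSuccess) {
        // Caffe would normally turn this into a fatal CHECK/LOG error.
      }
    }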

I've tested this integration on a Samsung Chromebook 2 with an ARM Mali-T628 GPU and driver version v6.0, skipping the known test failures (#28, #29, #30) that are currently open for that platform.

Please review.

bhack commented 8 years ago

/cc @CNugteren

naibaf7 commented 8 years ago

@psyhtest Nice one, thanks. Will be reviewed over the weekend.

naibaf7 commented 8 years ago

@psyhtest Merged this into my branch for now, for people who want to test the cutting edge. I'll do some cleanup and lint corrections before pushing it to the BVLC repository.

psyhtest commented 8 years ago

@naibaf7 Thanks!

You may notice that in the blocks dispatching calls into CLBlast I use different formatting and explicitly define some constants (e.g. incX, offY). I believe this would benefit the clBLAS code as well, since it makes the calls more readable, but I understand if you need to follow the established Caffe style.

Another thing is that even similar code blocks use different naming styles, e.g.:

        clblast::Scal<float>(
          N, // uppercase
          alpha,
          x, offx, incx, // all lowercase
          &queue
        );

        clblast::Asum<float>(
          n, // lowercase
          Z, offZ,
          X, offX, incX, // uppercase X, mixed case offX and incX
          &queue
        );
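A consistent variant could look roughly like this (a sketch using explicit size_t constants for the offsets and increments; the names are illustrative, not the exact variables in the branch):

        // Sketch of the same two calls with one consistent naming convention.
        const size_t offX = 0;  // explicit offset into the vector buffer
        const size_t incX = 1;  // explicit increment (contiguous storage)
        const size_t offZ = 0;  // offset of the scalar result buffer

        clblast::Scal<float>(
          N, alpha,
          X, offX, incX,
          &queue
        );

        clblast::Asum<float>(
          N,
          Z, offZ,
          X, offX, incX,
          &queue
        );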
bhack commented 8 years ago

@naibaf7 How is this different from the autotuning code that you are writing?

naibaf7 commented 8 years ago

@bhack You mean the Greentea LibDNN autotuning? There I attempt to autotune a fused kernel that does not need an intermediate convolution buffer. CLBlast, by contrast, is an autotuned BLAS that can be tested against ViennaCL and clBLAS for regular GEMM-based convolutions. And of course a BLAS is also needed for other auxiliary operations in the network.
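To make the distinction concrete, an im2col-based convolution eventually reduces to a single GEMM, which could be dispatched to CLBlast roughly as in the sketch below; the dimensions and buffer names are illustrative, not Caffe's actual variables.

    // Illustrative only: the GEMM an im2col-based convolution reduces to,
    // dispatched to CLBlast. M = output channels, N = output height * width,
    // K = input channels * kernel height * kernel width.
    #include <CL/cl.h>
    #include <clblast.h>

    clblast::StatusCode conv_gemm_sketch(const size_t M, const size_t N, const size_t K,
                                         const cl_mem weights, const cl_mem col_buffer,
                                         cl_mem output, cl_command_queue queue) {
      return clblast::Gemm<float>(
          clblast::Layout::kRowMajor,
          clblast::Transpose::kNo, clblast::Transpose::kNo,
          M, N, K,
          1.0f,
          weights, 0, K,     // A: weight matrix, leading dimension K
          col_buffer, 0, N,  // B: im2col buffer, leading dimension N
          0.0f,
          output, 0, N,      // C: output feature maps, leading dimension N
          &queue);
    }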