naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

OpenCL counterpart of cuDNN #34

Open dagamayank opened 8 years ago

dagamayank commented 8 years ago

I came across your post on the TensorFlow thread saying that you are developing an OpenCL counterpart to cuDNN. I would like to help/contribute to that project. Let me know where and how I can help. I have extensive OpenCL programming experience and am currently focused on ML activities at AMD.

naibaf7 commented 8 years ago

@dagamayank Thank you, help is very welcome, especially from AMD :) To start, you can have a look at how the kernels are generated and the public interface of the cuDNN replacement: https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp https://github.com/naibaf7/caffe/blob/master/include/caffe/greentea/libdnn.hpp

I can also provide you example kernel strings if you don't want to look at that part of the code and are only interested in providing help on optimizing the kernels for AMD GPUs, which would also be very welcome.

bhack commented 8 years ago

@naibaf7 Have you seen the latest updates on the TensorFlow thread?

naibaf7 commented 8 years ago

@bhack Yes, why? :)

bhack commented 8 years ago

Because I think your work could fit nicely into https://docs.google.com/spreadsheets/d/1YbHn7dAFPPG_PgTtgCJlWhMGorUPYsF681TsZ4Y4LP0/edit?usp=sharing

dagamayank commented 8 years ago

@naibaf7 Kernel strings would be great to have. It would also help if you could provide some steps on how to get started.


naibaf7 commented 8 years ago

@dagamayank Ok, the easiest way to get started is to compile Caffe with USE_LIBDNN turned on in the Makefile.config (https://github.com/naibaf7/caffe/blob/master/Makefile.config.example#L15). Then, if you want to get a kernel string to inspect for optimization purposes, uncomment this line:

  ss << generate_bw_defs();
  ss << generate_bw_kernels("conv_backward");
  ss << generate_wg_defs();
  ss << generate_wg_kernels("conv_weights");

  // Write complete kernel string
  kernel_ = ss.str();

  // std::cout << kernel_ << std::endl;
}

(it's line https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp#L1588)

This will print the kernel string to std::cout so you can examine it, for example in AMD's CodeXL (GPUOpen). Every kernel string consists of 3 main kernels: conv_forward, conv_backward and conv_weights. For conv_backward and conv_weights, there are 2 different algorithms each that can be selected:

typedef enum {
  // Stack the batch update into one GEMM block
  // (deterministic, 1 kernel call)
  // Serializes the batch and may therefore underuse
  // the GPU's compute units.
  LIBDNN_CONVOLUTION_WG_ALGO_DIRECT        = 0,
  // Use multiple GEMM blocks in parallel and update weights atomically
  // (non-deterministic, 1 kernel call, not supported on all devices)
  // Parallelizes the batch and therefore has higher GPU utilization.
  LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC        = 1,
  // Use multiple GEMM blocks and an intermediate buffer
  // to reduce weight updates
  // (deterministic, >= 2 kernel calls)
  // Parallelizes the batch and therefore has higher GPU utilization.
  // NOT IMPLEMENTED YET
  LIBDNN_CONVOLUTION_WG_ALGO_REDUCTION     = 2
} libdnnConvolutionWeightAlgo_t;

typedef enum {
  // Transform data before the GEMM (load, im2col, gemm, store)
  // This method is suitable for convolutions with similar
  // spatial input == output sizes, but can become inefficient
  // if input >> output (with large strides and kernels).
  LIBDNN_CONVOLUTION_BW_ALGO_IM2COL        = 0,
  // Transform data after the GEMM (load, gemm, col2im, store)
  // Sometimes faster than the im2col method, but uses
  // atomic operations and is not deterministic.
  LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC = 1
} libdnnConvolutionBackwardAlgo_t;

Which algorithm is used can be changed here: https://github.com/naibaf7/caffe/blob/master/src/caffe/layers/libdnn_conv_layer.cpp#L63
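For illustration only, here is a minimal self-contained sketch of selecting those algorithms. The enum values are the ones above, but the struct and field names below are made up; check libdnn_conv_layer.cpp for how the choice is actually passed to LibDNNConv:

```cpp
#include <iostream>

// Enums re-declared here only to keep the sketch self-contained;
// in Caffe they come from libdnn.hpp.
typedef enum {
  LIBDNN_CONVOLUTION_WG_ALGO_DIRECT = 0,
  LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC = 1
} libdnnConvolutionWeightAlgo_t;

typedef enum {
  LIBDNN_CONVOLUTION_BW_ALGO_IM2COL = 0,
  LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC = 1
} libdnnConvolutionBackwardAlgo_t;

// Hypothetical config holder standing in for whatever the layer passes down.
struct AlgoChoice {
  libdnnConvolutionWeightAlgo_t wg_algo;
  libdnnConvolutionBackwardAlgo_t bw_algo;
};

int main() {
  // Deterministic weight-gradient GEMM, atomic col2im for the backward pass.
  AlgoChoice choice{LIBDNN_CONVOLUTION_WG_ALGO_DIRECT,
                    LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC};
  std::cout << "wg_algo=" << choice.wg_algo
            << " bw_algo=" << choice.bw_algo << std::endl;
  return 0;
}
```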

Finally, you need to run a network in order to instantiate the layers and get some kernel strings. The recommended starting point for that is using the following command:

./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5

Together with the instructions above, you can dump the kernel strings to a text file and look for optimization opportunities. Note that every convolution layer gets its own set of kernels, so the above command will give you many different ones.

dagamayank commented 8 years ago

@naibaf7 Thanks a lot for these instructions. I will give them a try and report back.

dagamayank commented 8 years ago

I get test failures when running "make runtest" on the code in the master branch of your repo. Is this expected? Two of the failures are from LibDNN. My development environment is an AMD W9100 with Ubuntu 14.04.

[----------] Global test environment tear-down
[==========] 2028 tests from 274 test cases ran. (3614992 ms total)
[ PASSED ] 2013 tests.
[ FAILED ] 15 tests, listed below:
[ FAILED ] NetTest/0.TestSharedWeightsUpdate, where TypeParam = caffe::CPUDevice
[ FAILED ] LibDNNComparativeTest/0.TestBackward, where TypeParam = float
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial11x11x1x2_caffenet_Conv1, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv4, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestGradient_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv3, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x2_caffenet_Conv5, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5x1x2_caffenet_Conv2, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.Test1x1Convolution_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.Test1x1Gradient_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3xPad1, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5, where TypeParam = caffe::GPUDevice

naibaf7 commented 8 years ago

@dagamayank TestSharedWeightsUpdate seems to fail by being off by a small margin. This is weird but can be ignored and is not relevant for this implementation.

The _Spatial failures are from Intel's convolution implementation. I think the fix here is to use the latest ViennaCL development branch: https://github.com/viennacl/viennacl-dev instead of what Ubuntu supplies.

As for LibDNN, these tests should definitely not fail. It would be helpful to get the failure message from the runtest itself (i.e. where the LibDNN runtest aborted). You can test this in detail by using: ./build/test/test_all.testbin --gtest_filter=*LibDNN*Comparative*Backward* 0

dagamayank commented 8 years ago

@naibaf7 Well, I do not clearly understand the output; there are a bunch of lines with values, but the last few lines are:

Error count: 134841/159600
Difference: 3.17333e+06 (value: 2.30564e+06 vs 2.2954e+06)
src/caffe/test/test_libdnn_conv.cpp:1064: Failure
Value of: false
Expected: failure
Which is: true
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double (11638 ms)
[----------] 1 test from LibDNNComparativeTest/1 (11638 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (37154 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 2 tests, listed below:
[ FAILED ] LibDNNComparativeTest/0.TestBackward, where TypeParam = float
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double

naibaf7 commented 8 years ago

@dagamayank I just verified on my W9100 that the backward pass is fine. What driver are you using? I'm using 15.302 (Crimson Edition 15.12 Linux 64 bit). I had problems with the old FirePro driver, so I switched to the Radeon driver.

Do you have any other OpenCL device to check if the backward pass passes the test?

dagamayank commented 8 years ago

@naibaf7 Yes, it is probably the old FirePro driver. If it works on your end with the newer driver, I think we can call it a non-issue for now.

I am going through the kernels right now. Can you explain the reasoning behind the seemingly arbitrary values in the #defines? It will take some time for me to understand what you are doing there.


naibaf7 commented 8 years ago

@dagamayank The defines declare constants for the kernel, such as padding (v_p), stride (v_s), dilation (v_d) and image sizes (v_imsi, v_imso) in each dimension. Other defines are for the GEMM core configuration (such as TSK, TSM, TSN, WPTM, WPTN, ...).

I put them into defines rather than directly into the kernel string for better readability of the kernel itself (i.e. it is easier to see where a constant is used and why). As for documentation, all the values are explained in: https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp (look for add_def, which is the C++ method I use for declaring new kernel #defines).
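To make that mechanism concrete, here is a small self-contained sketch of the add_def pattern (not the actual libdnn code, just an illustration of how constants end up as #defines at the top of the generated kernel string):

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Minimal stand-in for the add_def pattern used in libdnn.cpp:
// constants are emitted as #define lines at the top of the kernel string,
// so the generated OpenCL/CUDA code stays readable.
static void add_def(std::stringstream& ss, const std::string& name, int value) {
  ss << "#ifdef " << name << "\n#undef " << name << "\n#endif\n";
  ss << "#define " << name << " " << value << "\n";
}

int main() {
  std::stringstream ss;
  add_def(ss, "v_p_0", 1);  // padding in dimension 0
  add_def(ss, "v_s_0", 2);  // stride in dimension 0
  add_def(ss, "v_d_0", 1);  // dilation in dimension 0
  add_def(ss, "TSK", 8);    // GEMM tile depth in the K dimension
  ss << "// ...kernel body uses the names above instead of magic numbers...\n";
  std::cout << ss.str();
  return 0;
}
```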

dagamayank commented 8 years ago

@naibaf7

Are you using autotuning to generate the values of those constants? In other words, will the constants be the same for different kernels and for different networks?


naibaf7 commented 8 years ago

@dagamayank Some of the values can be autotuned (such as WPTM, WPTN), while others are determined by the convolution settings (such as v_p, v_s, v_d). However, the autotuner can't store its tuning results yet, so that part is experimental. That means values such as WPTM and WPTN will be the same for every kernel/network at the moment, while v_p, v_s and v_d depend on what kind of convolution you choose (3x3 unpadded, 11x11 with stride, etc.), and the image input/output sizes (v_imsi, v_imso) obviously depend on how big the images/feature maps are in the network.
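As a concrete reference, the spatial defines are tied together by the standard convolution output-size relation; a minimal sketch (not code taken from libdnn):

```cpp
#include <iostream>

// Standard convolution output size for one spatial dimension:
// the same relation ties v_imsi (input size), v_p (pad), v_s (stride),
// v_d (dilation) and the kernel size to v_imso (output size).
int conv_out_size(int in_size, int kernel, int pad, int stride, int dilation) {
  int eff_kernel = dilation * (kernel - 1) + 1;  // dilated kernel extent
  return (in_size + 2 * pad - eff_kernel) / stride + 1;
}

int main() {
  // AlexNet conv1: 227x227 input, 11x11 kernel, stride 4, no padding.
  std::cout << conv_out_size(227, 11, 0, 4, 1) << std::endl;  // prints 55
  return 0;
}
```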

I hope that helps.

naibaf7 commented 8 years ago

@dagamayank Have you made any progress on this or is something too complicated?

dagamayank commented 8 years ago

@naibaf7 I did not get a chance to work on it yet. I'm putting out some internal fires right now, but I will get to it soon. Auto-generated kernels are not the simplest to understand :)

naibaf7 commented 8 years ago

@dagamayank I understand. I will work on the project this weekend and hopefully have some improvements by Monday. One interesting thing I found is that I'm better off targeting TLP instead of ILP on the AMD W9100, i.e. taking care not to use too many VGPRs on the AMD card (to get >= 4 waves in flight). On the nVidia card (GTX 980) it was better to push for high ILP (use more #pragma unroll) and relax on occupancy/TLP. I'd be interested in your opinion on this, and whether these assumptions are right...

Using vectors of size 4 and 16x16 thread blocks (64x64xTSK shared-memory tiling) seems to work best on both cards so far, though.
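To make the numbers above concrete, here is a sketch of the kind of tiling defines this corresponds to in a generated kernel string (illustrative values only — the real defaults live in libdnn.cpp and may differ):

```cpp
#include <iostream>

int main() {
  // Sketch of the tiling-related defines one would expect near the top of a
  // generated kernel string: a 16x16 work-group where each work-item computes
  // a 4x4 register tile (WPTM x WPTN), giving a 64x64 output tile that is
  // staged through local memory in slices of depth TSK, loaded as float4.
  const char* tile_defs = R"(
#define TSM 64    // tile size in M (output feature maps)
#define TSN 64    // tile size in N (spatial output positions)
#define TSK 8     // tile depth in K (input fmaps x kernel window)
#define WPTM 4    // work per thread in M
#define WPTN 4    // work per thread in N
#define VWM 4     // vector width (float4 loads/stores) in M
#define VWN 4     // vector width (float4 loads/stores) in N
)";
  std::cout << tile_defs;
  return 0;
}
```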

dagamayank commented 8 years ago

@naibaf7 In my experience, using fewer registers is generally the better choice on AMD GPUs. It improves occupancy and lets the compiler generate better code.

One question I had: do I have to run the entire AlexNet, or can I just run the first convolution layer using CIFAR-10? What kind of performance are you seeing right now?

naibaf7 commented 8 years ago

@dagamayank You can remove the layers after the first convolution in the prototxt file, or start with any other convolution as long as you have the input data defined and connected correctly. However, the first convolution is usually not the most interesting, as it has only a few input feature maps. Performance-wise, on the AlexNet forward pass I see the following numbers at batch size 64 (these are all untuned, in the default configuration, so there should be plenty of headroom):

The clBLAS forward performance in particular is extremely poor, which was my main motivation to create LibDNN. At this stage, LibDNN beats cuBLAS-based implementations. The goal is to get within 70-80% of cuDNN.

naibaf7 commented 8 years ago

@dagamayank LibDNN is now available as a standalone library: https://github.com/naibaf7/libdnn

zazd commented 8 years ago

@naibaf7 I am very interested in LibDNN. It performs well. Since I am not familiar with OpenCL, I have only glanced over LibDNN; it seems that it also uses matrix multiplication. If possible, could you tell me whether it works on the same principle as cuDNN, or point me to references such as a paper or documentation? Thank you.

naibaf7 commented 8 years ago

@zazd Yes, it uses a local-memory and register-level GEMM. It is similar to cuDNN; you can read more here: https://arxiv.org/pdf/1410.0759.pdf
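For a quick mental model of the GEMM formulation (independent of LibDNN's actual fused kernels, which compute the equivalent GEMM in local-memory/register tiles without materializing the full im2col buffer), a naive sketch:

```cpp
#include <iostream>
#include <vector>

// Naive im2col + GEMM view of a 2D convolution (single image, no padding,
// stride 1). Only to illustrate the principle; LibDNN/cuDNN tile the same
// GEMM through local memory and registers instead of building `col` in full.
void conv_as_gemm(const std::vector<float>& input,   // C x H x W
                  const std::vector<float>& weights, // M x C x K x K
                  std::vector<float>& output,        // M x OH x OW
                  int C, int H, int W, int M, int K) {
  const int OH = H - K + 1, OW = W - K + 1;
  // im2col: every output position becomes one column of height C*K*K.
  std::vector<float> col(static_cast<size_t>(C) * K * K * OH * OW);
  for (int c = 0; c < C; ++c)
    for (int ky = 0; ky < K; ++ky)
      for (int kx = 0; kx < K; ++kx)
        for (int oy = 0; oy < OH; ++oy)
          for (int ox = 0; ox < OW; ++ox)
            col[(((c * K + ky) * K + kx) * OH + oy) * OW + ox] =
                input[(c * H + oy + ky) * W + ox + kx];
  // GEMM: (M x C*K*K) * (C*K*K x OH*OW) -> (M x OH*OW).
  output.assign(static_cast<size_t>(M) * OH * OW, 0.0f);
  for (int m = 0; m < M; ++m)
    for (int k = 0; k < C * K * K; ++k)
      for (int n = 0; n < OH * OW; ++n)
        output[m * OH * OW + n] +=
            weights[m * C * K * K + k] * col[k * OH * OW + n];
}

int main() {
  // 1 input channel, 4x4 image, 1 output map, 3x3 kernel of ones -> 2x2 output.
  std::vector<float> input(16, 1.0f), weights(9, 1.0f), output;
  conv_as_gemm(input, weights, output, 1, 4, 4, 1, 3);
  std::cout << output[0] << std::endl;  // prints 9
  return 0;
}
```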

naibaf7 commented 7 years ago

@bhack @gstoner Good news for the RX 480: the performance issues and thermal-envelope crashes have been completely fixed since the Linux 4.8 AMDGPU drivers. It is now possible to use the RX 480 for deep learning without limitations on any Linux system :)

With LibDNN on both the GTX 1080 and the RX 480, the RX 480 performs exactly half as fast as the GTX 1080, just as expected.

bhack commented 7 years ago

Do you have v2 kernels?

naibaf7 commented 7 years ago

@bhack For the external library I have not ported them yet... I'm quite busy with a new project at the moment regarding sparse RNNs :) Let me know if you need something, though. This was just a heads-up, because the RX 480 did not work well at all for the past 3 months.

bhack commented 7 years ago

@naibaf7 It is hard to talk about this topic... We are actually the only ones using libdnn as upstream :wink:. It would be nice if Caffe could use libdnn as upstream naturally, instead of having libdnn downstream. /cc @edgarriba

naibaf7 commented 7 years ago

@bhack Yeah, last week Codeplay's CEO contacted me regarding some OpenCL TensorFlow work. If he expresses interest as well, I will definitely re-focus on the standalone libdnn. But I haven't heard back (yet).

bhack commented 7 years ago

I think @hughperkins could also be interested in the standalone upstream.

dagamayank commented 7 years ago

@naibaf7 do you have Winograd kernels in libDNN?

naibaf7 commented 7 years ago

@dagamayank No not yet...

bhack commented 7 years ago

It could be interesting if @dicecco1 would contribute upstream to the standalone libdnn.

dicecco1 commented 7 years ago

I'd be interested in being involved in this, though the way OpenCL is used with FPGAs has some differences/conflicts with the way Greentea is currently set up.

Currently, kernel compile times are on the order of hours for FPGA implementations, so they use offline compilation and program the FPGA with the binary (this still takes on the order of 300-400 ms), which means there has to be little or no reprogramming between kernels.
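For readers less familiar with that flow: the host-side part of offline compilation goes through the standard OpenCL binary path, roughly like this (error handling trimmed, file name made up):

```cpp
#include <CL/cl.h>
#include <fstream>
#include <iterator>
#include <vector>

// Sketch of the offline-compilation path described above: instead of
// clCreateProgramWithSource + clBuildProgram (hours on an FPGA), the host
// loads a precompiled binary/bitstream and hands it to the runtime, which
// (re)programs the device -- the 300-400 ms step mentioned above.
cl_program load_offline_binary(cl_context ctx, cl_device_id dev,
                               const char* path) {
  std::ifstream f(path, std::ios::binary);
  std::vector<unsigned char> bin((std::istreambuf_iterator<char>(f)),
                                 std::istreambuf_iterator<char>());
  const unsigned char* bin_ptr = bin.data();
  size_t bin_size = bin.size();
  cl_int err = CL_SUCCESS;
  cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &bin_size,
                                              &bin_ptr, nullptr, &err);
  clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);  // links, no recompile
  return prog;
}
```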

bhack commented 7 years ago

So it is practically impossible to have an autotuning approach like libdnn's, right?

edgarriba commented 7 years ago

Apart from that, I think it's quite straightforward to provide a couple of interfaces for offline building and for importing built binaries. Is that right, @naibaf7?

dicecco1 commented 7 years ago

Yeah, essentially for FPGA implementations you need to decide more on an architecture (since on FPGAs you're configuring circuits rather than processing instructions), and it is usually best to have something that is either general (e.g. can handle different sizes/strides) or very specific to a model (e.g. tuned for very high performance on AlexNet). Autotuning for different layers would fit more into the model-specific approach to FPGA implementations, but this would still be offline.

bhack commented 7 years ago

@dicecco1 I have not checked your paper in detail, but could your Winograd kernel also be ported to GPU/CPU, or would it need to be heavily re-engineered?

dicecco1 commented 7 years ago

The Winograd kernel would need to be heavily re-engineered for CPU/GPU implementations.

bhack commented 7 years ago

I don't know whether @keryell is also interested in @dicecco1's kernels.

bhack commented 7 years ago

For everyone in the thread: I'm talking about https://github.com/dicecco1/fpga_caffe

naibaf7 commented 7 years ago

There certainly are ways to either cache or tune the kernels on a surrogate platform. The key here would be to know the FPGA's details and make educated guesses about the performance instead of tuning directly on the FPGA.

naibaf7 commented 7 years ago

@bhack @dicecco1 The issue of having to massively re-engineer Winograd kernels to fit new platforms has been noticed by the developers of Neon/Nervana as well as by @hughperkins. There are good reasons Nervana has built specific compilers for Maxwell/Pascal. The architectural differences are even bigger when going to AMD: VGPR usage has to be kept in check, and the constant buffers/local memory have to be optimized differently. Local memory is bigger on Maxwell/Pascal than on Polaris/Hawaii, and the cache system works completely differently (AMD has 64 KB constant buffers, nVidia uses a read-through/write-through configurable caching system).

bhack commented 7 years ago

@naibaf7 Can you notify us if you get feedback from others interested in having the v3 kernels and the standalone libdnn as upstream?

naibaf7 commented 7 years ago

@bhack Yes. Still waiting on feedback here :)

hughperkins commented 7 years ago

Observation: I'm still waiting on an example of calling libdnn from C++ :-)

bhack commented 7 years ago

You can see an example (with the tuning call commented out) at https://github.com/tiny-dnn/tiny-dnn/blob/master/tiny_dnn/core/kernels/conv2d_op_libdnn.h

bhack commented 7 years ago

@naibaf7 OK, please give us an update when you can, because the standalone version is pretty much on hold.

naibaf7 commented 7 years ago

@bhack Yes, unfortunately it is, since I'm working hard on my semester project (sparse repeated-pattern recurrent neural networks); my university does not give me credit for my work on Caffe :) The current timeline is as follows:

naibaf7 commented 7 years ago

Status update: the non-atomic backward kernels for pooling are finished, and the library is unit-tested and verified with the style-transfer and MNIST examples. Next step: standalone LibDNN update by the end of December (at the latest).

bhack commented 7 years ago

At the latest? Is that the end of the project?