Hi @thornhale
Thanks for opening the issue. We are aware of that. There are two causes of host-to-device memory transfers.
The first occurs when graph node locality is fragmented ( some of the nodes are registered for SYCL / GPU but the neighbouring nodes are not ).
For instance, let's take a look at a very simple graph that one might create:
A = B * C + D
The above translates to a graph that might look like this:
                  ( Assign Op ) =
                   /          \
        A ( Tensor )        ( Cwise Add Op ) +
                             /            \
                   D ( Tensor )        ( Cwise Mul Op ) *
                                        /           \
                              B ( Tensor )       C ( Tensor )
TensorFlow will allocate and populate all the tensors for that device. For a GPU ( CUDA and SYCL ) there will still be a host-memory tensor too.
The heavy calculation of B * C + D will be offloaded to the registered accelerator. Unavoidable data movement to that accelerator at the beginning and back from it at the end will occur.
There should be no further memory transfers, since all the Ops are registered for that accelerator.
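For reference, here is a rough standalone sketch ( illustrative only, not the actual TensorFlow kernels ) of how such elementwise math is expressed through Eigen's Tensor module, which is what TensorFlow dispatches to; swapping the DefaultDevice for a GPU/SYCL device runs the same expression on the accelerator:

#include <iostream>
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  // Hypothetical standalone example: the elementwise math behind A = B * C + D.
  Eigen::Tensor<float, 1> b(4), c(4), d(4), a(4);
  b.setConstant(2.0f);
  c.setConstant(3.0f);
  d.setConstant(1.0f);

  Eigen::DefaultDevice dev;     // a GPU/SYCL device would be used here instead
  a.device(dev) = b * c + d;    // evaluated as one fused elementwise expression

  std::cout << a(0) << std::endl;  // prints 7
  return 0;
}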
There would be additional memory transfers, however, if one of the node implementations was not registered for the accelerator.
For instance, if the Add Op were registered for the CPU/host and the Mul Op for the GPU/accelerator, the data would have to be copied to the accelerator for the Mul Op, sent back to the host for the Add Op, and then sent back to the accelerator again so that the Assign Op could be performed there.
That occurs due to the simple fact that we have not yet implemented all the features used in modern models (we are getting there...). For instance @ville-k is working on the Pooling operation and I am working on Conv. Both of these are heavily used in most of the models that we have come across.
The nodes used will greatly depend on the model; for instance, I saw the L2Loss Op in ResNet and at present this operation is not implemented.
To get a better view of what can be improved / is missing in your model you can run:
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
And analyse the output - SYCL (OpenCL) registered nodes are clearly indicated.
The output will look something like this:
Sub: (Sub): /job:localhost/replica:0/task:0/device:SYCL:0
Mul: (Mul): /job:localhost/replica:0/task:0/device:SYCL:0
conv/Conv2D: (Conv2D): /job:localhost/replica:0/task:0/cpu:0
conv/batchnorm: (BatchNormWithGlobalNormalization): /job:localhost/replica:0/task:0/device:SYCL:0
We are using a Google spreadsheet to track the progress of the implementation (we will be updating it, as we have recently implemented a bunch of things): https://docs.google.com/spreadsheets/d/1YbHn7dAFPPG_PgTtgCJlWhMGorUPYsF681TsZ4Y4LP0/edit#gid=1719702219
As well as this, there is an active pull request ( https://github.com/tensorflow/tensorflow/pull/9117 ) that improves the implementation a lot and that we hope will represent the initial baseline.
The second performance hit we are taking is quite tricky. It has to do with the way that OpenCL deals with the data on the accelerator, which is different from the way CUDA deals with the same data.
The key difference between OpenCL and CUDA is in the way that the accelerator data is accessed:
CUDA can allocate memory on the GPU, obtain a pointer to the allocated device memory, and pass it to the Eigen evaluator, which will handle it correctly thanks to C++ template specialization magic.
You can see that in the Constant Op implementation: https://github.com/lukeiwanski/tensorflow/blob/master/tensorflow/core/kernels/constant_op_gpu.cu.cc#L74
The Const Op assigns a constant value to the tensor - the equivalent of A = 1.2f.
If you follow the code linked above, you will see that nullaryExpr is used. That nullary expression is passed to the Eigen evaluator with a custom functor defined. The functor holds a device-valid pointer, and that pointer will be de-referenced only on the CUDA GPU ( where it will be valid at execution time ).
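To make the idea concrete, here is a rough, simplified sketch of that pattern ( not the actual TensorFlow code; the functor and names are made up, and the DefaultDevice stands in for the CUDA GpuDevice so the snippet stays plain C++ ). The functor captures a pointer and Eigen only calls operator() while evaluating the expression on the target device, so with a GPU device the pointer only needs to be valid there:

#include <iostream>
#include <unsupported/Eigen/CXX11/Tensor>

// Hypothetical functor: carries a pointer that is only de-referenced while the
// Eigen expression is being evaluated on the device.
template <typename T>
struct scalar_from_ptr_op {
  const T* val;  // with a GpuDevice this would point into GPU memory
  explicit scalar_from_ptr_op(const T* v) : val(v) {}
  EIGEN_DEVICE_FUNC T operator()(Eigen::Index) const { return *val; }
};

int main() {
  float value = 1.2f;  // the "A = 1.2f" constant from the example above
  Eigen::Tensor<float, 1> out(8);
  Eigen::DefaultDevice dev;  // GpuDevice in the CUDA build
  out.device(dev) = out.nullaryExpr(scalar_from_ptr_op<float>(&value));
  std::cout << out(0) << std::endl;  // prints 1.2
  return 0;
}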
Due to the way that OpenCL 1.2 handles memory we cannot do that: we have to use accessors, on which we cannot perform pointer arithmetic. In OpenCL you also need to specify where the data is located on the GPU ( Constant, Private, Local or Global memory ).
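For comparison, here is a minimal SYCL 1.2 style sketch ( illustrative only, not TensorFlow/Eigen code; the kernel name and sizes are made up ) showing that device data sits behind a buffer and is only reachable inside a kernel through an accessor requested in a command group, rather than through a raw device pointer:

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<float> host(8, 0.0f);
  {
    cl::sycl::queue q;  // default device selection
    cl::sycl::buffer<float, 1> buf(host.data(), cl::sycl::range<1>(host.size()));
    q.submit([&](cl::sycl::handler& cgh) {
      // The kernel captures an accessor, not a pointer into device memory.
      auto out = buf.get_access<cl::sycl::access::mode::write>(cgh);
      cgh.parallel_for<class fill_kernel>(
          cl::sycl::range<1>(host.size()),
          [=](cl::sycl::id<1> i) { out[i] = 1.2f; });
    });
  }  // the buffer going out of scope copies the result back to `host`
  std::cout << host[0] << std::endl;  // prints 1.2
  return 0;
}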
We are aware of both issues and we are actively working on solutions, because they can cause a performance impact.
If anyone is keen on helping with the first point I mentioned, feature completion for the models is what to focus on, and the work required to achieve a feature-complete state can be easily defined and divided ( see the spreadsheet mentioned earlier ). The second issue requires some deeper digging and agreement/compromise between the OpenCL/SYCL standard and the TensorFlow design/implementation.
I hope that helps. Please let me know if you would like to help in any way, I am more than happy to help get you set up to work on the project.
Actually, a deeper understanding of neural networks and deep learning can be obtained by implementing the code yourself. I would be interested in reimplementing some functions in OpenCL 1.2. Please let me know:
On the pointer issue - based on my superficial understanding of OpenCL 2, is this not something it can do? Data can be passed by passing pointers from device to device, and this includes both simple data types and more complex arrangements like linked lists.
But perhaps porting code from 1.2 to 2.2 is more feasible once the feature set is actually complete. On support of tensorflow-opencl, my thought is:
If you are interested in re-implementing functionality in plain OpenCL 1.2, I would suggest looking at the Eigen code first. TensorFlow is closely integrated with it and most of the heavy math goes through there (https://bitbucket.org/eigen/eigen/). The TensorFlow-related code is in the eigen/unsupported/Eigen/CXX11/src/Tensor folder.
It is worth mentioning that Eigen is ported to CUDA, which is a slightly higher-level programming model than plain OpenCL ( it is more or less at the same level as SYCL; that is why we are using it ).
I am not even sure how one would approach implementing plain OpenCL 1.2 as an Eigen back-end. The task seems like it would be quite complicated. Eigen uses C++11 a lot and OpenCL kernels are C, so there is a problem there to start with.
But perhaps you could write a series of OpenCL C kernels and then call them from either an Eigen evaluator or a TensorFlow operation ( somehow ). CUDA and SYCL are single-source programming models that allow a subset of C++ in the kernel code. That makes porting easier and quicker. I am not saying that the plain OpenCL approach is impossible, but it will take a lot of effort and time. If you are willing to try, I am more than happy to provide some guidance.
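Just to illustrate what that route looks like in practice, here is a rough standalone sketch ( purely hypothetical, not wired into TensorFlow or Eigen ) of a hand-written OpenCL C elementwise kernel driven from C++ host code through the OpenCL 1.2 C API ( error handling and resource cleanup omitted for brevity ):

#include <CL/cl.h>
#include <cstdio>

// The kernel itself is OpenCL C, not C++: one work-item per element.
static const char* kSource =
    "__kernel void cwise_add(__global const float* a,\n"
    "                        __global const float* b,\n"
    "                        __global float* out) {\n"
    "  size_t i = get_global_id(0);\n"
    "  out[i] = a[i] + b[i];\n"
    "}\n";

int main() {
  enum { N = 8 };
  float a[N], b[N], out[N];
  for (int i = 0; i < N; ++i) { a[i] = float(i); b[i] = 2.0f * i; }

  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

  // Compile the kernel source at runtime.
  cl_program program = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
  clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
  cl_kernel kernel = clCreateKernel(program, "cwise_add", nullptr);

  // cl_mem buffers are the only handle to device memory - no raw device pointers.
  cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, nullptr);
  cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, nullptr);
  cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), nullptr, nullptr);

  clSetKernelArg(kernel, 0, sizeof(cl_mem), &da);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &db);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &dout);

  size_t global = N;
  clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
  clEnqueueReadBuffer(queue, dout, CL_TRUE, 0, sizeof(out), out, 0, nullptr, nullptr);

  for (int i = 0; i < N; ++i) std::printf("%g ", out[i]);  // 0 3 6 9 ...
  std::printf("\n");
  return 0;
}

The hard part, as noted above, would be getting the Eigen evaluators or TensorFlow ops to dispatch to kernels like this, rather than writing the standalone host code.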
As for OpenCL 2.2 - true, it is more feature-complete and some things are easier, but it limits the number of devices that can be targeted; our focus is on mobile and embedded platforms and I cannot see OpenCL 2.2 arriving there anytime soon. Other than that, OpenCL 2.2 allows C++ in the kernel code, but not in the kernel signature, and that is problematic.
To get started I would suggest following these instructions: http://deep-beta.co.uk/setting-up-tensorflow-with-opencl-using-sycl/ - For plain OpenCL you might want to skip the SYCL related steps.
As for OpenCL 1.2 tutorials - there are plenty of materials online. Here is a link to the official Khronos spec ( https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf ). Book-wise, I liked this one: https://www.amazon.co.uk/Heterogeneous-Computing-OpenCL-Revised-1-2-ebook/dp/B00AKFSM14/ref=sr_1_1?ie=UTF8&qid=1493630267&sr=8-1&keywords=opencl+1.2+programming.
As for the "simple" math operations to start with, I would suggest at looking into cwise operations in the tensorflow/core/kernels folder.
I am looking at Conv2d just now and I will keep you updated on the status. I am planning on updating the implementation list properly once the current OpenCL improvements pull request is finished.
I may have misunderstood. What I meant to ask is: are there gaps in the current tensorflow-opencl implementation, and if so, what are they? I would like to see how I can contribute toward that goal, but I am not sure what the best way forward is.
I am working on creating a project here. It will contain small tasks that need to be fixed ( failing tests ) so that people interested in contributing can jump in and start helping. I should have a fair bit done by tomorrow.
Will that be helpful?
@thornhale just a small update:
We are still working on performance improvements for AMD and mobile targets, with a focus on models like ResNet50 and VGG.
Every commit upstream brings new features that we need to cover, but at this stage you should be able to run most of the commonly used networks.
If something does not work, let us know.
In response to discussions here: https://github.com/benoitsteiner/tensorflow-opencl/issues/65#issuecomment-297412974 I am posting this issue.
Unnecessary copying of data from RAM to VRAM can reduce performance - especially in a bandwidth-limited system. I am opening this issue as a discussion point on what can be done to reduce it. I have absolutely no idea what can be done, but if there are clear points, perhaps people can join in (including myself, though I am mostly a Python data scientist).