hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0
791 stars 90 forks

coriander failed to generate opencl #93

Closed pint1022 closed 5 years ago

pint1022 commented 5 years ago

I have been working on the cuDNN implementation in tf-coriander. The code has been modified so that it compiles successfully. In testing, there are two cases where coriander fails to generate OpenCL:

  1. In conv_ops_gpu_3.cu.cc, the function 'SwapDimension1And2InTensor3UsingTiles' is not generated correctly. The failed kernel (failedkernel.cl) is attached. My guess is that coriander doesn't handle the kernel being embedded in another kernel.

  2. The above error occurs in the call to 'functor::NCHWToNHWC<GPUDevice, T, 4>()' at conv_ops.cc:585. After commenting that line out, tf-coriander runs into a 2nd error, as follows: basicblockdumper.runGeneration got exception whilst processing: %39 = tail call i32 @_Z3minii(i32 %38, i32 %2) #6

generateOpenCL failed to generate opencl sourcecode kernel name orig=_ZN10tensorflow12_GLOBAL__N_118MaxPoolForwardNHWCIfEEviPKT_iiiiiiiiiiiPS2_Px kernel name short=_ZN10tensorflow12_GL kernel name unique=_ZN10tensorflow12_GLOBAL__N_118MaxPoolForwardNHWCIfEEviPKT_iiiiiiiiiiiPS2_Px_1_1_2

This kernel is 'void tensorflow::(anonymous namespace)::MaxPoolForwardNHWC(int, float const*, int, int, int, int, int, int, int, int, int, int, int, float*, long long*)'.
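For reference, '_Z3minii' is just the Itanium-mangled name of 'min(int, int)'. A possible workaround might be to emit an equivalent helper into the generated OpenCL source; a minimal sketch (hypothetical and untested, not something coriander currently does):

```c
/* Hypothetical shim, untested: _Z3minii is the Itanium-mangled name of
   min(int, int), so provide a matching definition in the .cl source. */
inline int _Z3minii(int a, int b) {
    return a < b ? a : b;
}
```

OpenCL C also has a built-in integer min(), so rewriting the call site to use that directly would be another option.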

At this point, coriander does not seem easy to integrate with Tensorflow to support CUDA-to-OpenCL translation. These kernels would have to be created manually in order to run the Tensorflow dnn/rnn ops on an OpenCL device.

Please let me know if anything has been overlooked in the above analysis.

easycl-failedkernel.cl.txt

hughperkins commented 5 years ago

> These kernels would have to be created manually in order to run the Tensorflow dnn/rnn ops on an OpenCL device.

I would think that the correct approach would be to modify coriander so that it can handle the compilation issues you encountered. I am not aware of any theoretical limitation that would prevent this.


pint1022 commented 5 years ago

I have solved the 2nd issue, the '_Z3minii' one.

The 1st issue is harder. The kernel uses shared memory; the code looks like this:

```cpp
template <typename T, int TileSize>
__global__ void SwapDimension1And2InTensor3UsingTiles(const T* input,
                                                      Dimension<3> input_dims,
                                                      T* output) {
  // One extra line in the inner dimension to avoid shared memory bank conflict.
  __shared__ T shared_memory_tile[TileSize][TileSize + 1];
  ...
}
```

The shared variable is translated into 'tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<float, 32>(float const*, tensorflow::functor::Dimension<3>, float*)::shared_memory_tile'. My guess is that clang doesn't parse it correctly. Do you have any insight into this?

hughperkins commented 5 years ago

__shared__ is denoted by the address space. The address spaces get propagated through the things they touch. Unfortunately, this is a little tricky, and suffers from the 'halting problem' in a sense, since you can't really tell how an address space will propagate without working your way through all of the instructions.

It sort of works, with a few heuristics/hacks, for 99% of the code I ran. I'm not sure whether your code falls into that 99%. For the other 1%, there would need to be a fairly large rewrite of how address-space propagation works. You might not be encountering code in that 1%, but here is some documentation I wrote about how to handle it, just in case you are: https://github.com/hughperkins/coriander/blob/multiple-walks/doc/walking.md
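To make 'propagated through the things they touch' concrete, here is an illustrative sketch (not actual coriander output): a CUDA helper like `__device__ float load(const float *p)` carries no address space in its signature, but in OpenCL 1.2 the address space is part of the pointer type, so a translator has to emit one variant per space that callers actually use:

```c
/* Illustrative OpenCL 1.2 sketch, not actual coriander output. */
inline float load_global(global const float *p) { return *p; }
inline float load_local(local const float *p) { return *p; }

kernel void example(global const float *in, local float *scratch,
                    global float *out) {
    /* The same source-level helper, reached through two address spaces. */
    scratch[get_local_id(0)] = load_global(&in[get_global_id(0)]);
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = load_local(&scratch[get_local_id(0)]);
}
```

Deciding which variant each call site needs is exactly the propagation problem: it depends on where the pointer argument originally came from.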

Part of me thinks that getting this working might be worthy of a paper, part of me assumes that CUDA must have already solved this issue. Unclear.

hughperkins commented 5 years ago

(Oh right, so I remember now: address space propagation itself is OK. It's pointers to an address space, and pointers to pointers to an address space, that get a bit hairy. But since you have an array of arrays, that sounds like a pointer to a pointer, so you might potentially be hitting this issue (I haven't looked closely enough at your code samples to check).)

(Edit: the reason pointers to pointers get hairy is that the pointers themselves have to be stored somewhere. That is, the pointers have an address space of their own. So you can have a local pointer to the global address space, etc. Coriander kind of hackily makes some assumptions about the types of such pointers, rather than propagating the address space of each level of indirection independently.)
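A minimal sketch of the two-level case described above (illustrative OpenCL, not coriander output):

```c
/* Illustrative only: with pointers to pointers, every level of
   indirection has its own address space. */
kernel void levels(global float *buf) {
    /* p itself lives in private memory (the default for locals) and
       points into global memory: one space for the pointee, one for p. */
    global float *p = buf;

    /* pp lives in private memory and points at a private pointer, which
       in turn points into global memory: three spaces in total. */
    global float * private * pp = &p;

    /* To emit a correct OpenCL type, a translator must recover every
       level; guessing any one wrong changes the meaning of the access. */
    (*pp)[0] = 1.0f;
}
```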

pint1022 commented 5 years ago

I have a clue about the issue. The generated .cl file contains the line below:

```c
local float[33] _ZZN10tensorflow7functor37SwapDimension1And2InTensor3UsingTilesIfLi32EEEvPKT_NS0_9DimensionILi3EEEPS2_E18shared_memory_tile[32];
```

Instead, it should be:

```c
local float _ZZN10tensorflow7functor37SwapDimension1And2InTensor3UsingTilesIfLi32EEEvPKT_NS0_9DimensionILi3EEEPS2_E18shared_memory_tile[32][33];
```

The corresponding line in the .ll file is:

```llvm
@_ZZN10tensorflow7functor37SwapDimension1And2InTensor3UsingTilesIfLi32EEEvPKT_NS0_9DimensionILi3EEEPS2_E18shared_memory_tile = linkonce_odr local_unnamed_addr addrspace(3) global [32 x [33 x float]] zeroinitializer, comdat, align 4
```

I am not sure which one is wrong. The '[33 x float]' in the .ll file doesn't seem right to me, and may be confusing ir-to-opencl.

hughperkins commented 5 years ago

I don't know, but I've never seen a serious error in the LLVM bytecode file, whereas I have seen many bugs in the OpenCL generated by coriander :). Perhaps a first step could be to check the LLVM docs, to convince yourself whether the LLVM code is or isn't valid?
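One concrete way to do that check (a sketch of mine, not an existing coriander tool): feed the .ll file to LLVM's own parser and verifier, which reject structurally invalid IR:

```cpp
// Sketch: validate a textual .ll file with LLVM's parser and verifier.
// Assumes an LLVM version compatible with the one coriander links against.
#include <memory>
#include <string>
#include "llvm/AsmParser/Parser.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"

bool isValidIR(const std::string &path) {
    llvm::LLVMContext ctx;
    llvm::SMDiagnostic err;
    std::unique_ptr<llvm::Module> mod = llvm::parseAssemblyFile(path, err, ctx);
    if (!mod) {
        err.print("check-ir", llvm::errs());  // syntax error in the .ll file
        return false;
    }
    // verifyModule() returns true when the module is broken.
    return !llvm::verifyModule(*mod, &llvm::errs());
}
```

For what it's worth, '[32 x [33 x float]]' is the normal LLVM representation of the C array 'float x[32][33]', outer dimension first; if the verifier accepts the module, suspicion shifts to the .cl emission.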


pint1022 commented 5 years ago

There is a bug in how CLWriter.cpp handles array types. I have just fixed the issue.
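The shape of the fix is roughly as follows (a sketch with invented names, not the actual CLWriter.cpp code): when emitting a declaration for a global whose LLVM type is a nested array such as [32 x [33 x float]], all array dimensions have to be peeled off and appended as trailing [N] suffixes, with only the innermost element type printed in front:

```cpp
// Sketch only: invented names, not the actual CLWriter.cpp API.
#include <string>
#include "llvm/IR/DerivedTypes.h"

// Assumed helper that renders a scalar LLVM type as OpenCL C ("float", ...).
std::string typeName(llvm::Type *ty);

std::string writeArrayDecl(llvm::Type *ty, const std::string &name) {
    std::string dims;
    // Peel off every nested llvm::ArrayType level, collecting [N] suffixes,
    // so [32 x [33 x float]] becomes "float name[32][33]" rather than the
    // broken "float[33] name[32]".
    while (auto *arr = llvm::dyn_cast<llvm::ArrayType>(ty)) {
        dims += "[" + std::to_string(arr->getNumElements()) + "]";
        ty = arr->getElementType();
    }
    return typeName(ty) + " " + name + dims;
}
```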

hughperkins commented 5 years ago

Wooo, cool :) Pull request?

pint1022 commented 5 years ago

Sure, how do I do that?
