hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0

[Question] Why did you choose this approach? #62

Open iame6162013 opened 7 years ago

iame6162013 commented 7 years ago

Firstly, I want to say that all the work you've done is awesome and that I'm dying to get this to work on Arch. :)

But why did you choose to port tensorflow to use coriander instead of trying to make coriander into a cuda compliant library?

hughperkins commented 7 years ago

I'm not sure I understand really

  • what do you mean by "cuda compliant library"?
  • in what sense do you feel that Coriander is not what you call a "cuda compliant library"?
  • what would you do differently, if you were to start from scratch?

Dexdev08 commented 7 years ago

I think the question is this: why is tf-coriander implementing tensorflow instead of cuda?


hughperkins commented 7 years ago

Do you mean, why do we have to modify tensorflow at all, in order to run it on Coriander?

iame6162013 commented 7 years ago
what do you mean by "cuda compliant library"?
in what sense do you feel that Coriander is not what you call a "cuda compliant library"?
what would you do differently, if you were to start from scratch?

Do you mean, why do we have to modify tensorflow at all, in order to run it on Coriander?

Yes, why do we have to modify tensorflow? Won't this be, or isn't this already, a huge pain?

(I do apologize if I'm misunderstanding the project )

hughperkins commented 7 years ago

Ah. Yes. Not having to modify Tensorflow at all would be better. Harder though :) . What are your thoughts on how we can move more towards this direction?

ghost commented 7 years ago

I think the question here is: Why make a transpiler from CUDA to OpenCL rather than an OpenCL library that implements the CUDA/cuDNN API to be linked against by programs expecting CUDA?

I haven't wondered this about Coriander, but I have wondered why nobody seems to have implemented something like this yet.

HIP kinda does this, but the API function names have been changed to hip_whatever rather than cu_whatever. Possibly this was due to the atrocity that was the ruling in Oracle v. Google, that APIs can be copyrighted. Or, possibly it was just pride.

It's possible that a shim between CUDA and HIP would be a small project, for someone familiar with both? That was my first thought when looking at HIP: https://rocm.github.io/documentation.html
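
To make the shim idea concrete, here is a hypothetical, untested sketch (not taken from any of the projects mentioned here) of what the first couple of entry points might look like, implemented on top of HIP:

#include <hip/hip_runtime.h>

// Hypothetical sketch of the 'shim' idea: re-expose a couple of CUDA runtime
// entry points as thin wrappers over their HIP equivalents, so that code
// written against cudaMalloc/cudaFree could link against a HIP-backed library.
// Only two functions are shown; error-code mapping, streams, events etc would
// be the real work.
using cudaError_t = hipError_t;

extern "C" cudaError_t cudaMalloc(void **ptr, size_t size) {
    return hipMalloc(ptr, size);
}

extern "C" cudaError_t cudaFree(void *ptr) {
    return hipFree(ptr);
}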


hughperkins commented 7 years ago

I think the question here is: Why make a transpiler from CUDA to OpenCL rather than an OpenCL library that implements the CUDA/cuDNN API to be linked against by programs expecting CUDA?

Ok. You're saying 'why make something that can compile NVIDIA® CUDA™ code into OpenCL, rather than just implementing an API, so we can run the NVIDIA® CUDA™ code directly, without recompiling it?'

Taking a smallish step back, there are two parts to an NVIDIA® CUDA™ program:

  • the host-side code, which runs on the CPU and calls into the NVIDIA® CUDA™ API (cudaMalloc, cudaMemcpy, kernel launches, and so on)
  • the GPU kernels themselves, which nvcc compiles into PTX/SASS, to run on NVIDIA® hardware

For these kernels, the question is, how to get them to run on a non-CUDA, non-PTX, non-SASS GPU?
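
To make the two parts concrete, here is a tiny made-up NVIDIA® CUDA™ program; the add_one kernel and the buffer size are purely illustrative:

#include <cuda_runtime.h>
#include <cstdio>

// Device-side part: a GPU kernel, which nvcc compiles into PTX/SASS.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

int main() {
    const int n = 1024;
    float *d_data = nullptr;

    // Host-side part: ordinary C++ that drives the GPU through the CUDA runtime API.
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // A kernel launch: the host asks the runtime to execute the device code.
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    printf("done\n");
    return 0;
}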

iame6162013 commented 7 years ago

For these kernels, the question is, how to get them to run on a non-CUDA, non-PTX, non-SASS GPU?

No, that would probably be extremely difficult, and as far as I know the host program doesn't care about what the kernels look like, just that they do what they're supposed to.

Fortunately AMD seems to be pushing for HIP support in tensorflow, but it doesn't seem to be going anywhere. Thus a shim between CUDA and HIP would be worthless for the time being.

hughperkins commented 7 years ago

No, that would probably be extremely difficult

It is difficult to get NVIDIA® CUDA™ GPU kernels to run on OpenCL, but that is what Coriander does. Coriander can run NVIDIA® CUDA™ GPU kernels on OpenCL 1.2 GPU hardware, with no modifications to the NVIDIA® CUDA™ source-code.

iame6162013 commented 7 years ago

I think we had a little misunderstanding there: I thought you meant after the default compilation, which Coriander doesn't do, I think.

hughperkins commented 7 years ago

I think there are at least 4 misunderstandings here :) . But since I'm one of the people misunderstanding the other people's potential misunderstandings, it is hard for me to clear up, or attempt to clear up, any potential misunderstandings :) . Or perhaps it is only me misunderstanding. Anyway :)

Let's go through:

For Carthal's statement, I think that the host-side API is only half the solution: you still need to somehow do something with the NVIDIA® CUDA™ GPU kernels, so they can run on OpenCL GPUs. Somehow, they need to be converted into something, eg SPIR, or OpenCL, or some proprietary language/bytecode, so they can run on non-CUDA GPUs. Coriander handles this :)
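
To illustrate what "converted into something, eg SPIR, or OpenCL" means for the kernel side: the little add_one kernel from the toy example above could be written by hand in OpenCL 1.2 roughly as below. This is only conceptual; Coriander's actual generated kernels look quite different.

// CUDA original (device side), for reference:
//   __global__ void add_one(float *data, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) data[i] += 1.0f;
//   }

// Hand-written OpenCL 1.2 counterpart of the same kernel.
__kernel void add_one(__global float *data, int n) {
    int i = get_global_id(0);   // replaces blockIdx.x * blockDim.x + threadIdx.x
    if (i < n) {
        data[i] += 1.0f;
    }
}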

To iame6162013's original question: why do we need to modify Tensorflow? Can't Coriander run without modifying Tensorflow?

To what extent do these three assertions clear up some of the questions above?

hughperkins commented 7 years ago

(As far as obstacles to binary-level compatibility go, a huge obvious one is enum values. For example, let's say the client source-code contains a constant enum usage like:

if (cudaMalloc(&devPtr, numBytes) == CUDA_OUT_OF_MEMORY) {
...
}

This means that Coriander needs to make sure that cudaMalloc returns the CUDA_OUT_OF_MEMORY value when allocation fails. But what is the value of CUDA_OUT_OF_MEMORY? The client source-code doesn't say. This leaves a couple of options that I can see:

  • look at the NVIDIA® CUDA™ Toolkit source-code. This seems undesirable, to avoid copyright issues etc
  • just pick our own values, in the Coriander headers, and recompile the client source-code against those headers. This is what Coriander does, eg:

https://github.com/hughperkins/coriander/blob/master/include/cocl/cocl_memory.h#L33-L38

enum cudaMemcpyKind {  // name used by thrust, in trivial_copy.inl, line 55 ish
    cudaMemcpyDeviceToHost=111,
    cudaMemcpyHostToDevice=222,
    cudaMemcpyDeviceToDevice=333,
    cudaMemcpyDefault=444  // from thrust, trivial_copy.inl
};

This means we need to at least recompile the client source-code, even if we don't need to actually modify it.)
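
To spell out why it is re-compilation, rather than just re-linking, that is needed: the enum's numeric value gets baked into the client binary at compile time. A tiny stand-alone sketch, where the nvidia_headers / coriander_headers namespaces are hypothetical stand-ins and the values are illustrative:

#include <cstdio>

// Hypothetical stand-ins for the two header families; the point is only that
// the numeric values they define can differ.
namespace nvidia_headers    { enum cudaMemcpyKind { cudaMemcpyDeviceToHost = 2   }; }
namespace coriander_headers { enum cudaMemcpyKind { cudaMemcpyDeviceToHost = 111 }; }

int main() {
    // Whichever headers the client was *compiled* against determine the number
    // baked into its binary; swapping in a different runtime library afterwards
    // cannot change an already-inlined constant. Hence: recompile, even if the
    // source itself is untouched.
    printf("value baked in when built against NVIDIA headers:    %d\n",
           (int)nvidia_headers::cudaMemcpyDeviceToHost);
    printf("value baked in when built against Coriander headers: %d\n",
           (int)coriander_headers::cudaMemcpyDeviceToHost);
    return 0;
}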

iame6162013 commented 7 years ago

So, if I were to create this host-side API, and compile the kernels with coriander, would I then be capable of ~running~ compiling & running tensorflow without any changes to their source code?

look at the NVIDIA® CUDA™ Toolkit source-code. This seems undesirable, to avoid copyright issues etc

Yes, a clean-room design is preferable.

This means we need to at least recompile the client source-code, even if we don't need to actually modify it.

That shouldn't be a problem for the use cases I care about. :)

hughperkins commented 7 years ago

So, if I were to create this host-side API, and compile the kernels with coriander, would I then be capable of ~running~ compiling & running tensorflow without any changes to their source code?

In fact, Coriander already implements much of the NVIDIA® CUDA™ API.

So, you may say, "what was actually changed in Tensorflow, in the tf-coriander repository?". Good question! Mostly it comes down to:

"But!", you say, "Whats all that nonsense in tensorflow/stream_executor/cl?". Well, yes, arguably you could remove that, and simply use the CUDA stream_executor directly. I've no idea how hard/easy that is. Both options seem to have their good/bad points. The current option, of creating a new cl stream_executor has the rather bad downside of making upgrading to new tensorflow versions look scary. On the other hand, since those files wont be modified by tensorflow upstream, they should be invariant across upgrade, approximately? Unclear.