[Question] Why did you choose this approach?

iame6162013 commented 7 years ago

Firstly, I want to say that all the work you've done is awesome and that I'm dying to get this to work on arch. :)

But why did you choose to port tensorflow to use coriander instead of trying to make coriander into a cuda compliant library?

hughperkins commented 7 years ago

I'm not sure I understand really

what do you mean by "cuda compliant library"?
in what sense do you feel that Coriander is not what you call a "cuda compliant library"?
what would you do differently, if you were to start from scratch?

Dexdev08 commented 7 years ago

I think the question is this - why is tf coriander implementing tensorflow instead of cuda. On Mon, 28 Aug 2017 at 17:12, Hugh Perkins notifications@github.com wrote:

I'm not sure I understand really

what do you mean by "cuda compliant library"?

in what sense do you feel that Coriander is not what you call a "cuda compliant library"?

what would you do differently, if you were to start from scratch?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hughperkins/tf-coriander/issues/62#issuecomment-325302884, or mute the thread https://github.com/notifications/unsubscribe-auth/AEc4DouyeUooIqbbB9yGJ12H7XIiAAl7ks5scoSCgaJpZM4PD42F .

hughperkins commented 7 years ago

Do you mean, why do we have to modify tensorflow at all, in order to run it on Coriander?

iame6162013 commented 7 years ago

what do you mean by "cuda compliant library"?
in what sense do you feel that Coriander is not what you call a "cuda compliant library"?
what would you do differently, if you were to start from scratch?

A library where tensorflow can link against without requiring tensorflow to be changed. Something like wine https://www.winehq.org/ .
It requires you to modify tensorflow, updating tensorflow will now become a pita.
Create a more wine like approach.

Do you mean, why do we have to modify tensorflow at all, in order to run it on Coriander?

Yes, why do we have to modify tensorflow? Won't this be/Isn't this a huge pita?

(I do apologize if I'm misunderstanding the project )

hughperkins commented 7 years ago

Ah. Yes. Not having to modify Tensorflow at all would be better. Harder though :) . What are your thoughts on how we can move more towards this direction?

ghost commented 7 years ago

I think the question here is: Why make a transpiler from CUDA to OpenCL rather than an OpenCL library that implements the CUDA/CUNN API to be linked against by programs expecting CUDA.

I haven't wondered this about Coriander, but I have wondered why nobody seems to have implemented something like this yet.

HIP kinda does this, but the API function names have been changed to hip_whatever rather than cu_whatever. Possibly this was due to the atrocity that was the ruling in Oracle V Google, that APIs can be copyrighted. Or, possibly it was just pride.

It's possible that a shim between CUDA and HIP would be a small project, for someone familiar with both? That was my first thought when looking at HIP: https://rocm.github.io/documentation.html (https://rocm.github.io/documentation.html)

August 28, 2017 10:36 AM, "Hugh Perkins" wrote: Do you mean, why do we have to modify tensorflow at all, in order to run it on Coriander?

—

You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub (https://github.com/hughperkins/tf-coriander/issues/62#issuecomment-325308035), or mute the thread (https://github.com/notifications/unsubscribe-auth/ABHR3bw4v_plgrhoSTscjvag6QWidHRFks5scooBgaJpZM4PD42F).

hughperkins commented 7 years ago

I think the question here is: Why make a transpiler from CUDA to OpenCL rather than an OpenCL library that implements the CUDA/CUNN API to be linked against by programs expecting CUDA.

Ok. You're saying 'why make something that can compile NVIDIA® CUDA™ code into OpenCL, rather than just implementing an API, so we can run the NVIDIA® CUDA™ directly, without recompiling it?'

Taking a smallish step back, there are two parts to an NVIDIA® CUDA™ program:

there are the hostside calls, to the NVIDIA® CUDA™ api, to do things like:
- allocate GPU memory
- copy data to and from the GPU
- run NVIDIA® CUDA™ kernels, on the GPU => these bits are handled by Coriander, theoretically without needing changes to the client program source-code
there are the NVIDIA® CUDA™ GPU kernels themselves, which run on the GPU, after compilation into PTX, and then SASS. For these kernels, the question is, how to get them to run on a non-CUDA, non-PTX, non-SASS GPU?

iame6162013 commented 7 years ago

For these kernels, the question is, how to get them to run on a non-CUDA, non-PTX, non-SASS GPU?

No, that would probably be extremely difficult, and as far as I know the host program doesn't care about what the kernels look like, just that they do what they're supposed to.

Fortunately amd seems to be pushing for hip support in tensorflow, but it doesn't seem to be going anywhere. Thus a shim between CUDA and HIP would be worthless for the time being.

hughperkins commented 7 years ago

No, that would probably be extremely difficult

It is difficult to get NVIDIA® CUDA™ GPU kernels to run on OpenCL, but that is what Coriander does. Coriander can run NVIDIA® CUDA™ GPU kernels on OpenCL 1.2 GPU hardware, with no modifications to the NVIDIA® CUDA™ source-code.

iame6162013 commented 7 years ago

I think we had a little misunderstanding there, I thought you meant after the default compilation, which coriander doesn't do, I think.

hughperkins commented 7 years ago

I think there are at least 4 misunderstandings here :) . But since I'm one of the people misunderstanding the other people's potential misunderstandings, it is hard for me to clear up, or attempt to clear up, any potential misunderstandings :) . Or perhaps it is only me misunderstanding. Anyway :)

Let's go through:

For Carthal's statement, I think that the host-side API is only half the solution: you still need to somehow do something with the NVIDIA® CUDA™ GPU kernels, so they can run on OpenCL GPUs. Somehow, they need to be converted into something, eg SPIR, or OpenCL, or some proprietary language/bytecode, so they can run on non-CUDA GPUs. Coriander handles this :)

To iame6162013's original question: why do we need to modify Tensorflow? CAnt Coriander run without modifying Tensorflow?

yes, it's theoretically possible to run Tensorflow, on OpenCL, without modifying Tensorflow sourcecode at all. I've taken a compromise approach for now, of modifying Tensorflow a bit, but not very much. If you can modify tf-coriander, and coriander, to need fewer changes to Tensorflow, or ideally no changes at all, that would be extremely useful :)
no, I'm not aiming for runtime compatibility: you need to at least, at a minimum, recompile the client source-code, ie in this case Tensorflow. Whilst it might be theoretically possible to provide runtime link compatibility, sort of like Wine, I'm not aiming for that, and I think that will be ~~significantly~~ harder

To what extent do thes three assertions clear up some of the questions above?

hughperkins commented 7 years ago

(as far as, what are some obstacles to binary-level compatibility, a huge obvious one is enum values. For example, let's say in the client source-code there's a constant enum usage like:

if(cudaMalloc() == CUDA_OUT_OF_MEMORY) {
...
}

This means that Coriander needs to make sure that cudaMalloc returns CUDA_OUT_OF_MEMORY value. But, what is the value of CUDA_OUT_OF_MEMORY? The client sourcecode doesnt say. This leaves a couple of options that I can see:

look at the NVIDIA® CUDA™ Toolkit source-code. This seems undesirable, to avoid copyright issues etc
simply create a constant with the same name, but a nonsense value clearly different. This is the approach I've taken :), eg here:

https://github.com/hughperkins/coriander/blob/master/include/cocl/cocl_memory.h#L33-L38

enum cudaMemcpyKind {  // name used by thrust, in trivial_copy.inl, line 55 ish
    cudaMemcpyDeviceToHost=111,
    cudaMemcpyHostToDevice=222,
    cudaMemcpyDeviceToDevice=333,
    cudaMemcpyDefault=444  // from thrust, trivial_copy.inl
};

This means we need to at least recompile the client source-code, even if we dont need to actually modify it.

iame6162013 commented 7 years ago

So, if I where to create this host-side API, and compile the kernels with coriander, would I then be capable of ~running~ compiling & running tensorflow without any changes to their source code?

look at the NVIDIA® CUDA™ Toolkit source-code. This seems undesirable, to avoid copyright issues etc

Yes, a clean room design is the preferable.

This means we need to at least recompile the client source-code, even if we dont need to actually modify it.

That shouldn't be a problem for the use cases I care about. :)

hughperkins commented 7 years ago

So, if I where to create this host-side API, and compile the kernels with coriander, would I then be capable of running compiling & running tensorflow without any changes to their source code?

In fact, Coriander already implements much of the NVIDIA® CUDA™ API.

So, you may say, "what was actually changed in Tensorflow, in the tf-coriander repository?". Good question! Mostly it comes down to:

the build process. How are you going to build against Coriander? Tensorflow is built using bazel, which enforces a repeatable build process. Presumably you need to modify this in some way, to compile against Coriander, instead of compiling/linking against NVIDIA® CUDA™ Toolkit?
disabling all the stuff that doesnt work yet. By default, no GPU stuff is compiled, in Tensorflow, unless you define the #define GOOGLE_CUDA. I left this undefined, but then gradually uncommented out stuff, as I implemented it in Coriander. You have two options really:
- provide stubs for all the NVIDIA® CUDA™ calls that Tensorflow uses, in Coriander, and then define GOOGLE_CUDA, or
- ... well, that's the only option really :) . At least, if you want to avoid the current situation, where we leave GOOGLE_CUDA undefined, and then just uncomment this bit by bit
- examples of progressively uncommenting GOOGLE_CUDA:
  - add scatter: https://github.com/hughperkins/tf-coriander/commit/a1813a12b2f30b47b7f0c7b38bc77d72357fe6d2
  - add gather: https://github.com/hughperkins/tf-coriander/commit/7b9bb6375fc85736b524256dbc359a505355841d
the occasional bugfix, where eg code runs ok on Ubuntu, but there's a bug on Mac OS X, for whatever reason. These are few, though time-consuming

"But!", you say, "Whats all that nonsense in tensorflow/stream_executor/cl?". Well, yes, arguably you could remove that, and simply use the CUDA stream_executor directly. I've no idea how hard/easy that is. Both options seem to have their good/bad points. The current option, of creating a new cl stream_executor has the rather bad downside of making upgrading to new tensorflow versions look scary. On the other hand, since those files wont be modified by tensorflow upstream, they should be invariant across upgrade, approximately? Unclear.

hughperkins / tf-coriander

[Question] Why did you choose this approach? #62