hughperkins / cltorch

An OpenCL backend for torch.

[Feature request] Realtime compiled kernels #13

Closed szagoruyko closed 9 years ago

szagoruyko commented 9 years ago

Following https://github.com/hughperkins/cltorch/issues/2. The API has to be similar to cutorch-rtc (example here https://github.com/szagoruyko/cutorch-rtc/blob/master/test.lua#L95), maybe with different names for the functions (without PTX, for example). So there have to be two functions:

function cltorch.compile(string)

taking the kernel code as the argument and storing the compiled code somewhere inside cltorch and

function cltorch.launch(kernel_name, arguments, global_work_offset, global_work_size, local_work_size)

arguments in the form of a table (in cutorch-rtc I did it with ffi here https://github.com/szagoruyko/cutorch-rtc/blob/master/init.lua#L8). In this form only textures are left out; I still haven't thought about how to add them to cutorch-rtc. local_work_size should be taken from the max device properties if not specified. Kernels should have just raw pointers as arguments; I wouldn't bother with strides, just enforce contiguity of input and output tensors. Is there a way to compile with CPU support, btw?
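
For illustration, here is a rough sketch of how those two proposed functions might be used together (the kernel itself, its name, and the exact calling convention are made up, not an existing API):

require 'cltorch'

-- hypothetical usage of the proposed API; the kernel is illustrative only
cltorch.compile([[
kernel void add_three(global float *out, global const float *in, const int n) {
    int i = get_global_id(0);
    if(i < n) {
        out[i] = in[i] + 3.0f;
    }
}
]])

x = torch.ClTensor({3,5,2})
y = torch.ClTensor({0,0,0})

-- arguments as a table of contiguous tensors and scalars;
-- local_work_size omitted, so it would fall back to the device maximum
cltorch.launch('add_three', {y, x, x:nElement()}, 0, x:nElement())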

hughperkins commented 9 years ago

Hmmm, it's lower-level than I imagined it. For example, I'm not sure I envisaged the client needing to provide the grid and block sizes. I suppose I imagined something like apply/map/map2, just with bigger kernels. Can you perhaps give me some higher-level background on your use-case? At the moment, in my head I'm just imagining apply/map etc really. What kinds of things do you want to do that don't fit into apply/map?

arguments as a table makes sense. I suppose it can initially take ClTensors and floats. Can add more types later, if needed.

As for contiguity, I think that should possibly be an option somehow, but obviously if you're going to be the first/main client, the default option (and the only initial option) should probably be configured to be the one that you will use.

szagoruyko commented 9 years ago

With this level you'll basically be able to write nn modules directly in lua. So clnn could be just a lua library. You code in lua using ClTensors, and if there is a need to speed something up you write a kernel.

Then actually contiguity is not a problem; one could provide strides as arguments in the table.
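
(A hypothetical sketch of what that could look like, reusing the proposed compile/launch functions; the kernel and its arguments are made up:)

-- hypothetical: pass the stride as a plain int argument in the table, so the
-- kernel can index a non-contiguous tensor itself instead of requiring contiguity
cltorch.compile([[
kernel void scale_strided(global float *data, const int stride, const int n) {
    int i = get_global_id(0);
    if(i < n) {
        data[i * stride] = data[i * stride] * 2.0f;
    }
}
]])

x = torch.Tensor(4,3):uniform():cl()
-- scale the first column of x: size(1) elements, stride(1) apart in memory
cltorch.launch('scale_strided', {x, x:stride(1), x:size(1)}, 0, x:size(1))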

soumith commented 9 years ago

RTC kernels will be a nightmare for debugging and for general speed of iterative development (unless compiles are cached properly and always). Theano is a good example of why we should tread this path with care.

szagoruyko commented 9 years ago

@soumith due to the nature of OpenCL, kernels are already realtime compiled, and compilations are supposed to be cached by the driver. It's just moving the place where they are compiled from C++ to lua.

hughperkins commented 9 years ago

Ok, I see. So some middle ground perhaps between 314 lines for LogSoftMax in cunn and 48 lines for LogSoftMax in clnn. Sounds like a reasonable direction to try.

I think it might be nice if the global workgroup size was set by default to the number of elements in the first output tensor, but could be overridden by simply providing it as a parameter.

@Sergey: Question: what are your thoughts on how to differentiate between input, output, and input/output tensors? Basically, we copy from host to device for input tensors, vice versa for output tensors, and both for inout tensors. But we need some way to tell the API which are in, out, etc. Or we could always copy in both directions, but that's a huge performance hit.

@Sergey: another question: up until last night, there was some doubt in my mind about the ability to pass in sizes, strides and so on as a struct without impacting performance, but it turns out that the horrible Apply performance I was getting was because passing structs to other functions by value on Intel HD gives a huge performance hit, even for inline functions. Changing to passing them by pointer gave a 20x speed boost :-P I think it would be good to make the sizes and so on available in the kernel code automatically, by just passing the tensor into the table. Since getting the user to write the parameters to the kernel by hand would be buggy, I'm thinking of getting cltorch to write these automatically. Concretely, it could be used something like this:

akernel = torch.ClKernel({output={a='ClTensor'}, input={b='ClTensor', nElements='int'}, src=[[
    if(get_global_id(0) >= nElements) {
        return;
    }
    // just some simple apply-type thing, as example:
    a_data[aInfo->storageOffset + get_global_id(0)] = 
        b_data[bInfo->storageOffset + get_global_id(0)] * 5 + 23.0f; 
]]})

akernel:run({a=x, b=y, nElements=x:nElement()})
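
(As an aside, a tiny made-up fragment contrasting the two calling styles from the Intel HD issue above; the struct and function names are illustrative, not the real cltorch ones:)

-- hypothetical OpenCL fragment, kept in a lua string for illustration only
structPassingDemo = [[
typedef struct { int storageOffset; } TensorInfo;

// struct passed by value: each call works on a private copy; this was the
// slow path on the Intel HD driver mentioned above, even for inline functions
inline float readByValue(TensorInfo info, global const float *data, int i) {
    return data[info.storageOffset + i];
}

// struct passed by pointer: same logic, roughly 20x faster on that driver
inline float readByPtr(global const TensorInfo *info, global const float *data, int i) {
    return data[info->storageOffset + i];
}
]]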

Thoughts?

hughperkins commented 9 years ago

Let me know if you're ok with this. I'm kind of interested to implement it. I don't have any great ideas for identifying in vs out tensors, other than having two or three tables: one for input, one for output, and maybe one for inout.

One other nuance: we presumably want to be able to define additional functions in the kernel, not just one single kernel function. Two options spring to mind:

Option 1: templated:

akernel = torch.ClKernel({output={a='ClTensor'}, input={b='ClTensor', nElements='int'}, src=[[
void someOtherFunc(global float *a, global float *b){
   *a = *a + *b;
}

{{kernel_header}}
    if(get_global_id(0) >= nElements) {
        return;
    }
    // just some simple apply-type thing, as example:
    a_data[aInfo->storageOffset + get_global_id(0)] = 
        b_data[bInfo->storageOffset + get_global_id(0)] * 5 + 23.0f; 
{{kernel_footer}}
]]})

Option 2, pass in additional functions through an optional parameter:

akernel = torch.ClKernel({output={a='ClTensor'}, input={b='ClTensor', nElements='int'}, src=[[
    if(get_global_id(0) >= nElements) {
        return;
    }
    // just some simple apply-type thing, as example:
    a_data[aInfo->storageOffset + get_global_id(0)] = 
        b_data[bInfo->storageOffset + get_global_id(0)] * 5 + 23.0f; 
]], funcs=[[
    void someOtherFunc(global float *a, global float *b){
       *a = *a + *b;
    }
]]})

On the whole, passing in the other methods through an additional (optional) parameter looks cleaner somehow.

hughperkins commented 9 years ago

Well, I kind of feel motivated to work on this now, so I might just write a first draft now, and then go from there.

hughperkins commented 9 years ago

Could do with a snappy name. I suck at those. RTKernel is accurate in CUDA, but kind of a tautology in OpenCL. So far, I've just come up with 'CustomKernel'. Maybe it will be Custom.cpp and Custom.h. Freestyle, perhaps. UserKernel. (Edit: going with UserKernel.cpp/.h for now)

Edit: updated the lua samples above.


(Edit: by the way, you can follow progress here: UserKernel.cpp, and the corresponding test script src/test/test_userkernel.lua)

hughperkins commented 9 years ago

Working now :-) Please install cltorch from branch custom-kernel

Run this script:

require 'cltorch'

k = torch.ClKernel({input={nElements='int', input='torch.ClTensor'},output={output='torch.ClTensor'},src=[[
   int linearId = get_global_id(0);
   if(linearId < nElements) {
     output_data[linearId] = input_data[linearId] + 3.0f;
   }
]]})
print('k', k)
k:print()

x = torch.ClTensor({3,5,2})
y = torch.ClTensor({6,4,2})
print('x before\n', x)
print('y before\n', y)

k:run({nElements=3, input=x, output=y})

print('y after\n', y)

Output:

Using Intel platform: Intel Gen OCL Driver
Using device: Intel(R) HD Graphics BroadWell U-Processor GT2
k   torch.ClKernel
source=   int linearId = get_global_id(0);
   if(linearId < nElements) {
     output_data[linearId] = input_data[linearId] + 3.0f;
   }

x before
     3
 5
 2
[torch.ClTensor of size 3]

y before
     6
 4
 2
[torch.ClTensor of size 3]

y after
     6
 8
 5
[torch.ClTensor of size 3]

Code here: src/UserKernel.cpp

Edit: can pass options to :run now, like:

k:run({nElements=3, input=x, output=y}, {numWorkgroups=10, workgroupSize=32})

hughperkins commented 9 years ago

=> closed

szagoruyko commented 9 years ago

I don't see use-cases for this; most of the time your tensors have different sizes, and that's the biggest issue. Then what if you want to run on a funny subset of a tensor, or do a simple reduction. Or have functions outside of the kernel.

hughperkins commented 9 years ago
szagoruyko commented 9 years ago

Thanks for the clarification Hugh. Still, I don't see places where I could apply your current implementation in my projects. Is there at least a way to expose clContext, clDevice and clBuffer pointers to lua? I could implement this myself sometime in the future when I need it; right now I've got many things that need to be implemented in CUDA first.

hughperkins commented 9 years ago

Looking at your original request, I see no mention of a requirement to expose context, device, or buffer pointers. However, it's not hard to expose these. But... they'd just be an opaque black box I guess, like a userdata, so you'd probably need to define what you want to do with them. I guess the most obvious thing one might want to do is use multiple command queues? (Edit: or you'd somehow use these through FFI, so the raw pointer is sufficient?)

szagoruyko commented 9 years ago

so the original idea of this PR was: you have an OpenCL kernel, you compile it and run it. Now what you did is: you have to write something that looks like an OpenCL kernel, or edit an existing kernel, then compile it and run it. Kernels might come from different sources: you might load them from a different library, or store them in a separate file to be able to compile and profile or analyze them with some utility program, so it's good to have the same language. That's what cutorch-rtc allows, but it's something new in CUDA and natural in OpenCL.

So if you exposed pointers to the opencl structures, I could marry cltorch and lua-opencl to get this low-level access together with the power of torch tensors.

hughperkins commented 9 years ago

I think you need to think through how exactly such kernels would look, and how exactly you would use them. All the kernels you've provided me to date have been CUDA kernels, and you cannot typically pass a CUDA kernel to OpenCL, for a number of reasons, not just the syntax.

So, I think you need to think about how exactly your kernels will look, where they will come from, and how you will use them. Maybe even try to implement what you want yourself, and see how it looks?

hughperkins commented 9 years ago

By the way, I have no objection to your providing an alternative implementation of this. Either from scratch, or a Frankenversion of what I have; both are ok. Either in a fork of cltorch, or, you know what, one of the super-cool things about Torch, cf. caffe, is how easy it is to make new modules and monkey-patch things in. So you could create e.g. cltorch-rtc, and port cutorch-rtc into that. Happy to answer any technical questions you might have on how to approach that, by the way.

szagoruyko commented 9 years ago

Hi Hugh, I finally have an example for you here https://github.com/szagoruyko/cunn-rtc/blob/master/im2col.lua and here https://github.com/szagoruyko/cunn-rtc/blob/master/SpatialConvolutionRTC.lua. im2col and col2im are compiled at runtime, and it is a bit faster in some cases. cudnn is many times faster of course, but the point of this was to have a clean convolution in lua to experiment with. I think you could have the same lua wrapper and just write im2col and col2im opencl wrappers (it doesn't matter whether they're on the C++ side or in lua like mine), and probably it would be faster too.

hughperkins commented 9 years ago

Thanks! Yes, those should be baked in. But we don't need user custom kernels for this; it can be done directly in SpatialConvolutionMM, e.g. see how Gather and Scatter bake in the loop sizes: https://github.com/hughperkins/cltorch/blob/master/src/lib/THClGather.cl
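
(For what it's worth, a minimal sketch of that 'bake in the loop sizes' idea, assuming a plain string-substitution step before compilation; the kernel and placeholder names are made up, not the real THClGather ones:)

-- hypothetical: substitute the loop bound into the source before building,
-- so the OpenCL compiler sees a constant bound and can unroll the loop
template = [[
kernel void scale_repeat(global float *out, global const float *in, const int n) {
    int i = get_global_id(0);
    if(i >= n) {
        return;
    }
    float v = in[i];
    for(int rep = 0; rep < {{NUM_REPEATS}}; rep++) {
        v = v * 2.0f;
    }
    out[i] = v;
}
]]
src = template:gsub('{{NUM_REPEATS}}', tostring(4))
-- src can then be compiled like any other kernel source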

szagoruyko commented 9 years ago

yes sure that would work too