hughperkins / cltorch

An OpenCL backend for torch.

[Feature request] Realtime compiled kernels #13

Closed szagoruyko closed 9 years ago

szagoruyko commented 9 years ago

Following https://github.com/hughperkins/cltorch/issues/2. The API has to be similar to cutorch-rtc (example here https://github.com/szagoruyko/cutorch-rtc/blob/master/test.lua#L95), maybe with different names for the functions (without PTX, for example). So there have to be two functions:

function cltorch.compile(string)

taking the kernel code as the argument and storing the compiled code somewhere inside cltorch and

function cltorch.launch(kernel_name, arguments, global_work_offset, global_work_size, local_work_size)

arguments in the form of a table (in cutorch-rtc I did it with ffi here https://github.com/szagoruyko/cutorch-rtc/blob/master/init.lua#L8). In this form only textures are left out; I still haven't thought about how to add them to cutorch-rtc. local_work_size should be taken from the max device properties if not specified. Kernels should have just raw pointers as arguments; I wouldn't bother with strides, just enforce contiguity of input and output tensors. Is there a way to compile with CPU support, btw?
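
For illustration, here is a rough sketch of how those two proposed functions might be used together (the kernel itself, its name, and the exact calling convention are made up, not an existing API):

require 'cltorch'

-- hypothetical usage of the proposed API; the kernel is illustrative only
cltorch.compile([[
kernel void add_three(global float *out, global const float *in, const int n) {
    int i = get_global_id(0);
    if(i < n) {
        out[i] = in[i] + 3.0f;
    }
}
]])

x = torch.ClTensor({3,5,2})
y = torch.ClTensor({0,0,0})

-- arguments as a table of contiguous tensors and scalars;
-- local_work_size omitted, so it would fall back to the device maximum
cltorch.launch('add_three', {y, x, x:nElement()}, 0, x:nElement())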

hughperkins commented 9 years ago

Hmmm, it's lower-level than I imagined it. For example, I'm not sure I envisaged the client needing to provide the grid and block sizes. I suppose I imagined something like apply/map/map2, just with bigger kernels. Can you perhaps give me some higher-level background on your use-case? At the moment, in my head I'm just imagining apply/map etc really. What kinds of things do you want to do that don't fit into apply/map?

arguments as a table makes sense. I suppose it can initially take ClTensors and floats. Can add more types later, if needed.

As for contiguity, I think that should possibly be an option somehow, but obviously if you're going to be the first/main client, the default option (and the only initial option) should probably be configured to be the one that you will use.

szagoruyko commented 9 years ago

With this level you'll basically be able to write nn modules directly in lua. So clnn could be just a lua library. You code in lua using ClTensors, and if there is a need to speed something up you write a kernel.

Then actually contiguity is not a problem; one could provide strides as arguments in the table.
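
(A hypothetical sketch of what that could look like, reusing the proposed compile/launch functions; the kernel and its arguments are made up:)

-- hypothetical: pass the stride as a plain int argument in the table, so the
-- kernel can index a non-contiguous tensor itself instead of requiring contiguity
cltorch.compile([[
kernel void scale_strided(global float *data, const int stride, const int n) {
    int i = get_global_id(0);
    if(i < n) {
        data[i * stride] = data[i * stride] * 2.0f;
    }
}
]])

x = torch.Tensor(4,3):uniform():cl()
-- scale the first column of x: size(1) elements, stride(1) apart in memory
cltorch.launch('scale_strided', {x, x:stride(1), x:size(1)}, 0, x:size(1))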

soumith commented 9 years ago

RTC kernels will be a nightmare for debugging and for general speed of iterative development (unless compiles are cached properly and always). Theano is a good example of why we should tread this path with care.

szagoruyko commented 9 years ago

@soumith due to the nature of OpenCL, kernels are already realtime compiled, and compilations are supposed to be cached by the driver. It's just moving the place where they are compiled from C++ to lua.

hughperkins commented 9 years ago

Ok, I see. So some middle ground perhaps between 314 lines for LogSoftMax in cunn and 48 lines for LogSoftMax in clnn. Sounds like a reasonable direction to try.

I think it might be nice if the global workgroup size was set by default to the number of elements in the first output tensor, but could be overridden by simply providing it as a parameter.

@Sergey: Question: what are your thoughts on how to differentiate between input, output, and input/output tensors? Basically, we copy from host to device for input tensors, vice versa for output tensors, and both for inout tensors. But we need some way to tell the API which are in, out, etc. Or we could always copy in both directions, but that's a huge performance hit.

@Sergey: another question: up until last night, there was some doubt in my mind about the ability to pass in sizes, strides and so on as a struct without impacting performance, but it turns out that the horrible Apply performance I was getting was because passing structs to other functions by value on Intel HD gives a huge performance hit, even for inline functions. Changing to passing them by pointer gave a 20x speed boost :-P I think it would be good to make the sizes and so on available in the kernel code automatically, by just passing the tensor into the table. Since getting the user to write the parameters to the kernel by hand would be buggy, I'm thinking of getting cltorch to write these automatically. Concretely, it could be used something like this:

akernel = torch.ClKernel({output={a='ClTensor'}, input={b='ClTensor', nElements='int'}, src=[[
    if(get_global_id(0) >= nElements) {
        return;
    }
    // just some simple apply-type thing, as example:
    a_data[aInfo->storageOffset + get_global_id(0)] = 
        b_data[bInfo->storageOffset + get_global_id(0)] * 5 + 23.0f; 
]]})

akernel:run({a=x, b=y, nElements=x:nElement()})
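
(As an aside, a tiny made-up fragment contrasting the two calling styles from the Intel HD issue above; the struct and function names are illustrative, not the real cltorch ones:)

-- hypothetical OpenCL fragment, kept in a lua string for illustration only
structPassingDemo = [[
typedef struct { int storageOffset; } TensorInfo;

// struct passed by value: each call works on a private copy; this was the
// slow path on the Intel HD driver mentioned above, even for inline functions
inline float readByValue(TensorInfo info, global const float *data, int i) {
    return data[info.storageOffset + i];
}

// struct passed by pointer: same logic, roughly 20x faster on that driver
inline float readByPtr(global const TensorInfo *info, global const float *data, int i) {
    return data[info->storageOffset + i];
}
]]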

Thoughts?

hughperkins commented 9 years ago

Let me know if you're ok with this. I'm kind of interested to implement it. I don't have any great ideas for identifying in vs out tensors, other than having two or three tables: one for input, one for output, and maybe one for inout.

One other nuance: we presumably want to be able to define additional functions in the kernel, not just one single kernel function. Two options spring to mind:

Option 1: templated:

akernel = torch.ClKernel({output={a='ClTensor'}, input={b='ClTensor', nElements='int'}, src=[[
void someOtherFunc(global float *a, global float *b){
   *a = *a + *b;
}

{{kernel_header}}
    if(get_global_id(0) >= nElements) {
        return;
    }
    // just some simple apply-type thing, as example:
    a_data[aInfo->storageOffset + get_global_id(0)] = 
        b_data[bInfo->storageOffset + get_global_id(0)] * 5 + 23.0f; 
{{kernel_footer}}
]]})

Option 2, pass in additional functions through an optional parameter:

akernel = torch.ClKernel({output={a='ClTensor'}, input={b='ClTensor', nElements='int'}, src=[[
    if(get_global_id(0) >= nElements) {
        return;
    }
    // just some simple apply-type thing, as example:
    a_data[aInfo->storageOffset + get_global_id(0)] = 
        b_data[bInfo->storageOffset + get_global_id(0)] * 5 + 23.0f; 
]], funcs=[[
    void someOtherFunc(global float *a, global float *b){
       *a = *a + *b;
    }
]]})

On the whole, passing in the other methods through an additional (optional) parameter looks cleaner somehow.

hughperkins commented 9 years ago

Well, I kind of feel motivated to work on this now, so I might just write a first draft now, and then go from there.

hughperkins commented 9 years ago

Could do with a snappy name. I suck at those. RTKernel is accurate in CUDA, but kind of a tautology in OpenCL. So far, I've just come up with 'CustomKernel'. Maybe it will be Custom.cpp and Custom.h. Freestyle, perhaps. UserKernel. (Edit: going with UserKernel.cpp/.h for now)

Edit: updated the lua samples above.


(Edit: by the way, you can follow progress here: UserKernel.cpp, and the corresponding test script src/test/test_userkernel.lua)

hughperkins commented 9 years ago

Working now :-) Please install cltorch from branch custom-kernel

Run this script:

require 'cltorch'

k = torch.ClKernel({input={nElements='int', input='torch.ClTensor'},output={output='torch.ClTensor'},src=[[
   int linearId = get_global_id(0);
   if(linearId < nElements) {
     output_data[linearId] = input_data[linearId] + 3.0f;
   }
]]})
print('k', k)
k:print()

x = torch.ClTensor({3,5,2})
y = torch.ClTensor({6,4,2})
print('x before\n', x)
print('y before\n', y)

k:run({nElements=3, input=x, output=y})

print('y after\n', y)

Output:

Using Intel platform: Intel Gen OCL Driver
Using device: Intel(R) HD Graphics BroadWell U-Processor GT2
k   torch.ClKernel
source=   int linearId = get_global_id(0);
   if(linearId < nElements) {
     output_data[linearId] = input_data[linearId] + 3.0f;
   }

x before
     3
 5
 2
[torch.ClTensor of size 3]

y before
     6
 4
 2
[torch.ClTensor of size 3]

y after
     6
 8
 5
[torch.ClTensor of size 3]

Code here: src/UserKernel.cpp

Edit: can pass options to :run now, like:

k:run({nElements=3, input=x, output=y}, {numWorkgroups=10, workgroupSize=32})

hughperkins commented 9 years ago

=> closed

szagoruyko commented 9 years ago

I don't see use-cases for this; most of the time your tensors have different sizes, and that's the biggest issue. Then what if you want to run on a funny subset of a tensor, or do a simple reduction. Or have functions outside of the kernel.

hughperkins commented 9 years ago
szagoruyko commented 9 years ago

Thanks for the clarification Hugh. Still, I don't see places where I could apply your current implementation in my projects. Is there at least a way to expose clContext, clDevice and clBuffer pointers to lua? I could implement this myself sometime in the future when I need it; right now I've got many things that need to be implemented in CUDA first.

hughperkins commented 9 years ago

Looking at your original request, I see no mention of a requirement to expose context, device, or buffer pointers. However, it's not hard to expose these. But... they'd just be an opaque black box I guess, like a userdata, so you'd probably need to define what you want to do with them. I guess the most obvious thing one might want to do is use multiple command queues? (Edit: or you'd somehow use these through FFI, so the raw pointer is sufficient?)

szagoruyko commented 9 years ago

so the original idea of this PR was: you have an OpenCL kernel, you compile it and run it. Now what you did is: you have to write something that looks like an OpenCL kernel, or edit an existing kernel, then compile it and run it. Kernels might come from different sources: you might load them from a different library, or store them in a separate file to be able to compile and profile or analyze them with some utility program, so it's good to have the same language. That's what cutorch-rtc allows, but it's something new in CUDA and natural in OpenCL.

So if you exposed pointers to the opencl structures, I could marry cltorch and lua-opencl to get this low-level access together with the power of torch tensors.

hughperkins commented 9 years ago

I think you need to think through how exactly such kernels would look, and how exactly you would use them. All the kernels you've provided me to date have been CUDA kernels, and you cannot typically pass a CUDA kernel to OpenCL, for a number of reasons, not just the syntax.

So, I think you need to think about how exactly your kernels will look, where they will come from, and how you will use them. Maybe even try to implement what you want yourself, and see how it looks?

hughperkins commented 9 years ago

By the way, I have no objection to your providing an alternative implementation of this. Either from scratch, or a Frankenversion of what I have; both are ok. Either in a fork of cltorch, or, you know what, one of the super-cool things about Torch, cf. caffe, is how easy it is to make new modules and monkey-patch things in. So you could create e.g. cltorch-rtc, and port cutorch-rtc into that. Happy to answer any technical questions you might have on how to approach that, by the way.

szagoruyko commented 9 years ago

Hi Hugh, I finally have an example for you here https://github.com/szagoruyko/cunn-rtc/blob/master/im2col.lua and here https://github.com/szagoruyko/cunn-rtc/blob/master/SpatialConvolutionRTC.lua. im2col and col2im are compiled at runtime, and it is a bit faster in some cases. cudnn is many times faster of course, but the point of this was to have a clean convolution in lua to experiment with. I think you could have the same lua wrapper and just write im2col and col2im opencl wrappers (it doesn't matter whether they're on the C++ side or in lua like mine), and probably it would be faster too.

hughperkins commented 9 years ago

Thanks! Yes, those should be baked in. But we don't need user custom kernels for this; it can be done directly in SpatialConvolutionMM, e.g. see how Gather and Scatter bake in the loop sizes: https://github.com/hughperkins/cltorch/blob/master/src/lib/THClGather.cl
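
(For what it's worth, a minimal sketch of that 'bake in the loop sizes' idea, assuming a plain string-substitution step before compilation; the kernel and placeholder names are made up, not the real THClGather ones:)

-- hypothetical: substitute the loop bound into the source before building,
-- so the OpenCL compiler sees a constant bound and can unroll the loop
template = [[
kernel void scale_repeat(global float *out, global const float *in, const int n) {
    int i = get_global_id(0);
    if(i >= n) {
        return;
    }
    float v = in[i];
    for(int rep = 0; rep < {{NUM_REPEATS}}; rep++) {
        v = v * 2.0f;
    }
    out[i] = v;
}
]]
src = template:gsub('{{NUM_REPEATS}}', tostring(4))
-- src can then be compiled like any other kernel source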

szagoruyko commented 9 years ago

yes sure that would work too