morousg / cvGPUSpeedup

A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!
Apache License 2.0

Explore runtime cuda compilation #90

Open morousg opened 7 months ago

morousg commented 7 months ago

One of the limitations of the Fused Kernel library is that we can't define kernels at runtime.

In this issue, we want to explore the option of using nvrtc to dynamically compile runtime-defined kernels.

As per the nvrtc documentation, this compiler only compiles GPU code: that is, functions decorated with `__global__` or `__device__`, and global variables decorated with `__constant__`.

All included files that are not part of the CUDA SDK need to be explicitly specified. In our case, that would be the entire fused_kernel folder (except for the fused_kernel.cuh file).
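
A minimal sketch of what that could look like with the nvrtc API, assuming we load the header contents ourselves (the `loadHeader()` helper and the file list are illustrative, not existing code):

```cpp
#include <nvrtc.h>
#include <string>
#include <vector>

// Illustrative helper: reads a file from disk into a string (not shown).
std::string loadHeader(const std::string& path);

nvrtcProgram createProgram(const std::string& kernelSource) {
    // nvrtc does not search the filesystem by default, so every header that
    // is not part of the CUDA SDK is handed over in memory, together with
    // the name it is #included by.
    const std::vector<std::string> includeNames = {
        "fused_kernel/core/execution_model/grid_patterns.cuh" /* , ... */ };
    std::vector<std::string> contents;
    std::vector<const char*> headerPtrs, namePtrs;
    for (const auto& name : includeNames) {
        contents.push_back(loadHeader("include/" + name));
        namePtrs.push_back(name.c_str());
    }
    for (const auto& c : contents) headerPtrs.push_back(c.c_str());

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernelSource.c_str(), "runtime_kernel.cu",
                       static_cast<int>(headerPtrs.size()),
                       headerPtrs.data(), namePtrs.data());
    return prog;
}
```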

In our library, all `__global__` functions are in the file include/fused_kernel/core/execution_model/grid_patterns.cuh. So far, there are 3 of them.

When compiling, we have to tell nvrtc which of the `__global__` functions we want to compile, and since they are all template functions, we also have to specify the template parameters.

The template parameters will be the Device Functions. For example, `Read<PerThreadRead<_2D, uchar4>>` would be the first template parameter; the rest of the parameters can be any combination of `Unary<>`, `Binary<>` or `MidWrite<>` Device Functions, and the last one will be a `Write<>` Device Function.
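
A minimal sketch of how nvrtc could be told which instantiation to generate, using its name-expression mechanism (the kernel name `cuda_transform` and the `SaturateCast` operation are invented for the example; the real `__global__` functions are the ones in grid_patterns.cuh):

```cpp
#include <nvrtc.h>
#include <string>

void compileInstantiation(nvrtcProgram prog) {
    // Fully-instantiated kernel: first a Read<>, then any Unary<>/Binary<>/
    // MidWrite<> Device Functions, and a Write<> at the end.
    const std::string nameExpr =
        "cuda_transform<Read<PerThreadRead<_2D, uchar4>>, "
        "Unary<SaturateCast<uchar4, float4>>, "
        "Write<PerThreadWrite<_2D, float4>>>";

    // Ask nvrtc to instantiate and keep this specialization.
    nvrtcAddNameExpression(prog, nameExpr.c_str());

    const char* opts[] = { "--std=c++17" };
    nvrtcCompileProgram(prog, 1, opts);

    // The mangled name is what cuModuleGetFunction() will need later.
    const char* loweredName = nullptr;
    nvrtcGetLoweredName(prog, nameExpr.c_str(), &loweredName);
}
```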

Therefore, what we need is an automatic way to translate a Device Function type into a string representing its full type. For instance, the Device Function `Read<PerThreadRead<_2D,uchar4>>` should be translated into the string `"Read<PerThreadRead<_2D,uchar4>>"` or `"ReadDeviceFunction<PerThreadRead<_2D,uchar4>>"`. Notice that `Read` is an alias of `ReadDeviceFunction`.
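
One self-contained way to get that string, sketched below, is to parse the compiler-generated function signature. Since the compiler prints canonical type names, an alias like `Read` would come out as `ReadDeviceFunction`, which is fine per the above (GCC/Clang shown; MSVC would need `__FUNCSIG__` and different offsets):

```cpp
#include <string>
#include <string_view>

// Returns the human-readable name of T, e.g.
// typeToString<Read<PerThreadRead<_2D, uchar4>>>() ->
//   "ReadDeviceFunction<PerThreadRead<_2D, uchar4>>"
template <typename T>
std::string typeToString() {
    // __PRETTY_FUNCTION__ contains something like
    // "std::string typeToString() [with T = Foo]" (GCC) or "[T = Foo]" (Clang).
    const std::string_view sig = __PRETTY_FUNCTION__;
    const auto start = sig.find("T = ") + 4;
    const auto end = sig.find_first_of(";]", start);
    return std::string(sig.substr(start, end - start));
}
```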

Another thing we need is a class that internally uses nvrtc and accumulates the sequence of different instances of Device Functions (which contain the Device Function parameters). It should also provide a method to compile the kernel, and another to execute it.
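
A hypothetical shape for such a class (all names illustrative); it records each Device Function's type string and a byte copy of the instance, which assumes Device Functions are trivially copyable:

```cpp
#include <cuda.h>
#include <string>
#include <type_traits>
#include <vector>

template <typename T>
std::string typeToString();  // from the sketch above

class RuntimeFusedKernel {
    std::vector<std::string> typeStrings;   // one entry per Device Function
    std::vector<std::vector<char>> params;  // byte copies of the instances
    CUfunction kernel = nullptr;

public:
    template <typename DF>
    void add(const DF& df) {
        static_assert(std::is_trivially_copyable_v<DF>,
                      "Device Functions are copied as raw bytes here");
        typeStrings.push_back(typeToString<DF>());
        const char* bytes = reinterpret_cast<const char*>(&df);
        params.emplace_back(bytes, bytes + sizeof(DF));  // type is erased here
    }

    void compile();                // nvrtc + cuModuleLoadData + cuModuleGetFunction
    void launch(CUstream stream);  // cuLaunchKernel with the stored parameters
};
```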

Compiled kernels should be stored in a data structure that allows us to check whether a given kernel has already been compiled.
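
A simple option, sketched here, is to key the cache by the fully-instantiated kernel name, so that a repeated combination of Device Functions never hits nvrtc twice (`compileWithNvrtc()` is a placeholder):

```cpp
#include <cuda.h>
#include <string>
#include <unordered_map>

CUfunction compileWithNvrtc(const std::string& instantiatedName);  // placeholder

std::unordered_map<std::string, CUfunction> kernelCache;

CUfunction getOrCompile(const std::string& instantiatedName) {
    const auto it = kernelCache.find(instantiatedName);
    if (it != kernelCache.end()) return it->second;  // cache hit: no recompilation
    const CUfunction f = compileWithNvrtc(instantiatedName);
    kernelCache.emplace(instantiatedName, f);
    return f;
}
```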

How do we store instances of different types in C++? We can use `std::any`, or store them in an `std::vector` and then get a pointer with `std::addressof()` to pass them as `void*` to the kernel. We may also need to store the number of bytes each instance occupies.

Since we will have to use the CUDA Driver API to launch the kernel, we will have to see how to pass structs via the Driver API, ideally without allocating memory. It would be great if we didn't need to recover the type of the parameters, because at the moment we launch the kernel we will only have the strings representing the types and `void*` pointers to the parameters. If we want to store the instances in an array instead of storing pointers to them, we can explore using `std::any` and see if we can safely obtain a `void*` to them.
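
For reference, `cuLaunchKernel()` already takes the arguments as an array of `void*`, one per kernel parameter, each pointing at the argument's value; structs are passed by value this way with no device allocation, and the driver takes the sizes from the kernel image itself, so we may not even need to store them. A sketch using the byte-copied parameters from above:

```cpp
#include <cuda.h>
#include <vector>

void launchKernel(CUfunction kernel, CUstream stream,
                  std::vector<std::vector<char>>& params,
                  unsigned gridX, unsigned blockX) {
    // One void* per kernel argument, each pointing at the argument's bytes.
    std::vector<void*> kernelParams;
    for (auto& p : params) {
        kernelParams.push_back(p.data());
    }
    cuLaunchKernel(kernel,
                   gridX, 1, 1,            // grid dimensions
                   blockX, 1, 1,           // block dimensions
                   0,                      // dynamic shared memory bytes
                   stream,
                   kernelParams.data(),    // argument pointers
                   nullptr);               // no packed "extra" buffer
}
```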

There are some other considerations to take into account, but we will leave them for another issue.