vgvassilev / clad

clad -- automatic differentiation for C/C++

Call to `clad::gradient` for CUDA kernels cannot be compiled for GPU #1036

Closed kchristin22 closed 1 week ago

kchristin22 commented 1 month ago

Since global kernels cannot be called from other device functions the way normal device functions can, the following call can only be compiled for the host, like so:

#ifndef __CUDA_ARCH__
    auto kernel_g = clad::gradient(kernel);
#endif

When executing the derived kernel, though, the device needs to recognize (i.e. "see") it. Since the gradient was only compiled for the host, the device cannot find the kernel and reports "device symbol not found".
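To make the failure mode concrete, a minimal sketch of the usage that triggers it (names, buffers, and the exact execute signature are illustrative):

#include "clad/Differentiator/Differentiator.h"
#include <cuda_runtime.h>

__global__ void kernel(double *in, double *out) { out[0] = in[0] * in[0]; }

int main() {
    double *in, *out, *d_in, *d_out;
    for (double **p : {&in, &out, &d_in, &d_out}) {
        cudaMalloc(p, sizeof(double));
        cudaMemset(*p, 0, sizeof(double));
    }
#ifndef __CUDA_ARCH__
    auto kernel_g = clad::gradient(kernel);
    // Compiles for the host, but the derived kernel was never emitted for
    // the device, so launching it fails with "device symbol not found".
    kernel_g.execute(in, out, d_in, d_out);
#endif
    return 0;
}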

A simple workaround is to generate a header file with a fixed name in the build folder (or in inc), as the compiler option -fgenerate-source-file does, place the derived kernel there, and include that file in Differentiator.h. The user would then call the gradient guarded by the macro shown above.

There are other approaches that could also work, such as loading the kernel onto the GPU manually using CUDA modules. However, this may require the assembled code as a string rather than the C++ code.

Also, transferring (allocating) a function pointer from host to device using cudaMemcpyToSymbol does not work for kernels.
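For reference, the attempted pattern resembles the usual device-function-pointer idiom, which works for __device__ functions but not for __global__ kernels (names here are illustrative):

typedef void (*kernel_ptr)(double *, double *);

__device__ kernel_ptr d_fp; // device-side symbol holding a function pointer

__global__ void kernel(double *in, double *out) { out[0] = in[0]; }

int main() {
    // On the host, taking a kernel's address yields the host-side launch
    // stub, so copying it to the device gives no callable entry point:
    // kernels are launched through the runtime, not called via pointers.
    kernel_ptr h_fp = kernel;
    cudaMemcpyToSymbol(d_fp, &h_fp, sizeof(kernel_ptr));
    return 0;
}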

Alternatively, there could be a gradient_kernel function that is itself a kernel and takes the returned object as an argument (since kernels are void functions), but it is only speculation that this would work, and there may be problems with execute down the line.

vgvassilev commented 1 month ago

I think I understand the problem. Can you write a simple standalone example, with a gradient written by hand, which compiles? Then we can think of how to implement that in clad. I'd be interested to see the different scenarios in which a gradient on the device is useful, and its usage pattern in the GPU programming model.
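For instance, something along these lines: a trivial add kernel with a hand-written reverse-mode gradient that is itself a kernel (purely illustrative; the gradient body mirrors what clad generates for this kernel later in this thread):

#include <cuda_runtime.h>

__global__ void add(double *a, double *b, double *c) { c[0] = a[0] + b[0]; }

// Hand-written reverse-mode gradient of `add`: the seed arrives in _d_c
// and is accumulated into _d_a and _d_b.
__global__ void add_grad(double *a, double *b, double *c,
                         double *_d_a, double *_d_b, double *_d_c) {
    double _r_d0 = _d_c[0];
    _d_c[0] = 0;
    _d_a[0] += _r_d0;
    _d_b[0] += _r_d0;
}

int main() {
    double *a, *b, *c, *d_a, *d_b, *d_c;
    for (double **p : {&a, &b, &c, &d_a, &d_b, &d_c}) {
        cudaMalloc(p, sizeof(double));
        cudaMemset(*p, 0, sizeof(double));
    }
    double one = 1.0;
    cudaMemcpy(d_c, &one, sizeof(double), cudaMemcpyHostToDevice); // seed
    add<<<1, 1>>>(a, b, c);
    add_grad<<<1, 1>>>(a, b, c, d_a, d_b, d_c);
    cudaDeviceSynchronize();
    return 0;
}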

kchristin22 commented 1 month ago

I made some progress with this and got it working in the cuda-compilation-support branch. At compile time, whenever a global kernel is differentiated, the generated code is stored in a file, compiled using clang CUDA, and its GPU assembly (PTX code) is extracted into another file. The latter file can be loaded onto the GPU at runtime, its function handle acquired, and the kernel launched with a specified configuration.
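The runtime half of that pipeline, sketched with the CUDA Driver API (error checking omitted; the kernel's visible name in the PTX is assumed known, as noted below):

#include <cuda.h>
#include <fstream>
#include <sstream>
#include <string>

// Load a PTX file, look up the kernel by its visible name, and launch it
// with the given configuration.
void launch_from_ptx(const char *ptx_path, const char *kernel_name,
                     void **args, unsigned grid, unsigned block) {
    std::ifstream f(ptx_path);
    std::stringstream ss;
    ss << f.rdbuf();
    std::string ptx = ss.str();

    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;
    cuModuleLoadData(&mod, ptx.c_str());
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, kernel_name);
    cuLaunchKernel(fn, grid, 1, 1, block, 1, 1,
                   /*sharedMemBytes=*/0, /*stream=*/nullptr, args,
                   /*extra=*/nullptr);
    cuCtxSynchronize();
}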

Note that the code in the branch was written only to see what works, and some things were taken for granted (the visible name of the kernel in the PTX file, for instance); of course, the CUDA module instructions should not be visible to the user.

I want to work on:

kchristin22 commented 1 month ago

Update: The CUDA module commands have been successfully moved into the execute function in the cuda-compilation-support branch, so the usage shown in the test more closely resembles the final one. However, since the user should provide the launch configuration, a different execute function (i.e. execute_kernel()) could be created with an extra argument for the grid configuration. The necessary assertions will be added using the m_CUDAkernel flag of the CladFunction object (a new feature included in this branch).
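A hypothetical call site for such an execute_kernel overload could look like this (the exact signature is not settled; grid and block here stand for the launch configuration):

auto kernel_g = clad::gradient(kernel);
// Hypothetical: grid configuration first, then the kernel's own arguments
// (device pointers in this example).
kernel_g.execute_kernel(dim3(grid), dim3(block), a, b, c, d_a, d_b, d_c);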

kchristin22 commented 3 weeks ago

Even when the derived kernel can be compiled for both host and device, and the two functions have the same signature, they are not linked appropriately. As a result, we need to compile the kernel dynamically after it has been compiled for the host. The source string is used to compile the kernel and produce its PTX code, which can then be used to load the module on the GPU and get its function handle. To get the function handle, the kernel's name has to be visible to the CUDA Driver API, hence the addition of extern "C" at the beginning of the function's string.

Moreover, although we can clone the global attribute of the original function onto the derived one, NVRTC cannot handle the attribute in the form __attribute__((global)); it needs to appear in the function's string as __global__. Thus, we don't clone this attribute, and when the derived kernel is printed it is shown without any global attribute. The final kernel returned by gradient, though, is a global kernel.

The PTX code is generated at compile time and passed to the CladFunction object along with the kernel's name. The module is loaded at runtime when the CladFunction object is created. A separate execute_kernel function is used to execute the kernel with a specified grid configuration. Ideally, the module would be loaded at compile time and its function handle returned to gradient, but this approach ran into linking problems when creating an AST node for the CUfunction.
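A sketch of the dynamic-compilation step with NVRTC (compile_to_ptx is a hypothetical helper; the real implementation lives in the branch):

#include <nvrtc.h>
#include <string>
#include <vector>

// Compile the derived kernel's source string to PTX. The string must carry
// extern "C" (so the name is visible to the Driver API unmangled) and the
// __global__ keyword spelled out, since NVRTC rejects
// __attribute__((global)).
std::string compile_to_ptx(const std::string &kernel_source) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernel_source.c_str(), "derived_kernel.cu",
                       0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);
    return std::string(ptx.data());
}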

kchristin22 commented 2 weeks ago

Even when the derived kernel can be compiled for both host and device, and the two functions have the same signature, they are not linked appropriately.

With the help and guidance of @parth-07, it turned out the issue was not an error in linking the same global function for device and host after all, when the two were compiled accordingly. That is, if the function returned to the gradient call on the host side is the overloaded one (used for typecasting the arguments), and the overloaded function was successfully compiled for the device as well, then the compiler should be able to trace that. The same applies if the function internal to the overload (the derivative function that is printed for the user to see) is returned.

However, when compiling for the device, the overload could not be compiled, as it would involve a global kernel launch inside another global kernel:

// overload function
__attribute__((global)) void add_grad(double *a, double *b, double *c, void *_temp__d_a0, void *_temp__d_b0, void *_temp__d_c0) {
    double *_d_a = (double *)_temp__d_a0;
    double *_d_b = (double *)_temp__d_b0;
    double *_d_c = (double *)_temp__d_c0;
    add_grad<<<1, 1>>>(a, b, c, _d_a, _d_b, _d_c);
}
// derivative function, internally called in overload
__attribute__((global)) void add_grad(double *a, double *b, double *c, double *_d_a, double *_d_b, double *_d_c) {
    double _t0 = c[0];
    c[0] = a[0] + b[0];
    {
        c[0] = _t0;
        double _r_d0 = _d_c[0];
        _d_c[0] = 0;
        _d_a[0] += _r_d0;
        _d_b[0] += _r_d0;
    }
}

This is the reason behind the error reported previously, which occurred after adding a configuration expression when creating the call to the derivative function in BuildCallExprToFunction (used when creating the overloaded function): "Illegal call target, device function expected".

Since the overloaded function is what is returned to the user after all, we can identify the global attribute of the derivative function when creating its call inside the overload, replace it with a device attribute, and keep the overload a global kernel. With this, the function's parallel behavior is correct and the kernel is visible to the device as well.
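Reconstructed from the description above, the generated pair then has this shape (the derivative body is unchanged; only the attributes and the call change):

// derivative function: emitted with the device attribute so it can be
// called directly from within the overload kernel
__attribute__((device)) void add_grad(double *a, double *b, double *c,
                                      double *_d_a, double *_d_b, double *_d_c) {
    double _t0 = c[0];
    c[0] = a[0] + b[0];
    {
        c[0] = _t0;
        double _r_d0 = _d_c[0];
        _d_c[0] = 0;
        _d_a[0] += _r_d0;
        _d_b[0] += _r_d0;
    }
}

// overload function: stays a global kernel and calls the device derivative
// directly, with no nested <<<...>>> launch
__attribute__((global)) void add_grad(double *a, double *b, double *c,
                                      void *_temp__d_a0, void *_temp__d_b0, void *_temp__d_c0) {
    double *_d_a = (double *)_temp__d_a0;
    double *_d_b = (double *)_temp__d_b0;
    double *_d_c = (double *)_temp__d_c0;
    add_grad(a, b, c, _d_a, _d_b, _d_c);
}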

As a result, this approach makes the use of the CUDA API inside clad unnecessary.