lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Kernel launches should use cudaLaunchKernel #372

Closed: maddyscientist closed this issue 7 years ago

maddyscientist commented 8 years ago

Throughout QUDA we presently use the triple chevron syntax for launching kernels, e.g.

kernel<type><<<grid, block, shared, stream>>>(arg);

However, there exists a lower-latency API call for launching kernels, cudaLaunchKernel. For latency-bound kernels, e.g., the dslash operators, we should modify the kernel launching to use this alternate form.

void *args[] = { &arg };
cudaLaunchKernel((const void *)kernel<type>, grid, block, args, shared, stream);
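
To make the comparison concrete, here is a minimal self-contained sketch of the two launch forms; the kernel and argument struct are hypothetical stand-ins for illustration, not QUDA's actual dslash interfaces:

#include <cuda_runtime.h>

// hypothetical argument struct and kernel, for illustration only
struct Arg { float *out; const float *in; int n; };

template <typename T>
__global__ void kernel(T arg) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < arg.n) arg.out[i] = 2.0f * arg.in[i];
}

void launch(Arg &arg, dim3 grid, dim3 block, size_t shared, cudaStream_t stream) {
  // current QUDA style: triple-chevron launch
  kernel<Arg><<<grid, block, shared, stream>>>(arg);

  // runtime-API style: the kernel function pointer, grid and block dims,
  // an array of pointers to the kernel arguments, the dynamic shared
  // memory size, and the stream
  void *args[] = { &arg };
  cudaLaunchKernel((const void *)kernel<Arg>, grid, block, args, shared, stream);
}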

Due to the nature of how the dslash kernels are presently built, this is a non-trivial amount of work.

We needed to rewrite the dslash kernels anyway to simplify how they are generated, e.g., move to an all-C++ approach instead of the present Python / macro / template approach. That rewrite would give us this capability easily.
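
As a rough, hypothetical sketch of what such an all-C++ approach could look like (names and interfaces are illustrative, not QUDA's): kernel variants become template instantiations selected by ordinary C++ dispatch, rather than source files emitted by scripts.

#include <cuda_runtime.h>

// illustrative only: each kernel variant is a template instantiation,
// replacing script-generated source files
template <typename Float, int nColor>
__global__ void dslashKernel(Float *out, const Float *in, int volume) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < volume) out[i] = in[i]; // real stencil body would go here
}

// runtime parameters map onto template parameters via plain C++ dispatch
template <typename Float>
void applyDslash(Float *out, const Float *in, int volume, int nColor,
                 dim3 grid, dim3 block, cudaStream_t stream) {
  if (nColor == 3) dslashKernel<Float, 3><<<grid, block, 0, stream>>>(out, in, volume);
}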

On a hacked dslash kernel, I've verified that multi-GPU performance improves by as much as 10%.

Note that getting CUDA JIT (#362) up and running would give us this capability for free, since JIT uses the driver API, and it is this style of kernel launch that the driver API uses.
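
For reference, a minimal sketch of what such a driver-API launch looks like, assuming a CUfunction handle already obtained from a JIT-compiled module (e.g., via cuModuleGetFunction); the wrapper name is hypothetical:

#include <cuda.h>

// there is no triple-chevron form in the driver API: cuLaunchKernel is
// the launch entry point, taking explicit grid/block dimensions and an
// argument-pointer array, much like cudaLaunchKernel
void launchJit(CUfunction func, unsigned gx, unsigned gy, unsigned gz,
               unsigned bx, unsigned by, unsigned bz,
               void *arg, unsigned shared, CUstream stream) {
  void *args[] = { arg };
  cuLaunchKernel(func, gx, gy, gz, bx, by, bz, shared, stream, args, nullptr);
}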

alexstrel commented 8 years ago

A C++ generator seems to be a solution; e.g., I'm thinking about something similar to the approach adopted in the QPhiX library. It has two levels: for the kernel bodies, the low level, one can reuse the code produced by our Python scripts. However, for the high level it's necessary to rewrite the *_def.h files.
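
A minimal sketch of that two-level split; the macro below stands in for a script-generated kernel body, and all names are hypothetical:

// low level: a script-generated per-site body (faked here with a macro)
#define DSLASH_CORE(out, in, i) (out)[i] = (in)[i] /* generated stencil */

// high level: hand-written C++ that owns templates, indexing, and launch
// policy, and drops the generated body in
template <typename Float>
__global__ void dslashWrapper(Float *out, const Float *in, int volume) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= volume) return;
  DSLASH_CORE(out, in, i);
}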

maddyscientist commented 7 years ago

Closing this issue: the main motivation for doing this was that it was an alternative to #304. Since that's been done, we can close this.