lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Kernel launches should use cudaLaunchKernel #372

Closed: maddyscientist closed this issue 7 years ago

maddyscientist commented 8 years ago

Throughout QUDA we presently use the triple chevron syntax for launching kernels, e.g.

kernel<type><<<grid, block, shared, stream>>>(arg);

However, there exists a lower-latency API call for launching kernels, cudaLaunchKernel. For latency-bound kernels, e.g., the dslash operators, we should modify the kernel launching to use this alternate form.

void *args[] = { &arg };
cudaLaunchKernel((const void *)kernel<type>, grid, block, args, shared, stream);
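
To make the comparison concrete, here is a minimal self-contained sketch of the two launch forms; the kernel and argument struct are hypothetical stand-ins for illustration, not QUDA's actual dslash interfaces:

#include <cuda_runtime.h>

// hypothetical argument struct and kernel, for illustration only
struct Arg { float *out; const float *in; int n; };

template <typename T>
__global__ void kernel(T arg) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < arg.n) arg.out[i] = 2.0f * arg.in[i];
}

void launch(Arg &arg, dim3 grid, dim3 block, size_t shared, cudaStream_t stream) {
  // current QUDA style: triple-chevron launch
  kernel<Arg><<<grid, block, shared, stream>>>(arg);

  // runtime-API style: the kernel function pointer, grid and block dims,
  // an array of pointers to the kernel arguments, the dynamic shared
  // memory size, and the stream
  void *args[] = { &arg };
  cudaLaunchKernel((const void *)kernel<Arg>, grid, block, args, shared, stream);
}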

Due to the nature of how the dslash kernels are presently built, this is a non-trivial amount of work.

We needed to rewrite the dslash kernels anyway to simplify how they are generated, e.g., move to an all-C++ approach instead of the present Python / macro / template approach. That rewrite would give us this capability easily.
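
As a rough, hypothetical sketch of what such an all-C++ approach could look like (names and interfaces are illustrative, not QUDA's): kernel variants become template instantiations selected by ordinary C++ dispatch, rather than source files emitted by scripts.

#include <cuda_runtime.h>

// illustrative only: each kernel variant is a template instantiation,
// replacing script-generated source files
template <typename Float, int nColor>
__global__ void dslashKernel(Float *out, const Float *in, int volume) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < volume) out[i] = in[i]; // real stencil body would go here
}

// runtime parameters map onto template parameters via plain C++ dispatch
template <typename Float>
void applyDslash(Float *out, const Float *in, int volume, int nColor,
                 dim3 grid, dim3 block, cudaStream_t stream) {
  if (nColor == 3) dslashKernel<Float, 3><<<grid, block, 0, stream>>>(out, in, volume);
}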

On a hacked dslash kernel, I've verified that multi-GPU performance improves by as much as 10%.

Note that getting CUDA JIT (#362) up and running would give us this capability for free, since JIT uses the driver API, and it is this style of kernel launch that the driver API uses.
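
For reference, a minimal sketch of what such a driver-API launch looks like, assuming a CUfunction handle already obtained from a JIT-compiled module (e.g., via cuModuleGetFunction); the wrapper name is hypothetical:

#include <cuda.h>

// there is no triple-chevron form in the driver API: cuLaunchKernel is
// the launch entry point, taking explicit grid/block dimensions and an
// argument-pointer array, much like cudaLaunchKernel
void launchJit(CUfunction func, unsigned gx, unsigned gy, unsigned gz,
               unsigned bx, unsigned by, unsigned bz,
               void *arg, unsigned shared, CUstream stream) {
  void *args[] = { arg };
  cuLaunchKernel(func, gx, gy, gz, bx, by, bz, shared, stream, args, nullptr);
}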

alexstrel commented 8 years ago

A C++ generator seems to be a solution; e.g., I'm thinking about something similar to the approach adopted in the QPhiX library. It has two levels: for the kernel bodies, the low level, one can reuse the code produced by our Python scripts. However, for the high level it's necessary to rewrite the *_def.h files.
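
A minimal sketch of that two-level split; the macro below stands in for a script-generated kernel body, and all names are hypothetical:

// low level: a script-generated per-site body (faked here with a macro)
#define DSLASH_CORE(out, in, i) (out)[i] = (in)[i] /* generated stencil */

// high level: hand-written C++ that owns templates, indexing, and launch
// policy, and drops the generated body in
template <typename Float>
__global__ void dslashWrapper(Float *out, const Float *in, int volume) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= volume) return;
  DSLASH_CORE(out, in, i);
}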

maddyscientist commented 7 years ago

Closing this issue: the main motivation for doing this was that it was an alternative to #304. Since that's been done, we can close this.