Feature/large kernel args

lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.

https://lattice.github.io/quda

Feature/large kernel args #1364

Closed maddyscientist closed 1 year ago

maddyscientist commented 1 year ago

This PR adds support for large kernel arguments to QUDA.

On Volta and above, CUDA now supports kernel arguments of up to 32,764 bytes (the previous limit was 4,096 bytes).
This functionality was introduced with CUDA 12.1, and requires NVIDIA driver >= 530.
This feature provides a significant performance boost for short-running kernels that require large kernel arguments, since the pre-kernel host-to-device memcpy is no longer needed, and we avoid a CUDA API call.
The QUDA CMake advanced option QUDA_LARGE_KERNEL_ARG enables this feature (for now left as OFF by default).
When enabled, the constant_kernel_arg.h header serves no purpose, and its contents are hidden from the compiler.
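
To make the size limits concrete, here is a minimal host-side C++ sketch of deciding at compile time whether an argument struct can be passed directly as a kernel parameter or has to be staged in a separate buffer before the launch. `ArgPath`, `choose_arg_path`, `SmallArg` and `BigArg` are hypothetical names used only for illustration, not QUDA's actual API; the two limits are the 4,096-byte and 32,764-byte kernel parameter limits.

```cpp
#include <cassert>
#include <cstddef>

// Kernel parameter limits in bytes: 4,096 traditionally, and 32,764 on
// Volta and above with CUDA 12.1 or later.
constexpr std::size_t legacy_arg_limit = 4096;
constexpr std::size_t large_arg_limit = 32764;

// How an argument struct reaches the kernel (hypothetical names).
enum class ArgPath {
  Direct,      // passed as an ordinary kernel parameter
  StagedBuffer // copied host-to-device before the launch
};

// Pick the path from the struct size and from whether the large-argument
// feature is enabled (an assumption standing in for QUDA_LARGE_KERNEL_ARG).
template <typename Arg>
constexpr ArgPath choose_arg_path(bool large_kernel_arg_enabled)
{
  return sizeof(Arg) <= (large_kernel_arg_enabled ? large_arg_limit : legacy_arg_limit) ?
    ArgPath::Direct :
    ArgPath::StagedBuffer;
}

// Two illustrative argument structs: 1 KiB and 16 KiB.
struct SmallArg { char payload[1024]; };
struct BigArg { char payload[16384]; };

// Small arguments never need staging; a 16 KiB argument only fits
// directly when the large-argument feature is enabled.
static_assert(choose_arg_path<SmallArg>(false) == ArgPath::Direct, "small arg fits the legacy limit");
static_assert(choose_arg_path<BigArg>(false) == ArgPath::StagedBuffer, "16 KiB exceeds the legacy limit");
static_assert(choose_arg_path<BigArg>(true) == ArgPath::Direct, "16 KiB fits the large limit");
```

The `static_assert` lines double as usage examples: the snippet compiles only if each struct takes the expected path.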

weinbe2 commented 1 year ago

Should large kernel args be added to the tunekey string, at least for kernels whose argument is larger than 4KB? Or do the kernel arg vs constant memory versions of the functions have appropriately different kernel names?

weinbe2 commented 1 year ago

I have a few outstanding questions + comments, but overall I'm excited about this PR. I'm taking it for a spin now, and pending issues there and thoughts on my various comments this should be good to go.

maddyscientist commented 1 year ago

> Should large kernel args be added to the tunekey string, at least for kernels whose argument is larger than 4KB? Or do the kernel arg vs constant memory versions of the functions have appropriately different kernel names?

Thanks Evan for this suggestion. I've implemented this idea here: https://github.com/lattice/quda/pull/1364/commits/bad9df8302f1e739721dcb6ca3d5b79193355bdd
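
For readers who do not want to open the commit, the idea can be sketched as follows. This is a hedged illustration only: the helper `tag_tune_aux` and the `",large_arg"` marker are hypothetical, and the linked commit is the authoritative implementation and may differ. The point is that kernels whose argument exceeds 4 KB get a distinct tuning key when the large-argument path is active, so cached tuning results from the constant-memory path are not reused.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Arguments above this size take a different launch path when large
// kernel arguments are enabled.
constexpr std::size_t legacy_arg_limit_bytes = 4096;

// Hypothetical helper: append a marker to the auxiliary tuning string
// only when the large-argument path would actually be used.
inline std::string tag_tune_aux(std::string aux, std::size_t arg_bytes, bool large_kernel_arg_enabled)
{
  if (large_kernel_arg_enabled && arg_bytes > legacy_arg_limit_bytes) aux += ",large_arg";
  return aux;
}
```

With this sketch, `tag_tune_aux("prec=4", 8192, true)` yields `"prec=4,large_arg"`, while the same call with the feature disabled, or with a 1024-byte argument, returns the string unchanged.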

maddyscientist commented 1 year ago

I've now done some testing of the effect of this optimization on Ampere, and the speedup is impressive. Below is for the block caxpy kernel running in single precision with 32x32 (performance in GFLOPS):

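| L | constant | large arg |
|---|----------|-----------|
| 2 | 64.4 | 100.0 |
| 4 | 977 | 1500 |
| 6 | 2720 | 3640 |
| 8 | 4760 | 6080 |
| 10 | 4680 | 6440 |
| 12 | 3170 | 5120 |
| 14 | 3130 | 4970 |
| 16 | 3190 | 4960 |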

Moreover, the minimum runtime decreases from ~12 us to <8 us.
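
The per-size speedup implied by these numbers is the ratio of the two columns, which works out to roughly 1.3x to 1.6x across the sizes tested. A small C++ sketch of that arithmetic, using only the values from the table (the names `Sample`, `speedup` and `print_speedups` are made up for this example):

```cpp
#include <cassert>
#include <cstdio>

// One row of the benchmark table: problem size L and GFLOPS for each path.
struct Sample {
  int L;
  double constant_gflops;
  double large_arg_gflops;
};

// Values copied from the table above.
constexpr Sample samples[] = {
  {2, 64.4, 100.0}, {4, 977, 1500},   {6, 2720, 3640},  {8, 4760, 6080},
  {10, 4680, 6440}, {12, 3170, 5120}, {14, 3130, 4970}, {16, 3190, 4960}};

// Speedup of the large-argument path over the constant-memory path.
constexpr double speedup(const Sample &s) { return s.large_arg_gflops / s.constant_gflops; }

// Print one line per problem size.
inline void print_speedups()
{
  for (const Sample &s : samples) std::printf("L=%2d  speedup=%.2fx\n", s.L, speedup(s));
}
```

Calling `print_speedups()` from a `main` function lists the ratio for each L; the smallest gain is at L=8 and the largest at L=12.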