Closed maddyscientist closed 1 year ago
I have a few outstanding questions + comments, but overall I'm excited about this PR. I'm taking it for a spin now, and pending issues there and thoughts on my various comments this should be good to go.
Should large kernel args be added to the tunekey string, at least for kernels whose argument is larger than 4KB? Or do the kernel arg vs constant memory versions of the functions have appropriately different kernel names?
Thanks Evan for this suggestion. I've implemented this idea here: https://github.com/lattice/quda/pull/1364/commits/bad9df8302f1e739721dcb6ca3d5b79193355bdd
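As a rough illustration of the suggestion above (not the code in the linked commit), the sketch below appends a marker to a tuning key's auxiliary string when the argument struct exceeds 4 KB and the large-argument path is enabled. The names `tune_aux`, `max_small_arg_size`, `SmallArg`, `BigArg` and the `,large_arg` suffix are hypothetical and chosen only for this example; the point is that the two code paths never share a cached tuning entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical threshold: the 4 KB argument size mentioned in the review comment.
constexpr std::size_t max_small_arg_size = 4096;

// Hypothetical helper: returns the aux string of a tuning key, with a marker
// appended when the argument struct would take the large-kernel-arg path.
template <typename Arg>
std::string tune_aux(const std::string &base_aux, bool large_arg_enabled)
{
  std::string aux = base_aux;
  if (large_arg_enabled && sizeof(Arg) > max_small_arg_size) aux += ",large_arg";
  return aux;
}

// Illustrative argument structs, one below and one above the threshold.
struct SmallArg { char data[256]; };
struct BigArg { char data[8192]; };

// Small usage demo: only the large struct with the feature on gets the marker.
inline void tune_aux_demo()
{
  assert(tune_aux<SmallArg>("prec=4", true) == "prec=4");
  assert(tune_aux<BigArg>("prec=4", true) == "prec=4,large_arg");
  assert(tune_aux<BigArg>("prec=4", false) == "prec=4");
}
```

An alternative with the same effect would be to give the two variants different kernel names, which is the other option raised in the question.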
I've now done some testing of the effect of this optimization on Ampere, and the speedup is impressive. The table below shows results for the block caxpy kernel running in single precision with 32x32 (performance in GFLOPS):
| L | constant (GFLOPS) | large arg (GFLOPS) |
|---|---|---|
| 2 | 64.4 | 100.0 |
| 4 | 977 | 1500 |
| 6 | 2720 | 3640 |
| 8 | 4760 | 6080 |
| 10 | 4680 | 6440 |
| 12 | 3170 | 5120 |
| 14 | 3130 | 4970 |
| 16 | 3190 | 4960 |
Moreover, the minimum runtime decreases from ~12 us to <8 us.
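To put a number on the speedup, the short sketch below divides the large-arg GFLOPS by the constant-memory GFLOPS for each L in the results above. The struct and function names are made up for this example; the data are copied verbatim from the table. The ratios range from roughly 1.28x (L = 8) to roughly 1.62x (L = 12).

```cpp
#include <algorithm>
#include <array>

// One row of the benchmark table: L, constant-memory GFLOPS, large-arg GFLOPS.
struct Row {
  int L;
  double constant_gflops;
  double large_arg_gflops;
};

// Values copied from the table (block caxpy, single precision, Ampere).
constexpr std::array<Row, 8> results = {{
  {2, 64.4, 100.0},
  {4, 977.0, 1500.0},
  {6, 2720.0, 3640.0},
  {8, 4760.0, 6080.0},
  {10, 4680.0, 6440.0},
  {12, 3170.0, 5120.0},
  {14, 3130.0, 4970.0},
  {16, 3190.0, 4960.0},
}};

// Speedup of the large-arg variant relative to the constant-memory variant.
constexpr double speedup(const Row &r) { return r.large_arg_gflops / r.constant_gflops; }

// Smallest speedup over all rows.
inline double min_speedup()
{
  double s = speedup(results[0]);
  for (const Row &r : results) s = std::min(s, speedup(r));
  return s;
}

// Largest speedup over all rows.
inline double max_speedup()
{
  double s = speedup(results[0]);
  for (const Row &r : results) s = std::max(s, speedup(r));
  return s;
}
```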
This PR adds large kernel argument support to QUDA:
- `QUDA_LARGE_KERNEL_ARG` enables utilization of this feature (for now left as `OFF` by default)
- With the feature enabled, the `constant_kernel_arg.h` header serves no purpose, and its contents are hidden from the compiler