lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Feature/large kernel args #1364

Closed maddyscientist closed 1 year ago

maddyscientist commented 1 year ago

This PR adds support for large kernel arguments to QUDA.

weinbe2 commented 1 year ago

Should large kernel args be added to the tunekey string, at least for kernels whose argument is larger than 4KB? Or do the kernel arg vs constant memory versions of the functions have appropriately different kernel names?

weinbe2 commented 1 year ago

I have a few outstanding questions + comments, but overall I'm excited about this PR. I'm taking it for a spin now, and pending issues there and thoughts on my various comments this should be good to go.

maddyscientist commented 1 year ago

> Should large kernel args be added to the tunekey string, at least for kernels whose argument is larger than 4KB? Or do the kernel arg vs constant memory versions of the functions have appropriately different kernel names?

Thanks Evan for this suggestion. I've implemented this idea here: https://github.com/lattice/quda/pull/1364/commits/bad9df8302f1e739721dcb6ca3d5b79193355bdd
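
For readers following the thread, the suggestion can be illustrated roughly as follows. This is a minimal sketch of the general idea, not the code in the linked commit: the names `max_small_arg_size`, `SmallArg`, `LargeArg`, `make_tune_aux` and the `",large_arg"` marker are all hypothetical and are not QUDA identifiers. The only detail taken from the discussion is the 4KB threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Threshold taken from the discussion above (4KB); the name is hypothetical.
constexpr std::size_t max_small_arg_size = 4096;

// Hypothetical stand-ins for kernel argument structs (not QUDA types).
struct SmallArg { char data[1024]; };
struct LargeArg { char data[8192]; };

// True if an argument of this type is larger than the 4KB threshold.
template <typename Arg>
constexpr bool exceeds_small_arg_limit() { return sizeof(Arg) > max_small_arg_size; }

// Build the auxiliary tuning string. Only arguments above the threshold can
// take two different paths (constant memory vs. large kernel argument), so
// only those get a marker that keeps the cached tunings apart.
template <typename Arg>
std::string make_tune_aux(const std::string &aux, bool large_args_enabled)
{
  assert(!aux.empty()); // sketch assumes a non-empty base string
  if (exceeds_small_arg_limit<Arg>() && large_args_enabled) return aux + ",large_arg";
  return aux;
}
```

With this layout, `make_tune_aux<LargeArg>("prec=4", true)` yields `"prec=4,large_arg"`, while `SmallArg` keeps the unmodified string in both modes, so launch parameters tuned for one path would not be reused for the other.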

maddyscientist commented 1 year ago

I've now done some testing of the effect of this optimization on Ampere, and the speedup is impressive. The results below are for the block caxpy kernel running in single precision with 32x32 (performance in GFLOPS):

 L   constant   large arg
 2       64.4       100.0
 4        977        1500
 6       2720        3640
 8       4760        6080
10       4680        6440
12       3170        5120
14       3130        4970
16       3190        4960

Moreover, the minimum runtime decreases from ~12 us to <8 us.
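
To put the figures above in relative terms, the per-row speedup can be computed directly from the reported GFLOPS values. The snippet below is only an illustrative calculation on the numbers quoted in this comment; the `Row` struct and function names are hypothetical helpers, not part of QUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One benchmark row: L, then GFLOPS for the constant-memory and
// large-kernel-argument variants, as reported above.
struct Row {
  int L;
  double constant_gflops;
  double large_arg_gflops;
};

// The eight rows quoted in the comment above, copied verbatim.
inline std::vector<Row> benchmark_rows()
{
  return {{2, 64.4, 100.0},  {4, 977, 1500},   {6, 2720, 3640},  {8, 4760, 6080},
          {10, 4680, 6440},  {12, 3170, 5120}, {14, 3130, 4970}, {16, 3190, 4960}};
}

// Ratio of large-argument throughput to constant-memory throughput.
inline double speedup(const Row &r)
{
  assert(r.constant_gflops > 0.0); // guard against division by zero
  return r.large_arg_gflops / r.constant_gflops;
}
```

Applied to these rows, the ratio falls between roughly 1.3x (at L = 8) and 1.6x (at L = 12).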