Auto tuner should tune for best cache configuration

lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.

https://lattice.github.io/quda

Auto tuner should tune for best cache configuration #49

Open maddyscientist opened 12 years ago

maddyscientist commented 12 years ago

Currently we set the cache configuration to 48 KB of L1 and 16 KB of shared memory (Fermi). However, this isn't optimal for all kernels, and the auto tuner can actually switch away from the default cache configuration if it requests more than 16 KB of shared memory per SM.

The solution is to expand the TuneParam class to include a member variable of type enum cudaFuncCache, which will be tuned per kernel. This shouldn't be too much work, so I'm adding it to the 0.4.1 milestone.
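A minimal sketch of what this proposal could look like, not the actual QUDA code. `TuneParamSketch`, `FuncCacheSketch`, `tuneCacheConfig` and the `time_kernel` callback are all hypothetical names; the enum is a stand-in that mirrors the CUDA runtime's `cudaFuncCache` values so the example compiles without the toolkit. Real code would use `cudaFuncCache` directly and apply the choice with `cudaFuncSetCacheConfig` before launching.

```cpp
#include <cassert>
#include <functional>
#include <limits>

// Stand-in mirroring the CUDA runtime's cudaFuncCache enum (assumption:
// used here only so the sketch builds without <cuda_runtime.h>).
enum FuncCacheSketch {
  PreferNone = 0,   // cudaFuncCachePreferNone
  PreferShared = 1, // cudaFuncCachePreferShared (48 KB shared / 16 KB L1 on Fermi)
  PreferL1 = 2      // cudaFuncCachePreferL1     (16 KB shared / 48 KB L1 on Fermi)
};

// Hypothetical, simplified TuneParam; the real QUDA class has its own layout.
struct TuneParamSketch {
  unsigned block_size = 64;
  unsigned shared_bytes = 0;
  FuncCacheSketch cache_config = PreferL1; // proposed new member, tuned per kernel
};

// Try each cache preference and keep the fastest one. `time_kernel` is a
// hypothetical callback that launches the kernel with the given parameters
// and returns the elapsed time.
inline TuneParamSketch tuneCacheConfig(
    TuneParamSketch param,
    const std::function<double(const TuneParamSketch &)> &time_kernel)
{
  assert(time_kernel); // a timing callback must be supplied
  const FuncCacheSketch candidates[] = {PreferNone, PreferShared, PreferL1};
  TuneParamSketch best = param;
  double best_time = std::numeric_limits<double>::infinity();
  for (FuncCacheSketch c : candidates) {
    param.cache_config = c;
    const double t = time_kernel(param);
    if (t < best_time) {
      best_time = t;
      best = param;
    }
  }
  return best;
}
```

In a real tuner the cache preference would presumably be one more dimension of the existing search over block size and shared memory, rather than a separate pass as shown here.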

rbabich commented 12 years ago

Assigning myself, but I'm a little worried about this: "Launching a kernel with a different preference than the most recent preference setting may insert a device-side synchronization point" [1]. This could kill multi-GPU performance without the autotuner noticing, since we now tune on a per-kernel basis (rather than the full Dslash).

[1] http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__HIGHLEVEL_ge0969184de8a5c2d809aa8d7d2425484.html#ge0969184de8a5c2d809aa8d7d2425484

maddyscientist commented 12 years ago

This is a good point. We want a single cache configuration for the entire dslash. How about only allowing the cache configuration to be altered for the interior kernel, with the other kernels (packing, exterior) overriding this to do nothing?
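One way this could be structured, as a rough sketch under assumptions: a virtual hook on a tunable base class steps through cache preferences, and the packing and exterior kernels override it as a no-op. All class and function names below are hypothetical, and `CachePref` is a stand-in for `cudaFuncCache`, not QUDA's actual interface.

```cpp
// Stand-in for the CUDA runtime's cudaFuncCache enum (assumption: used so
// that the sketch compiles without the CUDA toolkit).
enum class CachePref { None, Shared, L1 };

// Hypothetical per-kernel tuning state.
struct KernelTuneState {
  CachePref cache = CachePref::L1; // current default: 48 KB L1 / 16 KB shared
};

// Hypothetical base class for tunable kernels.
class TunableKernelSketch {
public:
  virtual ~TunableKernelSketch() = default;

  // Advance to the next cache preference to try. Returns false once all
  // candidates are exhausted. By default every preference is tried.
  virtual bool advanceCachePref(KernelTuneState &s) const {
    switch (s.cache) {
      case CachePref::L1:     s.cache = CachePref::Shared; return true;
      case CachePref::Shared: s.cache = CachePref::None;   return true;
      default:                return false;
    }
  }
};

// The interior kernel inherits the default and so tunes its cache preference.
class InteriorKernelSketch : public TunableKernelSketch {};

// Packing and exterior kernels override the hook to do nothing, so they never
// request a preference of their own that differs from the interior kernel's.
class PackKernelSketch : public TunableKernelSketch {
public:
  bool advanceCachePref(KernelTuneState &) const override { return false; }
};

class ExteriorKernelSketch : public TunableKernelSketch {
public:
  bool advanceCachePref(KernelTuneState &) const override { return false; }
};

// Count how many cache preferences the tuner would try for a given kernel.
inline int countCachePrefsTried(const TunableKernelSketch &k) {
  KernelTuneState s;
  int n = 1; // the initial preference always counts
  while (k.advanceCachePref(s)) ++n;
  return n;
}
```

With this layout, only the interior kernel ever changes the preference, which is intended to avoid the preference switches between consecutive dslash kernels that the quoted documentation warns about.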

rbabich commented 12 years ago

Sounds like a plan.