Open maddyscientist opened 12 years ago
Assigning myself, but I'm a little worried about this: "Launching a kernel with a different preference than the most recent preference setting may insert a device-side synchronization point" [1]. This could kill multi-GPU performance without the autotuner noticing, since we now tune on a per-kernel basis (rather than the full Dslash).
This is a good point. We only want the same cache configuration for the entire dslash. How about only allowing the cache configuration to be altered for the interior kernel, with the other kernels (packing, exterior) overriding this to do nothing?
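A minimal sketch of that proposal, not the project's actual code: every kernel exposes a hook for applying the tuned cache preference, and only the interior kernel keeps the default behaviour. The names `CachePref`, `Kernel`, `applyCacheConfig`, and `runDslash` are hypothetical. The enum stands in for CUDA's `cudaFuncCache` so the example compiles without the toolkit, and the log stands in for a real call to `cudaFuncSetCacheConfig`.

```cpp
#include <string>
#include <vector>

// Stand-in for CUDA's cudaFuncCache (assumption: mirrors
// cudaFuncCachePreferNone / PreferShared / PreferL1).
enum class CachePref { None, Shared, L1 };

// Hypothetical tuning parameters; only the cache preference matters here.
struct TuneParam {
  CachePref cache = CachePref::None;
};

struct Kernel {
  virtual ~Kernel() = default;
  virtual std::string name() const = 0;
  // Default hook: apply the tuned preference. Real code would call
  // cudaFuncSetCacheConfig here; this sketch only records the kernel name.
  virtual void applyCacheConfig(const TuneParam &tp, std::vector<std::string> &log) {
    if (tp.cache != CachePref::None) log.push_back(name());
  }
};

struct InteriorKernel : Kernel {
  std::string name() const override { return "interior"; }
};

struct PackKernel : Kernel {
  std::string name() const override { return "pack"; }
  // Deliberate no-op: packing inherits whatever the interior kernel set.
  void applyCacheConfig(const TuneParam &, std::vector<std::string> &) override {}
};

struct ExteriorKernel : Kernel {
  std::string name() const override { return "exterior"; }
  // Deliberate no-op, for the same reason as PackKernel.
  void applyCacheConfig(const TuneParam &, std::vector<std::string> &) override {}
};

// Walks one dslash application and returns the names of the kernels that
// actually changed the cache preference.
std::vector<std::string> runDslash(const TuneParam &tp) {
  std::vector<std::string> log;
  PackKernel pack;
  InteriorKernel interior;
  ExteriorKernel exterior;
  Kernel *sequence[] = {&pack, &interior, &exterior};
  for (Kernel *k : sequence) k->applyCacheConfig(tp, log);
  return log;
}
```

With this layout, a dslash run with `tp.cache` set to `CachePref::Shared` changes the preference once, in the interior kernel, so consecutive launches within the dslash never request conflicting preferences.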
Sounds like a plan.
Currently we set the cache configuration to 48K L1 and 16K shared (Fermi). However, this isn't optimal for all kernels, and the autotuner can actually switch the default cache configuration if it requests more than 16K of shared memory per SM.
The solution is to expand the TuneParam class to include a member variable of type enum cudaFuncCache, which will be tuned per kernel. This shouldn't be too much work; adding it to the 0.4.1 milestone.
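As a rough illustration of what per-kernel tuning of the cache preference could look like, the sketch below adds a cache member to a simplified `TuneParam` and tries each candidate with a caller-supplied timing function. Everything here is an assumption for illustration: the `block_size` field, the `tuneCachePref` helper, and the `CachePref` enum, which stands in for `cudaFuncCache` so the code builds without CUDA. The real TuneParam class and autotuner will differ.

```cpp
#include <array>
#include <functional>
#include <limits>

// Stand-in for CUDA's cudaFuncCache (assumption: mirrors
// cudaFuncCachePreferNone / PreferShared / PreferL1).
enum class CachePref { None, Shared, L1 };

// Simplified, hypothetical TuneParam with the proposed extra member.
struct TuneParam {
  int block_size = 64;                // placeholder for the existing tunables
  CachePref cache = CachePref::None;  // new: cache preference, tuned per kernel
};

// Tries every cache preference for one kernel and keeps the fastest.
// timeKernel is a hypothetical callback that launches the kernel with the
// given parameters and returns the elapsed time.
TuneParam tuneCachePref(const TuneParam &start,
                        const std::function<double(const TuneParam &)> &timeKernel) {
  const std::array<CachePref, 3> candidates = {CachePref::None, CachePref::Shared,
                                               CachePref::L1};
  TuneParam best = start;
  double best_time = std::numeric_limits<double>::max();
  for (CachePref c : candidates) {
    TuneParam trial = start;  // leave the other tunables untouched
    trial.cache = c;
    const double t = timeKernel(trial);
    if (t < best_time) {
      best_time = t;
      best = trial;
    }
  }
  return best;
}
```

In a CUDA build, the winning value would be passed to `cudaFuncSetCacheConfig` for that kernel before launch.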