Auto-tuning framework changes

The auto-tuning framework has to be modified for CUDA 6.5 to allow for compatibility with future GPUs. The present auto-tuner tests all launch configurations, regardless of whether they are valid or not. This seems to cause problems on an unreleased GPU, so I am going to modify the framework so that these invalid launches are skipped.

To do so, we will use the occupancy calculator API that is included with CUDA 6.5: the function cudaOccupancyMaxPotentialBlockSize returns, for a given kernel, the block size that achieves the best potential occupancy, taking into account the kernel's resource usage (e.g., the number of registers it consumes). For the moment, I think the easiest way to do this is to have every class derived from Tunable define a new method that computes this value. The tuner will then use it as an upper limit on the block size and ensure it is not exceeded. E.g., here's the code I presently use for blasKernel:

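 // Upper limit on the block size the tuner may try for this kernel.
 int maxThreadsPerBlock() const {
   int minGridSize, maxBlockSize;
   // Trailing arguments: no dynamic shared memory (0), no explicit block size limit (0).
   cudaOccupancyMaxPotentialBlockSize(&minGridSize, &maxBlockSize, blasKernel<FloatN,M,SpinorX,SpinorY,SpinorZ,SpinorW,Functor>, 0, 0);
   return maxBlockSize;
 }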
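
As a rough illustration of how the tuner could use this limit to skip invalid launches, here is a minimal host-only sketch. It is not the actual QUDA code: TunableSketch, FakeKernel and candidateBlockSizes are hypothetical names, the step and device maximum are assumed parameters, and a fixed number stands in for the value that cudaOccupancyMaxPotentialBlockSize would return.

```cpp
#include <vector>

// Hypothetical, simplified stand-in for a tunable kernel wrapper.
// This is NOT QUDA's actual Tunable interface.
struct TunableSketch {
  virtual ~TunableSketch() {}
  // Pure virtual, so every derived class must report its own limit.
  virtual int maxThreadsPerBlock() const = 0;
};

// Hypothetical derived class. A real class would query the occupancy API;
// here a fixed value stands in for that result (assumption).
struct FakeKernel : TunableSketch {
  int limit;
  explicit FakeKernel(int limit_) : limit(limit_) {}
  int maxThreadsPerBlock() const override { return limit; }
};

// Collect the block sizes the tuner would try: multiples of `step` up to
// `deviceMax`, dropping every size above the kernel's own limit.
std::vector<int> candidateBlockSizes(const TunableSketch &t, int step, int deviceMax) {
  std::vector<int> sizes;
  const int limit = t.maxThreadsPerBlock();
  for (int block = step; block <= deviceMax; block += step) {
    if (block > limit) break;  // all larger sizes exceed the limit as well
    sizes.push_back(block);
  }
  return sizes;
}
```

With a limit of 256, a step of 32 and a device maximum of 1024, candidateBlockSizes yields the eight sizes 32, 64, ..., 256, and the larger configurations are never launched. Making the method pure virtual in the base class is one possible way to have the compiler enforce that no derived class omits it.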

Next week I will make a global edit implementing this in all classes derived from Tunable, at which point all branches that relate to the quda-0.7 branch should be updated. From then on, this will be a design rule that all classes derived from Tunable have to obey.
