bjoo closed this issue 13 years ago
Hmm... I added a checkCudaError() after each kernel in blas_test, and the code now exits after this kernel: axpyZpbxCuda, with error = 1.873002e+00 (I note the error is really big, even for half precision).
Also, removing the checkCudaError() and running through all the half-precision tests, I see that caxpbypzYmbwCuda gives error = 1.000635e+00. The typical value for 'error' at half precision is O(1e-4) to O(1e-5).
This second problem you report is a bug you've introduced by removing the checkCudaError(): if the kernel fails to execute for any reason, it will produce the wrong answer, and with the failure check removed, that wrong answer is simply reported as the result, which is exactly what you are seeing.
What is the CUDA version and GPU you got this error on? I'm trying to reproduce it now.
Ok, I managed to reproduce the issue on my MacBook Pro, so it certainly wasn't a Fermi issue. I have just pushed changes which seem to fix it: the numerical error checks are now not done until after the tuning has successfully completed. This ensures that only working thread-block and grid-dimension parameters are used for the numerical checking.
Note, for ease of use, I've also changed the convention so that the user should input the full lattice dimensions, not the checkerboarded ones. This is noted in the code.
I just pulled master and tried to run 'make tune'. The default tests/blas_test.cu has

```cpp
// volume per GPU
const int LX = 12; // Has to be checkerboarded value... (so 24->12)
const int LY = 24;
const int LZ = 24;
const int LT = 24;
const int Nspin = 1;
```
and 'make tune' works fine. Changing to Nspin = 4 for Wilson fermions results in

```
Testing single precision... QUDA error: (CUDA) too many resources requested for launch (node 0, blas_test.cu:118)
```
I am now trying to get more information as to where the problem comes from (building with DEBUG_HOST and DEBUG_DEVICE). NB: I am building with -DMULTI_GPU -DOVERLAP_COMMS -DGPU_WILSON_DIRAC -DQMP_COMMS.
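For what it's worth, one way to see which kernel is exceeding the resource limit (my suggestion, not something already tried in this thread) is to ask ptxas to report per-kernel resource usage at compile time. The filename below is just the one mentioned above; substitute whichever .cu file holds the kernels, and add your usual -D flags:

```shell
# -Xptxas -v makes ptxas print registers, shared memory, and constant
# memory used per kernel; very high register counts are the usual cause
# of "too many resources requested for launch".
nvcc -Xptxas -v -c blas_test.cu
```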