lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

make tune fails for Ns=4 in version ab7ef49bbaed8c19f55f60d339aa13886cc58075 #12

Closed bjoo closed 13 years ago

bjoo commented 13 years ago

I just pulled the master and tried to run make tune. The default tests/blas_test.cu has

    // volume per GPU
    const int LX = 12; // Has to be checkerboarded value... (so 24->12)
    const int LY = 24;
    const int LZ = 24;
    const int LT = 24;
    const int Nspin = 1;

and 'make tune' works fine. Changing to Nspin = 4 for Wilson fermions results in

    Testing single precision...
    QUDA error: (CUDA) too many resources requested for launch (node 0, blas_test.cu:118)

I am now trying to get more information on where the problem comes from (building with DEBUG_HOST and DEBUG_DEVICE).

NB: I am building with -DMULTI_GPU -DOVERLAP_COMMS -DGPU_WILSON_DIRAC -DQMP_COMMS

bjoo commented 13 years ago

Hmm... I added a checkCudaError() after each kernel in blas_test, and the code now exits after this kernel:

    axpyZpbxCuda error = 1.873002e+00

(I note the error is really big, even for half precision.)

Also, after removing the checkCudaError() and running through all the half precision tests, I see that

    caxpbypzYmbwCuda error = 1.000635e+00

The typical value of 'error' for half precision is O(1e-4) to O(1e-5).

maddyscientist commented 13 years ago

This second problem you report is a bug you've introduced by removing the checkCudaError(). If the kernel fails to execute for any reason, then it will get the wrong answer. If you remove the failure check, that wrong answer is reported as if the kernel had run, which is what you are seeing.

What is the CUDA version and GPU you got this error on? I'm trying to reproduce it now.

maddyscientist commented 13 years ago

Ok, I managed to reproduce the issue on my MacBook Pro, so it certainly wasn't a Fermi issue. I have just pushed changes which seem to fix the issue: the numerical error checks are now deferred until after the tuning has successfully completed. This ensures that only working thread block and grid dimension parameters are used for the numerical checking.
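The ordering described above (tune first, check numerics afterwards) can be sketched roughly as follows. This is a host-only illustration and not QUDA's actual tuning code: `LaunchConfig`, `TrialResult`, `tune` and `mockTrial` are hypothetical names, and `mockTrial` merely pretends that any block larger than 256 threads fails to launch, standing in for the "too many resources requested for launch" error.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical launch configuration; names are illustrative, not QUDA's.
struct LaunchConfig {
    int threads;  // threads per block
    int blocks;   // blocks per grid
};

// Outcome of one trial launch: did it run, and how long did it take.
struct TrialResult {
    bool ok;
    double seconds;
};

// Stand-in for a real kernel launch (assumption for this sketch):
// blocks above 256 threads "fail", and larger working blocks run faster.
TrialResult mockTrial(const LaunchConfig& c) {
    assert(c.threads > 0);
    if (c.threads > 256) return {false, 0.0};
    return {true, 1.0 / c.threads};
}

// Try every candidate, skip the ones that fail to launch, and keep the
// fastest survivor in `best`. Returns false if no candidate worked.
bool tune(const std::vector<LaunchConfig>& candidates,
          const std::function<TrialResult(const LaunchConfig&)>& trial,
          LaunchConfig& best) {
    bool found = false;
    double bestTime = std::numeric_limits<double>::infinity();
    for (const LaunchConfig& c : candidates) {
        const TrialResult r = trial(c);
        if (!r.ok) continue;  // failed launch: never used for numerical checks
        if (r.seconds < bestTime) {
            bestTime = r.seconds;
            best = c;
            found = true;
        }
    }
    return found;
}
```

In this sketch the numerical comparison would only be run when `tune` returns true, and only with the configuration stored in `best`, so a failed launch can no longer show up as a large 'error' value.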

Note, for ease of use, I've also changed the convention such that the user should input the full lattice dimensions, not the checker-boarded ones. This is stated as such in the code.
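As a small illustration of the two conventions, the relation between the full and checkerboarded dimensions might look like this. The helper names are hypothetical and not taken from the QUDA source; the only thing assumed from the thread is that checkerboarding halves the X extent (24 -> 12).

```cpp
#include <cassert>

// Hypothetical helper, not part of QUDA: converts a full lattice extent
// in X to the per-parity (checkerboarded) extent, e.g. 24 -> 12.
int checkerboardedX(int fullX) {
    assert(fullX > 0 && fullX % 2 == 0);  // X must be even to split by parity
    return fullX / 2;
}

// Number of sites in one parity of a local volume given in FULL dimensions.
long sitesPerParity(int LX, int LY, int LZ, int LT) {
    return static_cast<long>(checkerboardedX(LX)) * LY * LZ * LT;
}
```

Under the new convention the user would enter LX = 24, and the halving to 12 would happen internally rather than in the user's input.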