lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Fix Blas Autotuning #11

Closed bjoo closed 12 years ago

bjoo commented 13 years ago

Hi, a user was trying to run QUDA and came across this error:

(CUDA) too many resources requested for launch (node 0, blas_quda.cu:929)

He was trying to run a 16^4 clover lattice on a single C2050 (but in multi-GPU mode, i.e. with wraparound QMP comms).

Probably the blas_params are not optimal: he probably did not run a make tune (the script I gave him did not include it), so he used the default blas_params.h file. The curious thing is that for me on qcd10g0310, also with C2050s, this error does not occur when I try to emulate what he is doing (I used exactly the same package for the build that I gave him).

However, I am using CUDA 3.0 and he's using 3.2.

I think in principle a make tune could fix his problem, but that makes automation really quite difficult. (One needs to know/edit the lattice size in blas_test, and has to run it interactively or submit a job to a compute node on systems where there is no GPU on the interactive node.)

Any ideas? Can it be done at runtime without having to wait the 15 minutes for the full BLAS tuning to go through like with make tune?

maddyscientist commented 13 years ago

This can be done at runtime, if we (1) only tune the kernels that we need, and (2) perform the tuning when the inverter is first created and keep the results resident after that.

This is something I'm thinking about, and I will work on it as soon as we have the multi-dim parallelization in shape.
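To make the two conditions concrete, here is a minimal host-side sketch in C++ of tune-on-first-use with resident results. It is not QUDA code: `LaunchParam`, `LazyTuner`, the `Timer` callback and the fallback value of 64 threads are all hypothetical names and assumptions chosen for illustration. The callback stands in for whatever actually launches and times a kernel.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical launch configuration; not QUDA's actual type.
struct LaunchParam {
    int threads;
    int blocks;
};

// Tunes a kernel the first time it is requested and keeps the result resident.
class LazyTuner {
public:
    // Caller-supplied stand-in for "launch and time this kernel". It returns
    // the elapsed time, or a negative value if the launch failed (for example
    // with "too many resources requested for launch").
    using Timer = std::function<double(const LaunchParam&)>;

    LaunchParam get(const std::string& kernel, std::size_t volume,
                    const std::vector<LaunchParam>& candidates,
                    const Timer& timeKernel) {
        const auto key = std::make_pair(kernel, volume);
        const auto hit = cache_.find(key);
        if (hit != cache_.end()) return hit->second;  // already tuned: no re-timing

        // Assumed conservative fallback, used if every candidate fails.
        LaunchParam best{64, 64};
        double bestTime = -1.0;
        for (const LaunchParam& c : candidates) {
            const double t = timeKernel(c);
            if (t < 0.0) continue;  // skip configurations that fail to launch
            if (bestTime < 0.0 || t < bestTime) {
                bestTime = t;
                best = c;
            }
        }
        cache_[key] = best;
        return best;
    }

    // Number of (kernel, volume) pairs tuned so far.
    std::size_t size() const { return cache_.size(); }

private:
    std::map<std::pair<std::string, std::size_t>, LaunchParam> cache_;
};
```

Only kernels that are actually requested get timed, and a second request for the same kernel and volume returns the stored result without calling the timer again.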

maddyscientist commented 13 years ago

A partial fix for this would be to add command line setting of the volumes and spin to blas_test.cu. Propagating this to "make tune" would enable much easier blas tuning, e.g.,

make tune 16 16 16 16 4

would perform a tuning run on a 16^4 lattice, for Nspin = 4.
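As a rough sketch of the argument handling this would need, the C++ below parses five positional values (X Y Z T Nspin) into a small struct. The names `TuneArgs`, `parsePositiveInt` and `parseTuneArgs` are hypothetical and do not exist in blas_test.cu; how the values would be forwarded from make to the test binary is left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the tuning request; not part of QUDA.
struct TuneArgs {
    int dims[4];  // lattice extent in X, Y, Z, T
    int nSpin;    // number of spin components

    long long volume() const {
        return 1LL * dims[0] * dims[1] * dims[2] * dims[3];
    }
};

// Parses a strictly positive integer, rejecting trailing junk such as "16x".
inline int parsePositiveInt(const std::string& s) {
    std::size_t pos = 0;
    const int value = std::stoi(s, &pos);  // throws on non-numeric input
    if (pos != s.size() || value <= 0) {
        throw std::invalid_argument("expected a positive integer: " + s);
    }
    return value;
}

// Expects exactly five values: X Y Z T Nspin.
inline TuneArgs parseTuneArgs(const std::vector<std::string>& args) {
    if (args.size() != 5) {
        throw std::invalid_argument("usage: X Y Z T Nspin");
    }
    TuneArgs out{};
    for (int i = 0; i < 4; ++i) out.dims[i] = parsePositiveInt(args[i]);
    out.nSpin = parsePositiveInt(args[4]);
    return out;
}
```

In a real test program the vector would be built from `argv + 1` up to `argv + argc`; for the example above the parsed volume is 16^4 = 65536 sites.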

maddyscientist commented 13 years ago

Ron has proposed that we create cached tuned blas files. If one runs at a certain volume that has already been tuned, then the cached parameters will be reused; otherwise some fallback parameters will be used that are guaranteed to work regardless of volume. This seems to me like an easy solution, and it will drastically reduce the number of "make tune" runs that are needed.
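A minimal sketch of that lookup-with-fallback behaviour, in C++, might look like the following. Everything here is an assumption made for illustration: the one-line-per-volume text format ("X Y Z T threads blocks"), the names `BlasParam` and `TunedCache`, and the particular fallback values. None of it reflects how QUDA actually stores tuned parameters.

```cpp
#include <array>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical tuned parameters for one volume; not QUDA's actual layout.
struct BlasParam {
    int threads;
    int blocks;
};

using Dims = std::array<int, 4>;  // X, Y, Z, T

class TunedCache {
public:
    // The fallback is assumed to be safe for any volume.
    explicit TunedCache(BlasParam fallback) : fallback_(fallback) {}

    // Reads lines of the assumed form "X Y Z T threads blocks".
    // Lines that do not parse are skipped.
    void load(std::istream& in) {
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            Dims d{};
            BlasParam p{};
            if (fields >> d[0] >> d[1] >> d[2] >> d[3] >> p.threads >> p.blocks) {
                table_[d] = p;
            }
        }
    }

    bool isTuned(const Dims& d) const { return table_.count(d) != 0; }

    // Returns cached parameters for a tuned volume, else the fallback.
    BlasParam lookup(const Dims& d) const {
        const auto it = table_.find(d);
        return it != table_.end() ? it->second : fallback_;
    }

private:
    BlasParam fallback_;
    std::map<Dims, BlasParam> table_;
};
```

With this shape, a volume that has never been tuned still runs (with the fallback), and a later tuning run only has to append one line to the cache file.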