lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Investigate using only high precision for the solution vector in CG #114

Closed maddyscientist closed 10 years ago

maddyscientist commented 11 years ago

A technique used by Mathias Wagner is to keep the solution vector in CG in high precision at all times. This is supposed to help convergence, and may be especially helpful with half precision:

e.g., x += alpha * p,

where x is always in high precision and p is kept in low precision.
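
A minimal host-side sketch of the idea (not QUDA's actual API; the `axpy_mixed` name and the use of `float` as a stand-in for the low-precision format are illustrative assumptions):

```cpp
#include <cstddef>
#include <vector>

// x (high precision) += alpha * p (low precision): the accumulation is done
// entirely in double, so rounding error in the stored search direction p does
// not contaminate the running solution vector x.
void axpy_mixed(double alpha, const std::vector<float> &p, std::vector<double> &x) {
  for (std::size_t i = 0; i < x.size(); i++) x[i] += alpha * static_cast<double>(p[i]);
}
```

In QUDA's real half format the spinor is stored as shorts plus a per-site norm array, but the principle is the same: only x needs to live in high precision.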

maddyscientist commented 11 years ago

I have tested this with staggered fermions, and I have found that keeping the solution vector in high precision largely cures the irregular convergence we have seen in mixed-precision CG with half precision, and brings the half-precision penalty down to the 10% level instead of the 20+% level observed previously. There is even a small but noticeable improvement for double-single solvers.

Actually deploying this algorithm in production requires some thought as to how to modify the blas kernels for mixed precision. It has been a conscious design choice to keep all the blas and reduce kernels uni-precision, with only the copy kernels mixing precisions; the motivation being to keep compilation times bearable. One solution would be to move certain key kernels from the blasKernel into the copyKernel, and keep the blasKernel uni-precision. Moving to a fully flexible mixed-precision model isn't feasible until we have JIT compilation.
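
For illustration, a standalone CUDA kernel along these lines might look as follows (this is not the QUDA blas framework, and it omits the norm array that QUDA's half format carries):

```cpp
#include <cuda_fp16.h>

// Mixed-precision axpy: the search direction x is read in half precision,
// while the accumulator y is kept in double.
__global__ void axpyHalfDouble(double a, const __half *x, double *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * static_cast<double>(__half2float(x[i]));
}
```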

maddyscientist commented 11 years ago

It would be interesting to repeat this experiment for multi-shift CG as well: this may make mixed-precision multi-shift CG perform much better for the shifted systems where the accumulated error prevents convergence beyond a certain point.

maddyscientist commented 11 years ago

I further note that having a high-precision accumulator will also help domain-wall and twisted-mass formulations that also have to use CG, and so this probably pushes this work further up in priority.

maddyscientist commented 11 years ago

I just spent some time looking into the best solution for multiple-precision blas kernels. I had wondered if I could quickly hack a custom mixed-precision blas generator for the subset of kernels needed, e.g., axpyZpbxCuda, axpyCuda, xpyCuda, axpyBzpcxCuda, caxpypczpwCuda, etc.

While this is possible, there are drawbacks:

  1. It's awkward, since the high-precision vector is a different argument number in each case (e.g., the 2nd argument for axpyCuda, but the last argument for caxpypczpwCuda)
  2. Every time we change the algorithm for a different precision combination we will have to manually extend this

I am therefore now thinking that we should deploy full mixed-precision support for all combinations in all kernels. To avoid the explosion in compilation time we will have to separate the blas_quda.cu kernels into separate files, so that each file can be built in parallel. This gives us maximum flexibility for deploying mixed-precision algorithms, and it would also solve the compilation-time problem for the reduction kernels (by splitting them into separate files as well).

Thus I propose that we create a lib/blas directory, which would contain all of the separated blas_quda.cu and reduce_quda.cu drivers.
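
One possible layout (the file names below are only placeholders for illustration):

```
lib/blas/
  axpy.cu            # one blas functor per file, instantiated only for the
  xpy.cu             #   precision combinations it actually needs
  axpy_zpbx.cu
  caxpypczpw.cu
  ...
  reduce/
    norm2.cu
    reduce_core.cuh  # shared templated kernel code included by each driver
```

Since each translation unit would then carry only a small slice of the instantiations, a parallel make could compile them all concurrently.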

What are your thoughts?

rbabich commented 11 years ago

Sounds like it might be a necessary evil. How many combinations are we talking about? If the generic blasCuda() allows 4 arguments, does that mean 3^4 = 81 combinations?

I wonder if we could put some intelligence into the structs where the device-side functors are defined. The idea would be to introduce (static const) members that specify the required precision combinations. You'd still have drawback #2, but adding a precision combination would be a simple one-line change.

This would be logically similar to the writeX (etc.) template parameters that you already have. In fact, as a first step, you might bring those parameters into the struct as well.
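
As a rough sketch of that first step (the struct below is simplified and hypothetical; element types are plain scalars here rather than QUDA's vectorized types):

```cpp
// Hypothetical axpby functor with the write flags folded in as static const
// members, instead of being passed as unlabeled "0,1,0,0"-style template
// arguments at the blasCuda() call site.
template <typename Float>
struct axpby_ {
  const Float a, b;
  axpby_(Float a, Float b) : a(a), b(b) {}

  // which of the four spinor arguments this functor writes
  static const int writeX = 0, writeY = 1, writeZ = 0, writeW = 0;

  template <typename T>
  __device__ void operator()(const T &x, T &y, const T &z, const T &w) const {
    y = a * x + b * y;
  }
};
```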

maddyscientist commented 11 years ago

I think it's even worse than that, because we build separate kernels for staggered and Wilson in float and half precision. But since we'd have all the blas kernels in separate files it wouldn't be that bad in terms of compilation time; e.g., compare to the current reduce_quda.cu file, which defines 17 different kernels, in 3 precisions, each using 32 different templated block sizes, which comes to 1632 distinct kernels compiled. A parallel build of the blas kernels really would not be that bad, though having a lot of memory would be a definite plus.

I'm not sure I follow your suggestion, though. By the structs do you mean the Spinor class? The write parameters are already folded into the Spinor class and do not appear explicitly in the generic kernel code. I think you mean something else, though; perhaps you can write down an example of what you mean?

rbabich commented 11 years ago

I mean this guy (e.g., for axpby):

https://github.com/lattice/quda/blob/cc3a0bfac4bc52e2dd40a6ed3091b7e0d5ab01d2/lib/blas_quda.cu#L74
https://github.com/lattice/quda/blob/cc3a0bfac4bc52e2dd40a6ed3091b7e0d5ab01d2/lib/blas_quda.cu#L84

The "0,1,0,0" business tells blasCuda() which spinors are used for output. I'm suggesting adding more template parameters that determine which precision combinations are needed (with representation TBD). To avoid adding more unlabeled template parameters to the blasCuda() invocation, you might make them static const members of axpby, etc., but that's a nicety.

I'm not sure this is worth the trouble, but just a thought...

maddyscientist commented 10 years ago

This is now implemented in QUDA 0.7.