lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Implement extended-precision arithmetic in the hisq link unitarization #137

Open jpfoley opened 10 years ago

jpfoley commented 10 years ago

At the moment, the calculation of the HISQ links in QUDA copies the approach taken in the MILC CPU code. Link unitarization requires the calculation of the inverse square root of the fattened link. This can be done using the Cayley-Hamilton theorem if the fat link is sufficiently well conditioned. For ill-conditioned links, the MILC code (and currently QUDA) switches to SVD. The problem with using SVD in QUDA is that it contains many branches, and if different threads call SVD independently the result is horrible thread divergence and serialisation. Note that this is probably only ever an issue on extremely rough configurations, but it is a potential performance killer.
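For reference, the branch-free piece looks something like this (a minimal sketch, not QUDA's actual code; the function name and the assumption that the eigenvalues g_i of Q = W†W are already known are mine). By Cayley-Hamilton, Q^{-1/2} = f0 I + f1 Q + f2 Q^2, with the coefficients fixed by the Vandermonde system f0 + f1 g_i + f2 g_i^2 = 1/sqrt(g_i):

```cpp
#include <cmath>

// Sketch only (not QUDA's implementation): given the eigenvalues g[0..2]
// of the Hermitian matrix Q = W^dag W, compute f[0..2] such that
//   Q^{-1/2} = f[0]*I + f[1]*Q + f[2]*Q^2.
// Newton divided differences solve the Vandermonde system
//   f[0] + f[1]*g_i + f[2]*g_i^2 = 1/sqrt(g_i).
void invsqrt_coeffs(const double g[3], double f[3]) {
  const double u0 = 1.0 / sqrt(g[0]);
  const double u1 = 1.0 / sqrt(g[1]);
  const double u2 = 1.0 / sqrt(g[2]);
  const double d01 = (u0 - u1) / (g[0] - g[1]);  // first divided differences
  const double d12 = (u1 - u2) / (g[1] - g[2]);
  f[2] = (d01 - d12) / (g[0] - g[2]);            // second divided difference
  f[1] = d01 - f[2] * (g[0] + g[1]);
  f[0] = u0 - g[0] * (f[1] + f[2] * g[0]);
  // The divisions blow up for tiny or (nearly) degenerate eigenvalues --
  // exactly the ill-conditioned case where precision is lost and the
  // MILC/QUDA code falls back to SVD.
}
```

Every step is straight-line arithmetic, so threads never diverge; the trouble is only the precision lost in those divisions.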

A better way of dealing with near-singular links would be to use Cayley-Hamilton with extended-precision arithmetic. Since the Cayley-Hamilton evaluation is branch-free and deterministic, we won't have thread divergence, and near-singular matrices could be handled by introducing a cutoff parameter that is smaller than double-precision machine epsilon.
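To make "extended precision" concrete, here is a minimal double-double sketch built from the standard Knuth/Dekker error-free transformations (illustrative names, not an existing QUDA or MILC API). A dd value is the unevaluated sum hi + lo, carrying roughly 32 significant decimal digits, so a cutoff below double-precision epsilon becomes representable:

```cpp
#include <cmath>

struct dd { double hi, lo; };  // unevaluated sum: value = hi + lo

// Knuth two-sum: returns {s, e} with s + e == a + b exactly.
inline dd two_sum(double a, double b) {
  double s = a + b;
  double v = s - a;
  return {s, (a - (s - v)) + (b - v)};
}

// FMA-based two-prod: returns {p, e} with p + e == a * b exactly.
inline dd two_prod(double a, double b) {
  double p = a * b;
  return {p, fma(a, b, -p)};
}

// Double-double addition (simple, sloppily normalized variant).
inline dd dd_add(dd a, dd b) {
  dd s = two_sum(a.hi, b.hi);
  return two_sum(s.hi, s.lo + a.lo + b.lo);
}

// Double-double multiplication.
inline dd dd_mul(dd a, dd b) {
  dd p = two_prod(a.hi, b.hi);
  return two_sum(p.hi, p.lo + a.hi * b.lo + a.lo * b.hi);
}
```

The same functions compile as CUDA device code, and the FMA in two_prod maps to a single hardware instruction on Fermi and later.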

Comments and suggestions?

maddyscientist commented 10 years ago

Sounds like a good idea. Getting double-double precision incorporated into QUDA is on my todo list, and this would be a nice, simple place to slot it in, I guess.

mathiaswagner commented 10 years ago

I like the idea. Justin, I have not looked it up, but do we need anything other than the basic operations and sqrt / rsqrt for Cayley-Hamilton? The 'double-double' package only implements these right now.
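If it helps: rsqrt can be bootstrapped from the basic operations alone. A sketch, assuming the dd type from above (a production double-double rsqrt would accumulate the Newton correction with the error-free transformations rather than in plain double as here):

```cpp
#include <cmath>

struct dd { double hi, lo; };  // as in the sketch above

// y ~= 1/sqrt(x): double-precision seed, one branch-free Newton step,
// then an FMA-computed residual supplies the low word.
inline dd dd_rsqrt(double x) {
  double y = 1.0 / sqrt(x);         // double-precision seed
  y = y * (1.5 - 0.5 * x * y * y);  // Newton step in plain double
  double e = -fma(x * y, y, -1.0);  // residual e = 1 - x*y*y, FMA avoids cancellation
  return {y, 0.5 * y * e};          // y*(1 + e/2) refines to ~dd accuracy
}
```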

Apart from that, do you have any numbers on the performance impact? I had considered the projection part rather negligible. How do you implement it in QUDA?

I just run the normal inverse-sqrt code for all links and store a boolean flag, in an array with one entry per link, indicating whether SVD is needed for that link. We typically see only a few links that need SVD (< 1%). I then launch an SVD kernel that returns immediately if the SVD flag is false for that link. Assuming that is the case for all threads in most warps, it does not hurt too badly.

Another idea is to use an atomic increment to get a unique index for each link that needs SVD: store the link index at the atomic index, and then you only have to run a kernel with exactly the threads that actually need SVD (see the sketch below). That probably won't perform great either, since the number of threads will typically be very small.
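A sketch of that atomic-compaction idea, with invented names (Link, needs_svd, and svd_unitarize are placeholders for illustration, not QUDA code):

```cuda
#include <cuda_runtime.h>

struct Link { double2 m[9]; };  // placeholder for a 3x3 complex matrix

// Assumed helpers, trivial stubs for illustration only:
__device__ bool needs_svd(const Link &L) { return false; }  // conditioning test
__device__ Link svd_unitarize(const Link &L) { return L; }  // SVD projection

// Pass 1: flag ill-conditioned links and compact their indices with an
// atomic counter.
__global__ void flag_and_compact(const Link *fat, int n,
                                 int *svd_indices, int *svd_count) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (needs_svd(fat[i])) {
    int slot = atomicAdd(svd_count, 1);  // unique slot in the compacted list
    svd_indices[slot] = i;
  }
}

// Pass 2: run the branchy SVD only over the compacted list, so divergence
// is confined to the few threads that genuinely need it.
__global__ void svd_fixup(const Link *fat, Link *unit,
                          const int *svd_indices, const int *svd_count) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= *svd_count) return;
  int i = svd_indices[j];
  unit[i] = svd_unitarize(fat[i]);
}
```

The host would read back *svd_count to size the second launch (or launch an upper bound and let excess threads exit, as guarded here); as noted, with < 1% of links needing SVD, the kernel-launch overhead may dominate either way.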

So, again, I am not sure whether the cost of using double-double precision for all links (especially on Fermi, with only 63 registers per thread) is smaller than that of launching a separate SVD kernel for a usually small number of links. I would suggest getting some performance numbers for the current approach (including a bad case with a lot of SVD required) and then implementing Cayley-Hamilton in plain double precision. The latter is probably not accurate enough for actual use, but it gives you a performance estimate for Cayley-Hamilton in extended precision; assuming a factor of 2 slowdown relative to native double precision should be reasonable.

Still, the general possibility of extended precision is appealing; I am just not sure whether the performance of the U(3) projection is worth the pain.

detar commented 10 years ago

Hi Justin,

Is Alexei Bazavov on your GitHub list? He is the expert and can comment on your strategy.

Best, Carleton
