lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Optimize link unitarization #350

Open maddyscientist opened 8 years ago

maddyscientist commented 8 years ago

The link unitarization has one of the largest performance regressions when running on Maxwell, which doesn't have fast double-precision capability. Benchmarking reveals that the bulk of the time is spent in the reciprocalRoot function.

  // Also modify q if the eigenvalues are dangerously small.                                                                                                                                                                                                                  
  template<class Cmplx>
  __device__  __host__
  bool reciprocalRoot(const Matrix<Cmplx,3>& q, Matrix<Cmplx,3>* res){

    Matrix<Cmplx,3> qsq, tempq;

    typename RealTypeId<Cmplx>::Type c[3];
    typename RealTypeId<Cmplx>::Type g[3];

    const typename RealTypeId<Cmplx>::Type one_third = 0.333333333333333333333;
    const typename RealTypeId<Cmplx>::Type one_ninth = 0.111111111111111111111;
    const typename RealTypeId<Cmplx>::Type one_eighteenth = 0.055555555555555555555;

    qsq = q*q;
    tempq = qsq*q;

    c[0] = getTrace(q).x;
    c[1] = getTrace(qsq).x * 0.5;
    c[2] = getTrace(tempq).x * one_third;

    g[0] = g[1] = g[2] = c[0] * one_third;
    typename RealTypeId<Cmplx>::Type r,s,theta;
    s = c[1]*one_third - c[0]*c[0]*one_eighteenth;

    typename RealTypeId<Cmplx>::Type cosTheta;
    if(fabs(s) >= FL_UNITARIZE_EPS){ // faster when this conditional is removed?                                                                                                                                                                                              
      const typename RealTypeId<Cmplx>::Type rsqrt_s = rsqrt(s);
      r = c[2]*0.5 - (c[0]*one_third)*(c[1] - c[0]*c[0]*one_ninth);
      cosTheta = r*rsqrt_s*rsqrt_s*rsqrt_s;

      if(fabs(cosTheta) >= 1.0){
        theta = (r > 0) ? 0.0 : FL_UNITARIZE_PI;
      }else{
        theta = acos(cosTheta); // this is the primary performance limiter                                                                                                                                                                                                    
      }

      const typename RealTypeId<Cmplx>::Type sqrt_s = s*rsqrt_s;

      typename RealTypeId<Cmplx>::Type as, ac;
      sincos( theta*one_third, &as, &ac );
      g[0] = c[0]*one_third + 2*sqrt_s*ac;
      g[1] = c[0]*one_third - 2*sqrt_s*(0.5*ac + as*0.8660254037844386467637);
      g[2] = c[0]*one_third + 2*sqrt_s*(-0.5*ac + as*0.8660254037844386467637);

    }

    // Check the eigenvalues, if the determinant does not match the product of the eigenvalues                                                                                                                                                                                
    // return false. Then call SVD instead.                                                                                                                                                                                                                                   
    typename RealTypeId<Cmplx>::Type det = getDeterminant(q).x;
    if( fabs(det) < FL_REUNIT_SVD_ABS_ERROR ) return false;
    if( checkRelativeError(g[0]*g[1]*g[2],det,FL_REUNIT_SVD_REL_ERROR) == false ) return false;

    // At this point we have finished with the c's                                                                                                                                                                                                                            
    // use these to store sqrt(g)                                                                                                                                                                                                                                             
    for(int i=0; i<3; ++i) c[i] = sqrt(g[i]);

    // done with the g's, use these to store u, v, w                                                                                                                                                                                                                          
    g[0] = c[0]+c[1]+c[2];
    g[1] = c[0]*c[1] + c[0]*c[2] + c[1]*c[2];
    g[2] = c[0]*c[1]*c[2];

    const typename RealTypeId<Cmplx>::Type denominator = 1.0 / ( g[2]*(g[0]*g[1]-g[2]) );
    c[0] = (g[0]*g[1]*g[1] - g[2]*(g[0]*g[0]+g[1])) * denominator;
    c[1] = (-g[0]*g[0]*g[0] - g[2] + 2.*g[0]*g[1]) * denominator;
    c[2] =  g[0] * denominator;

    tempq = c[1]*q + c[2]*qsq;
    // Add a real scalar                                                                                                                                                                                                                                                      
    tempq(0,0).x += c[0];
    tempq(1,1).x += c[0];
    tempq(2,2).x += c[0];

    *res = tempq;

    return true;
  }

The above is an experimental version I have that replaces the 3 cos calls with a single call to sincos. The performance is dominated by the acos call: on GM200, performance doubles (130 -> 260 GFLOPS) if the acos call is removed.

It would be nice if we had an alternative algorithm or reformulation that simplified this computation. Link unitarization can otherwise be a bottleneck on Maxwell GPUs when running HISQ link generation.
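For reference, the computation in reciprocalRoot is the trigonometric closed form for the eigenvalues of q, restated here in the code's own c/g notation (this is just what the code already does, written as math):

```latex
s = \frac{c_1}{3} - \frac{c_0^2}{18}, \qquad
r = \frac{c_2}{2} - \frac{c_0}{3}\left(c_1 - \frac{c_0^2}{9}\right), \qquad
\cos\theta = \frac{r}{s^{3/2}}, \\[4pt]
g_k = \frac{c_0}{3} + 2\sqrt{s}\,\cos\!\left(\frac{\theta + 2\pi k}{3}\right), \qquad k = 0, 1, 2.
```

The acos and sincos exist only to recover θ/3 from cosθ, which is why any reformulation that gets at the roots g_k without passing through θ explicitly would remove the expensive call.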

AlexVaq commented 8 years ago

Is this a projection back to su(3) after some smearing? In the APE smearing code I was using an iterative method that maybe you find easier to optimize (although I think it will turn out the other way, that you can optimize APE smearing with that code).


mathiaswagner commented 8 years ago

Yes, this is a projection to U(3).

AlexVaq commented 8 years ago

Well, I project to SU(3), but if you want to have a look at the code, it is in gauge_ape.cu, in the polarSu3 function or something like that.

Don't expect great performance; when I wrote this I was mostly interested in the functionality, but it might be optimizable...
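The polarSu3 code itself isn't quoted in this thread, but the iterative approach described presumably resembles a Newton iteration for the unitary polar factor (the nearest unitary matrix). The sketch below is written under that assumption; the names and structure are illustrative, not QUDA's API:

```cpp
#include <array>
#include <complex>

using C  = std::complex<double>;
using M3 = std::array<std::array<C, 3>, 3>;

// 3x3 inverse via the adjugate (cofactor) formula.
M3 inverse(const M3& a) {
  auto cof = [&](int r, int c) {  // signed cofactor via cyclic index rotation
    int r0 = (r + 1) % 3, r1 = (r + 2) % 3;
    int c0 = (c + 1) % 3, c1 = (c + 2) % 3;
    return a[r0][c0] * a[r1][c1] - a[r0][c1] * a[r1][c0];
  };
  C det = a[0][0] * cof(0, 0) + a[0][1] * cof(0, 1) + a[0][2] * cof(0, 2);
  M3 inv;
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      inv[i][j] = cof(j, i) / det;  // adjugate is the transposed cofactor matrix
  return inv;
}

// Newton iteration for the unitary polar factor: X <- (X + (X^dagger)^{-1}) / 2.
// Converges to the nearest unitary matrix for nonsingular X.
M3 polar_unitary(M3 x, int iters = 20) {
  for (int k = 0; k < iters; ++k) {
    M3 xdag;
    for (int i = 0; i < 3; ++i)
      for (int j = 0; j < 3; ++j)
        xdag[i][j] = std::conj(x[j][i]);
    M3 inv = inverse(xdag);
    for (int i = 0; i < 3; ++i)
      for (int j = 0; j < 3; ++j)
        x[i][j] = 0.5 * (x[i][j] + inv[i][j]);
  }
  return x;
}
```

Convergence is quadratic once X is close to unitary, so for smeared links that are already near U(3) a handful of iterations should suffice; whether this beats the analytic route on Maxwell is exactly the open question here.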


mathiaswagner commented 8 years ago

The formulas in the code are given, e.g., in https://inspirehep.net/record/850762?ln=en, and this is what comes from the MILC code. I am not quite sure, but there might also be alternative implementations of the U(3) projection in the MILC code. We can check there.

maddyscientist commented 8 years ago

From the MILC paper, one can see that the acos call comes from the analytic solution of the cubic characteristic equation. I suspect it would be much faster if we replaced this analytic solution with a Newton-Raphson iteration. Being iterative, this would allow the first few iterations to be done in single precision, with the initial guess given by the analytic solution computed in single precision.

Thoughts?

mathiaswagner commented 8 years ago

That might be worth trying. I know that the performance of the projection is quite bad on Maxwell, but in the overall picture it should hopefully still be fast... How large is the fraction spent in the projection compared to smearing (not even talking about PCIe transfers and data reordering...)? I don't have any reliable timings for this at hand.

maddyscientist commented 8 years ago

I should have benchmarks on this later today.

maddyscientist commented 8 years ago

Also, I believe an iterative solution to the characteristic equation will be faster on all GPUs, since it would allow us to use the hardware transcendentals for sincos. Even on fast-DP GPUs, double-precision transcendentals are implemented with nasty branchy code that uses local memory.

mathiaswagner commented 8 years ago

From my MILC RHMC runs on K20s:

      computeKSLinkQuda Total time = 14.7233 secs
              download     = 3.181779 secs (  21.6%), with       12 calls at 2.651482e+05 us per call
                upload     = 7.077134 secs (  48.1%), with       24 calls at 2.948806e+05 us per call
                  init     = 0.028404 secs ( 0.193%), with       42 calls at 6.762857e+02 us per call
               compute     = 3.060089 secs (  20.8%), with       18 calls at 1.700049e+05 us per call
                  free     = 0.019482 secs ( 0.132%), with       24 calls at 8.117500e+02 us per call
           comms start     = 1.330893 secs (  9.04%), with       12 calls at 1.109077e+05 us per call
                 total     = 0.024748 secs ( 0.168%), with       12 calls at 2.062333e+03 us per call
     total accounted       = 14.722529 secs (   100%)
     total missing         = 0.000743 secs (0.00505%)

and total

      initQuda-endQuda Total time = 1314.63 secs

                   QUDA Total time = 1165.26 secs
              download     = 77.983407 secs (  6.69%), with      418 calls at 1.865632e+05 us per call
                upload     = 17.576121 secs (  1.51%), with      138 calls at 1.273632e+05 us per call
                  init     = 9.166846 secs ( 0.787%), with      683 calls at 1.342144e+04 us per call
              preamble     = 0.000179 secs (1.54e-05%), with      295 calls at 6.067797e-01 us per call
               compute     = 1047.716922 secs (  89.9%), with      631 calls at 1.660407e+06 us per call
              epilogue     = 4.409482 secs ( 0.378%), with      295 calls at 1.494740e+04 us per call
                  free     = 0.424351 secs (0.0364%), with      471 calls at 9.009575e+02 us per call

And from a profiling output for the HISQ smearing (just the first level of smearing that includes the projection). I hope I got that right from just looking in the profiler.

[Profiler screenshot: screen shot 2015-08-12 at 15 38 15]

But this again was on Kepler (K40 this time)

mathiaswagner commented 8 years ago

But I don't want to discourage you from looking at this.