lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Clover and Stouting Kernels #61

Closed bjoo closed 9 years ago

bjoo commented 12 years ago

Hi All, Mike asked me to point out what needs to be done for the clover program to work correctly on GPUs -- basically, which pieces are 'outside of QDP++' such that QDP-JIT cannot handle them.

There are essentially two source files in Chroma that need to be reworked, as far as I am aware:

one is

chroma/lib/actions/ferm/linop/clover_term_qdp_w.h (http://git.jlab.org/cgi-bin/gitweb/gitweb.cgi?p=chroma.git;a=blob;f=lib/actions/ferm/linop/clover_term_qdp_w.h;h=8f80c954b77146d271c469dba4704a72953dde8d;hb=HEAD)

the other is

chroma/lib/util/gauge/stout_utils.cc (http://git.jlab.org/cgi-bin/gitweb/gitweb.cgi?p=chroma.git;a=blob;f=lib/util/gauge/stout_utils.cc;h=c605f4e7cb3479d87a2965730bd6c426becf2546;hb=HEAD)

The first one is the actual clover term implementation, the second one implements utilities used in stout smearing

In clover_term_qdp_w.h we have: a) a data structure for the clover term (template struct PrimitiveClovTriang) which stores the clover term in 2 blocks; each block has a real diagonal part and a complex off-diagonal part:

```cpp
template<typename R>
struct PrimitiveClovTriang {
  RScalar<R>  diag[2][2*Nc];
  RComplex<R> offd[2][2*Nc*Nc - Nc];
};
```

For current QUDA use we also have another structure:

```cpp
template<typename R>
struct QUDAPackedClovSite {
  R diag1[6];
  R offDiag1[15][2];
  R diag2[6];
  R offDiag2[15][2];
};
```

which packs a site in the order expected by loadClover() in the QUDA lib. For a QUDA reimplementation of this file (clover_term_quda.h?) we could deprecate PrimitiveClovTriang and work immediately in the QUDAPackedClovSite format.
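To make the repacking step concrete, here is a minimal sketch of copying one site from the PrimitiveClovTriang layout into the QUDAPackedClovSite layout. It assumes the triangular off-diagonal entries are already in the order loadClover() expects (the concrete reordering lives in Chroma's packing loop); the fixed Nc = 3, the use of float, and the packSite name are illustrative only:

```cpp
// Illustrative stand-ins for the Chroma templates above (hypothetical:
// fixed to Nc = 3 and real type float; RComplex modeled as float[2]).
constexpr int Nc = 3;

struct PrimitiveClovTriang {
  float diag[2][2 * Nc];               // 6 real diagonal entries per chiral block
  float offd[2][2 * Nc * Nc - Nc][2];  // 15 complex off-diagonal entries per block
};

struct QUDAPackedClovSite {
  float diag1[6];
  float offDiag1[15][2];
  float diag2[6];
  float offDiag2[15][2];
};

// One site: copy block 0 then block 1, diagonal first, then the packed
// lower-triangular off-diagonal part (re/im interleaved), assuming both
// layouts use the same triangular ordering.
inline void packSite(const PrimitiveClovTriang& in, QUDAPackedClovSite& out)
{
  for (int i = 0; i < 6; ++i)  out.diag1[i] = in.diag[0][i];
  for (int i = 0; i < 15; ++i) {
    out.offDiag1[i][0] = in.offd[0][i][0];
    out.offDiag1[i][1] = in.offd[0][i][1];
  }
  for (int i = 0; i < 6; ++i)  out.diag2[i] = in.diag[1][i];
  for (int i = 0; i < 15; ++i) {
    out.offDiag2[i][0] = in.offd[1][i][0];
    out.offDiag2[i][1] = in.offd[1][i][1];
  }
}
```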

The class QDPCloverTermT<T,U> is templated on the lattice fermion type T and the gauge field type U -- T is something like LatticeDiracFermionF3, LatticeFermion, LatticeFermionD, etc., and U is LatticeColorMatrix, LatticeColorMatrixF, LatticeColorMatrixD, etc.

It implements variants of the following functions:

- void create(...) -- create the clover term, fill out the PrimitiveClovTriang struct, etc.
- void choles() -- perform an inversion of the clover term using a Cholesky (L^\dagger L) decomposition
- void cholesDet() -- compute the determinant of the clover term using the Cholesky decomposition
- void apply() -- apply the clover term to a fermion
- void triacntr() -- compute Tr( gamma_matrix * Clover ) -- needed in the forces

protected:

- void makeClov() -- given the field strength tensor F and a mass, fill out the PrimitiveClovTriang term -- called by create()
- void chlclovms() -- the actual inversion by Cholesky
- void ldagdlinv() -- inversion of the clover term by an L^\dagger D L factorization (an alternative to Cholesky)

Essentially all of these codes have been threaded already, and so have been separated into an 'argument structure' and a 'siteLoop' function, usually enclosed in a separate namespace.

e.g. in the file clover_term_qdp_w.h one has

```cpp
namespace QDPCloverEnv {

  template<typename U>
  struct QDPCloverMakeClovArg {
    typedef typename WordType<U>::Type_t REALT;
    typedef OScalar< PScalar< PScalar< RScalar<REALT> > > > RealT;
    const RealT& diag_mass;
    const U& f0;
    const U& f1;
    const U& f2;
    const U& f3;
    const U& f4;
    const U& f5;
    multi1d< PrimitiveClovTriang<REALT> >& tri;
  };

  /* This is the extracted site loop for makeClover */
  template<typename U>
  inline void makeClovSiteLoop(int lo, int hi, int myId, QDPCloverMakeClovArg<U>* a)
  {
    typedef typename QDPCloverMakeClovArg<U>::RealT RealT;
    typedef typename QDPCloverMakeClovArg<U>::REALT REALT;

    const RealT& diag_mass = a->diag_mass;
    const U& f0 = a->f0;
    const U& f1 = a->f1;
    const U& f2 = a->f2;
    const U& f3 = a->f3;
    const U& f4 = a->f4;
    const U& f5 = a->f5;
    multi1d< PrimitiveClovTriang<REALT> >& tri = a->tri;

    // SITE LOOP STARTS HERE
    for (int site = lo; site < hi; ++site) {
      /* Construct diagonal */
      ...
    } // end site loop
  } // end function

} // namespace QDPCloverEnv
```

The function makeClov then just does the dispatch:

```cpp
/* This now just sets up and dispatches... */
template<typename T, typename U>
void QDPCloverTermT<T,U>::makeClov(const multi1d<U>& f, const RealT& diag_mass)
{
  START_CODE();

  if (Nd != 4) {
    QDPIO::cerr << __func__ << ": expecting Nd==4" << endl;
    QDP_abort(1);
  }

  if (Ns != 4) {
    QDPIO::cerr << __func__ << ": expecting Ns==4" << endl;
    QDP_abort(1);
  }

  // Set up the inputs of the argument structure
  U f0 = f[0] * getCloverCoeff(0,1);
  U f1 = f[1] * getCloverCoeff(0,2);
  U f2 = f[2] * getCloverCoeff(0,3);
  U f3 = f[3] * getCloverCoeff(1,2);
  U f4 = f[4] * getCloverCoeff(1,3);
  U f5 = f[5] * getCloverCoeff(2,3);

  const int nodeSites = QDP::Layout::sitesOnNode();

  tri.resize(nodeSites); // hold local lattice

  // Set up the argument structure itself
  QDPCloverEnv::QDPCloverMakeClovArg<U> arg = {diag_mass, f0, f1, f2, f3, f4, f5, tri};

  // Dispatch to the QMT/OpenMP threading
  dispatch_to_threads(nodeSites, arg, QDPCloverEnv::makeClovSiteLoop<U>);

  END_CODE();
}
```

In principle this should be straightforward to QUDA-ise, since essentially the site loops would need to be turned into kernels (each one now working on just one site). The downside, though, is that these kernels currently use QDP++/Chroma types (like U = LatticeColorMatrix), which would probably pollute QUDA excessively. Personally, I still think it is best to leave these kernels in Chroma.
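To make the site-loop-to-kernel transformation concrete, here is a sketch in plain C++ with placeholder types (not QDP++ or QUDA ones): the [lo, hi) chunk a CPU worker thread owns becomes a per-site function, which in CUDA would be a __global__ kernel indexed by blockIdx/threadIdx.

```cpp
#include <vector>

// Placeholder for the per-site clover storage; not a Chroma/QUDA type.
struct CloverSite { double diag[2][6]; double offd[2][15][2]; };

// CPU version: the threaded site loop, one [lo, hi) chunk per worker.
void makeClovSiteLoop(int lo, int hi, std::vector<CloverSite>& tri)
{
  for (int site = lo; site < hi; ++site) {
    // ... construct the clover term for `site` ...
    tri[site].diag[0][0] = 1.0;  // stand-in for the real computation
    tri[site].diag[1][0] = 1.0;
  }
}

// GPU-style version: the loop body becomes the kernel and the loop index
// becomes the thread index. Written as a plain function here; in CUDA
// this would be a __global__ kernel with
//   site = blockIdx.x * blockDim.x + threadIdx.x
void makeClovKernel(int site, int nSites, std::vector<CloverSite>& tri)
{
  if (site >= nSites) return;  // bounds guard, as in a real kernel
  tri[site].diag[0][0] = 1.0;
  tri[site].diag[1][0] = 1.0;
}
```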

That having been said, I see the following 'siteLoops' in clover_term_qdp_w.h:

- QDPCloverMakeClovArg + makeClovSiteLoop + dispatch (lines 360-516)
- LDagDLInvArgs + LDagDLInvSiteLoop + dispatch (lines 570-804)
- cholesSiteLoop + dispatch (reusing LDagDLInvArgs) (lines 817-991)
- TriaCntrArgs + triaCntrSiteLoop + dispatch (lines 1154-1499)
- ApplyArgs + applySiteLoop + dispatch (lines 1502-2088) -- but a lot of it is ifdef-ed out

Then finally there is a site loop that repacks the PrimitiveClovTriang into the QUDAPackedClovSite form. If the new versions just used the QUDAPackedClovSite form by default, this could be eliminated. Currently this is:

- QUDAPackArgs + qudaPackSiteLoop + dispatch (lines 2147-2158)

In terms of stout links there is only one 'site loop' that I can see:

- GetFsandBsArgs + getFsAndBsSiteLoop + dispatch (lib/util/gauge/stout_utils.cc, lines 350-847)

and that's it for now. There may be other dirty bits; if these prove to be awkward we'd have to work on them on a case-by-case basis.

Best, B

maddyscientist commented 12 years ago

A thought on the best way to implement the clover term: this is likely to be a very register-bound kernel, since the clover term consists of 72 reals per site. It can't really be split between chiral components, since the required parallel transports are the same for all spin components.

I would suggest that the clover term is computed in two steps:

  1. Compute the plaquette field. This is where all the expensive matrix multiplication takes place, and since it only involves 3x3 complex matrix multiplication it will be less register bound.
  2. Assemble the clover term by adding up the required plaquettes at a given site. This step could possibly be parallelized over chirality?

Regarding Balint's comment above as to whether QUDA would be polluted: this will not be the case at all. There will be no QDP++ types in QUDA, since all we mean to do here is replace the functionality of these computations with an interface call.

maddyscientist commented 12 years ago

Actually, a more sane approach would be:

  1. Compute the Fmunu field.
  2. Assemble clover term from Fmunu.
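As a sketch of step 1, the per-site projection at the end of the usual clover-leaf construction of Fmunu might look as follows. The 1/8 normalization and the fieldStrengthFromLeaves name are assumptions for illustration (conventions differ between codes); Q stands for the already-computed sum of the four plaquette "leaves" in the (mu, nu) plane at a site:

```cpp
#include <array>
#include <complex>

using cplx = std::complex<double>;
using Mat3 = std::array<std::array<cplx, 3>, 3>;

// Given Q = sum of the four plaquette leaves in the (mu, nu) plane,
// project onto the traceless anti-hermitian part to obtain F_munu.
// This is a sketch of one common convention, not QUDA's routine.
Mat3 fieldStrengthFromLeaves(const Mat3& Q)
{
  Mat3 F{};
  // anti-hermitian part: (Q - Q^dag)/2, with 1/4 from averaging the
  // four leaves, combined into the conventional overall 1/8
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      F[i][j] = (Q[i][j] - std::conj(Q[j][i])) / 8.0;
  // subtract the trace so F lives in su(3)
  cplx tr = (F[0][0] + F[1][1] + F[2][2]) / 3.0;
  for (int i = 0; i < 3; ++i) F[i][i] -= tr;
  return F;
}
```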

The comments in CPS actually make for very easy reading of how to implement the clover term:

```
//    Fermion action S = Sum_x_y { phi(x)^bar M_xy phi(y) }
//       where M_xy = I - kappa D,     for Wilson fermions
//                  = A - kappa D,     for Clover improved Wilson fermions
//       where A is local and we adopt the same definition as in
//       Xiang-Qian Luo's paper.
//
//    A = I - kappa * Csw / 2 * Sum_u<v { [gamma_u, gamma_v] * Fuv }
//       where Fuv (here) = i Fuv (Luo),  and we choose
//             v = 0..3 as x,y,z,t
//             gamma_x_y_z = 0  sigma_x_y_z         gamma_t = 0  I
//                        -sigma_x_y_z  0                     I  0
//             sigma_x = 0 i    sigma_y = 0 -1     sigma_z = i  0
//                       i 0              1  0               0 -i
//             [sigma_i, sigma_j] = 2 epsilon_i_j_k sigma_k
//   ===>      [gamma_i, gamma_j] =-2 epsilon_i_j_k (sigma_k    0)
//                                                  (0    sigma_k)
//             [gamma_i, gamma_t] = 2               (sigma_i    0)
//                                                  (0   -sigma_i)
//    A = I + kappa * Csw * (termB - termE    0            )
//                          (0               termB + termE )
//       where termB =  epsilon_ijk sigma_k Fij
//                   =  sigma_z Fxy + sigma_x Fyz - sigma_y Fxz
//                   = (  i Fxy                i Fyz + Fxz )
//                     (  i Fyz - Fxz        - i Fxy       )
//                   = (  i F01                i F12 + F02 )
//                     (  i F12 - F02        - i F01       )
//                   = (  i F01          ___  dagger       )
//                     (  i F12 - F02 __/    - i F01       )
//             termE = sigma_x Fxt + sigma_y Fyt + sigma_z Fzt
//                   = ( i Fzt            i Fxt - Fyt )
//                     ( i Fxt + Fyt      - i Fzt     )
//                   = ( i F23            dagger      )
//                     ( i F03 + F13      - i F23     )
//    A = A0   A1^dag  0    0
//        A1   A2      0    0
//        0    0      A3    A4^dag
//        0    0      A4    A5
//        where  A0 = I + kappa * Csw * i(F01 - F23)
//               A1 =     kappa * Csw * [(iF12-F02) - (iF03+F13)]
//               A2 = I - kappa * Csw * i(F01 - F23)
//               A3 = I + kappa * Csw * i(F01 + F23)
//               A4 =     kappa * Csw * [(iF12-F02) + (iF03+F13)]
//               A5 = I - kappa * Csw * i(F01 + F23)
```

How does this form compare to that used in Chroma and QOP?
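The A0..A5 assembly from the CPS comment can be written down directly. In this sketch each F_munu is reduced to a single complex number in place of a 3x3 colour matrix, so only the spin/block algebra is shown (assembleClover and CloverBlocks are hypothetical names):

```cpp
#include <complex>

using cplx = std::complex<double>;

// The six clover blocks from the CPS comment, with each F_munu reduced
// to one complex number (one colour component) purely for illustration.
struct CloverBlocks { cplx A0, A1, A2, A3, A4, A5; };

CloverBlocks assembleClover(double kappa, double Csw,
                            cplx F01, cplx F02, cplx F03,
                            cplx F12, cplx F13, cplx F23)
{
  const cplx i(0.0, 1.0);
  const double k = kappa * Csw;
  CloverBlocks b;
  b.A0 = 1.0 + k * i * (F01 - F23);
  b.A1 =       k * ((i * F12 - F02) - (i * F03 + F13));
  b.A2 = 1.0 - k * i * (F01 - F23);
  b.A3 = 1.0 + k * i * (F01 + F23);
  b.A4 =       k * ((i * F12 - F02) + (i * F03 + F13));
  b.A5 = 1.0 - k * i * (F01 + F23);
  return b;
}
```

Note that A0 + A2 = 2I and A3 + A5 = 2I by construction, which is a cheap sanity check on any implementation.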

rbabich commented 12 years ago

The QOPQDP implementation is in qopqdp-0.17.6/lib/wilson_dslash_p.c. It was written by Bugra, so I'm assuming that this is the same code as in MILC.

The function f_mu_nu() computes a single (mu, nu) component of F, using a total of 10 temporary color matrices in the process. It gets called a total of 6 times (once for each unique component) by get_clov(), which keeps around 5 temps.

Both functions are well-commented, but since they're written using QDP shifts and (probably as a result) a lot of temps, a direct port would perform terribly.

maddyscientist commented 12 years ago

For Fmunu I think this will be straightforward:

  1. First exchange the ghost zones for the gauge field, as is done for the fat-link and force terms.
  2. Compute the clover term on local sites only, but using the extended-volume gauge field. This makes the indexing very easy.
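A sketch of why the extended-volume field makes indexing easy: once a halo of depth R has been exchanged, a neighbour in any direction is a plain coordinate offset into the enlarged box, with no wrap-around or ghost-buffer special cases. The dimensions, R = 1, and the extendedIndex name are illustrative only:

```cpp
#include <array>

// Illustrative local lattice and halo depth (not QUDA's actual layout).
constexpr int R = 1;                            // halo depth
constexpr std::array<int, 4> X = {4, 4, 4, 8};  // local lattice dims

// Linear index into the extended (X[mu] + 2R)^4 box for local
// coordinates x[] plus a neighbour offset d[] (each in [-R, R]).
// mu = 0 is the fastest-varying direction.
inline int extendedIndex(const std::array<int, 4>& x,
                         const std::array<int, 4>& d)
{
  int idx = 0;
  for (int mu = 3; mu >= 0; --mu) {
    int e = x[mu] + d[mu] + R;         // shift into the extended box
    idx = idx * (X[mu] + 2 * R) + e;   // row-major over extended dims
  }
  return idx;
}
```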

The only issue (which is an immediate one) is that clover and dslash would use different gauge field ghost zone storage (extended versus padded). This highlights another reason why we need to get extended field support for the dslashes.

maddyscientist commented 12 years ago

Moving this to 0.5 since the need for this is less desperate given that Chroma can do these computations on the GPU using qdp/jit.

maddyscientist commented 10 years ago

Note that the clover term creation has now been completed and is multi-GPU capable (this is presently in the quda-0.7 branch). The issue regarding stouting kernels remains open, however.

maddyscientist commented 9 years ago

I'm going to close this issue since we have all the clover functionality, and create a separate issue to track the implementation of stout link creation.