lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Clover and Stouting Kernels #61

Closed bjoo closed 9 years ago

bjoo commented 12 years ago

Hi All, Mike asked me to point out what needs to be done for the clover program to work correctly on GPUs -- basically, which pieces are 'outside of QDP++' such that QDP-JIT cannot handle them.

There are essentially two source files in Chroma that need to be reworked, as far as I am aware:

one is

chroma/lib/actions/ferm/linop/clover_term_qdp_w.h (http://git.jlab.org/cgi-bin/gitweb/gitweb.cgi?p=chroma.git;a=blob;f=lib/actions/ferm/linop/clover_term_qdp_w.h;h=8f80c954b77146d271c469dba4704a72953dde8d;hb=HEAD)

the other is

chroma/lib/util/gauge/stout_utils.cc (http://git.jlab.org/cgi-bin/gitweb/gitweb.cgi?p=chroma.git;a=blob;f=lib/util/gauge/stout_utils.cc;h=c605f4e7cb3479d87a2965730bd6c426becf2546;hb=HEAD)

The first one is the actual clover term implementation, the second one implements utilities used in stout smearing

In clover_term_qdp_w.h we have: a) a data structure for the clover term (template struct PrimitiveClovTriang) which stores the clover term in 2 blocks; each block has a real diagonal part and a complex off-diagonal part:

```cpp
template<typename R>
struct PrimitiveClovTriang {
  RScalar<R>  diag[2][2*Nc];
  RComplex<R> offd[2][2*Nc*Nc - Nc];
};
```

For current QUDA use we also have another structure:

```cpp
template<typename R>
struct QUDAPackedClovSite {
  R diag1[6];
  R offDiag1[15][2];
  R diag2[6];
  R offDiag2[15][2];
};
```

which packs a site in the order expected by loadClover() in the QUDA lib. For a QUDA reimplementation of this file (clover_term_quda.h?) we could deprecate PrimitiveClovTriang and work immediately in the QUDAPackedClovSite format.
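To make the repacking step concrete, here is a minimal sketch of copying one site from the PrimitiveClovTriang layout into the QUDAPackedClovSite layout. It assumes the triangular off-diagonal entries are already in the order loadClover() expects (the concrete reordering lives in Chroma's packing loop); the fixed Nc = 3, the use of float, and the packSite name are illustrative only:

```cpp
// Illustrative stand-ins for the Chroma templates above (hypothetical:
// fixed to Nc = 3 and real type float; RComplex modeled as float[2]).
constexpr int Nc = 3;

struct PrimitiveClovTriang {
  float diag[2][2 * Nc];               // 6 real diagonal entries per chiral block
  float offd[2][2 * Nc * Nc - Nc][2];  // 15 complex off-diagonal entries per block
};

struct QUDAPackedClovSite {
  float diag1[6];
  float offDiag1[15][2];
  float diag2[6];
  float offDiag2[15][2];
};

// One site: copy block 0 then block 1, diagonal first, then the packed
// lower-triangular off-diagonal part (re/im interleaved), assuming both
// layouts use the same triangular ordering.
inline void packSite(const PrimitiveClovTriang& in, QUDAPackedClovSite& out)
{
  for (int i = 0; i < 6; ++i)  out.diag1[i] = in.diag[0][i];
  for (int i = 0; i < 15; ++i) {
    out.offDiag1[i][0] = in.offd[0][i][0];
    out.offDiag1[i][1] = in.offd[0][i][1];
  }
  for (int i = 0; i < 6; ++i)  out.diag2[i] = in.diag[1][i];
  for (int i = 0; i < 15; ++i) {
    out.offDiag2[i][0] = in.offd[1][i][0];
    out.offDiag2[i][1] = in.offd[1][i][1];
  }
}
```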

The class QDPCloverTermT<T,U> is templated on the lattice fermion type T and the gauge field type U -- T is something like LatticeDiracFermionF3, LatticeFermion, LatticeFermionD, etc., and U is LatticeColorMatrix, LatticeColorMatrixF, LatticeColorMatrixD, etc.

It implements variants of the following functions:

- void create(...) -- create the clover term, fill out the PrimitiveClovTriang struct, etc.
- void choles() -- perform an inversion of the clover term using a Cholesky (L^\dagger L) decomposition
- void cholesDet() -- compute the determinant of the clover term using the Cholesky decomposition
- void apply() -- apply the clover term to a fermion
- void triacntr() -- compute Tr( gamma_matrix * Clover ) -- needed in the forces

protected:

- void makeClov() -- given the field strength tensor F and a mass, fill out the PrimitiveClovTriang term -- called by create()
- void chlclovms() -- the actual inversion by Cholesky
- void ldagdlinv() -- inversion of the clover term by an L^\dagger D L factorization (an alternative to Cholesky)

Essentially all of these codes have been threaded already, and so have been separated into an 'argument structure' and a 'siteLoop' function, usually enclosed in a separate namespace.

e.g. in the file clover_term_qdp_w.h one has

```cpp
namespace QDPCloverEnv {

  template<typename U>
  struct QDPCloverMakeClovArg {
    typedef typename WordType<U>::Type_t REALT;
    typedef OScalar< PScalar< PScalar< RScalar<REALT> > > > RealT;
    const RealT& diag_mass;
    const U& f0;
    const U& f1;
    const U& f2;
    const U& f3;
    const U& f4;
    const U& f5;
    multi1d< PrimitiveClovTriang<REALT> >& tri;
  };

  /* This is the extracted site loop for makeClover */
  template<typename U>
  inline void makeClovSiteLoop(int lo, int hi, int myId, QDPCloverMakeClovArg<U>* a)
  {
    typedef typename QDPCloverMakeClovArg<U>::RealT RealT;
    typedef typename QDPCloverMakeClovArg<U>::REALT REALT;

    const RealT& diag_mass = a->diag_mass;
    const U& f0 = a->f0;
    const U& f1 = a->f1;
    const U& f2 = a->f2;
    const U& f3 = a->f3;
    const U& f4 = a->f4;
    const U& f5 = a->f5;
    multi1d< PrimitiveClovTriang<REALT> >& tri = a->tri;

    // SITE LOOP STARTS HERE
    for (int site = lo; site < hi; ++site) {
      /* Construct diagonal */
      ...
    } // end site loop
  } // end function

} // namespace QDPCloverEnv
```

The function makeClov then just does the dispatch:

```cpp
/* This now just sets up and dispatches... */
template<typename T, typename U>
void QDPCloverTermT<T,U>::makeClov(const multi1d<U>& f, const RealT& diag_mass)
{
  START_CODE();

  if (Nd != 4) {
    QDPIO::cerr << __func__ << ": expecting Nd==4" << endl;
    QDP_abort(1);
  }

  if (Ns != 4) {
    QDPIO::cerr << __func__ << ": expecting Ns==4" << endl;
    QDP_abort(1);
  }

  // Set up the inputs of the argument structure
  U f0 = f[0] * getCloverCoeff(0,1);
  U f1 = f[1] * getCloverCoeff(0,2);
  U f2 = f[2] * getCloverCoeff(0,3);
  U f3 = f[3] * getCloverCoeff(1,2);
  U f4 = f[4] * getCloverCoeff(1,3);
  U f5 = f[5] * getCloverCoeff(2,3);

  const int nodeSites = QDP::Layout::sitesOnNode();

  tri.resize(nodeSites); // hold local lattice

  // Set up the argument structure itself
  QDPCloverEnv::QDPCloverMakeClovArg<U> arg = {diag_mass, f0, f1, f2, f3, f4, f5, tri};

  // Dispatch to the QMT/OpenMP threading
  dispatch_to_threads(nodeSites, arg, QDPCloverEnv::makeClovSiteLoop<U>);

  END_CODE();
}
```

In principle this should be straightforward to QUDA-ise, since essentially the site loops would need to be turned into kernels (each one now working on just one site). The downside, though, is that these kernels currently use QDP++/Chroma types (like U = LatticeColorMatrix), which would probably pollute QUDA excessively. Personally, I still think it is best to leave these kernels in Chroma.
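To make the site-loop-to-kernel transformation concrete, here is a sketch in plain C++ with placeholder types (not QDP++ or QUDA ones): the [lo, hi) chunk a CPU worker thread owns becomes a per-site function, which in CUDA would be a __global__ kernel indexed by blockIdx/threadIdx.

```cpp
#include <vector>

// Placeholder for the per-site clover storage; not a Chroma/QUDA type.
struct CloverSite { double diag[2][6]; double offd[2][15][2]; };

// CPU version: the threaded site loop, one [lo, hi) chunk per worker.
void makeClovSiteLoop(int lo, int hi, std::vector<CloverSite>& tri)
{
  for (int site = lo; site < hi; ++site) {
    // ... construct the clover term for `site` ...
    tri[site].diag[0][0] = 1.0;  // stand-in for the real computation
    tri[site].diag[1][0] = 1.0;
  }
}

// GPU-style version: the loop body becomes the kernel and the loop index
// becomes the thread index. Written as a plain function here; in CUDA
// this would be a __global__ kernel with
//   site = blockIdx.x * blockDim.x + threadIdx.x
void makeClovKernel(int site, int nSites, std::vector<CloverSite>& tri)
{
  if (site >= nSites) return;  // bounds guard, as in a real kernel
  tri[site].diag[0][0] = 1.0;
  tri[site].diag[1][0] = 1.0;
}
```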

That having been said, I see the following 'siteLoops' in clover_term_qdp_w.h:

- QDPCloverMakeClovArg + makeClovSiteLoop + dispatch (lines 360-516)
- LDagDLInvArgs + LDagDLInvSiteLoop + dispatch (lines 570-804)
- cholesSiteLoop + dispatch (reusing LDagDLInvArgs) (lines 817-991)
- TriaCntrArgs + triaCntrSiteLoop + dispatch (lines 1154-1499)
- ApplyArgs + applySiteLoop + dispatch (lines 1502-2088) -- but a lot of it is ifdef-ed out

Then finally there is a site loop that repacks the PrimitiveClovTriang into the QUDAPackedClovSite form. If the new versions just used the QUDAPackedClovSite form by default, this could be eliminated. Currently this is:

- QUDAPackArgs + qudaPackSiteLoop + dispatch (lines 2147-2158)

In terms of stout links there is only one 'site loop' that I can see:

- GetFsandBsArgs + getFsAndBsSiteLoop + dispatch (lib/util/gauge/stout_utils.cc, lines 350-847)

and that's it for now. There may be other dirty bits; if these prove to be awkward we'd have to work on them on a case-by-case basis.

Best, B

maddyscientist commented 12 years ago

A thought on the best way to implement the clover term: this is likely to be a very register-bound kernel, since the clover term consists of 72 reals per site. It can't really be split between chiral components, since the required parallel transports are the same for all spin components.

I would suggest that the clover term is computed in two steps:

  1. Compute the plaquette field. This is where all the expensive matrix multiplication takes place, and since it only involves 3x3 complex matrix multiplication it will be less register bound.
  2. Assemble the clover term by adding up the required plaquettes at a given site. This step could possibly be parallelized over chirality?

Regarding Balint's comment above as to whether QUDA would be polluted: this will not be the case at all. There will be no QDP++ types in QUDA, since all we mean to do here is replace the functionality of these computations with an interface call.

maddyscientist commented 12 years ago

Actually, a more sane approach would be:

  1. Compute the Fmunu field.
  2. Assemble clover term from Fmunu.
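As a sketch of step 1, the per-site projection at the end of the usual clover-leaf construction of Fmunu might look as follows. The 1/8 normalization and the fieldStrengthFromLeaves name are assumptions for illustration (conventions differ between codes); Q stands for the already-computed sum of the four plaquette "leaves" in the (mu, nu) plane at a site:

```cpp
#include <array>
#include <complex>

using cplx = std::complex<double>;
using Mat3 = std::array<std::array<cplx, 3>, 3>;

// Given Q = sum of the four plaquette leaves in the (mu, nu) plane,
// project onto the traceless anti-hermitian part to obtain F_munu.
// This is a sketch of one common convention, not QUDA's routine.
Mat3 fieldStrengthFromLeaves(const Mat3& Q)
{
  Mat3 F{};
  // anti-hermitian part: (Q - Q^dag)/2, with 1/4 from averaging the
  // four leaves, combined into the conventional overall 1/8
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      F[i][j] = (Q[i][j] - std::conj(Q[j][i])) / 8.0;
  // subtract the trace so F lives in su(3)
  cplx tr = (F[0][0] + F[1][1] + F[2][2]) / 3.0;
  for (int i = 0; i < 3; ++i) F[i][i] -= tr;
  return F;
}
```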

The comments in CPS actually make for very easy reading of how to implement the clover term:

```
//    Fermion action S = Sum_x_y { phi(x)^bar M_xy phi(y) }
//       where M_xy = I - kappa D,     for Wilson fermions
//                  = A - kappa D,     for Clover improved Wilson fermions
//       where A is local and we adopt the same definition as in
//       Xiang-Qian Luo's paper.
//
//    A = I - kappa * Csw / 2 * Sum_u<v { [gamma_u, gamma_v] * Fuv }
//       where Fuv (here) = i Fuv (Luo),  and we choose
//             v = 0..3 as x,y,z,t
//             gamma_x_y_z = 0  sigma_x_y_z         gamma_t = 0  I
//                        -sigma_x_y_z  0                     I  0
//             sigma_x = 0 i    sigma_y = 0 -1     sigma_z = i  0
//                       i 0              1  0               0 -i
//             [sigma_i, sigma_j] = 2 epsilon_i_j_k sigma_k
//   ===>      [gamma_i, gamma_j] =-2 epsilon_i_j_k (sigma_k    0)
//                                                  (0    sigma_k)
//             [gamma_i, gamma_t] = 2               (sigma_i    0)
//                                                  (0   -sigma_i)
//    A = I + kappa * Csw * (termB - termE    0            )
//                          (0               termB + termE )
//       where termB =  epsilon_ijk sigma_k Fij
//                   =  sigma_z Fxy + sigma_x Fyz - sigma_y Fxz
//                   = (  i Fxy                i Fyz + Fxz )
//                     (  i Fyz - Fxz        - i Fxy       )
//                   = (  i F01                i F12 + F02 )
//                     (  i F12 - F02        - i F01       )
//                   = (  i F01          ___  dagger       )
//                     (  i F12 - F02 __/    - i F01       )
//             termE = sigma_x Fxt + sigma_y Fyt + sigma_z Fzt
//                   = ( i Fzt            i Fxt - Fyt )
//                     ( i Fxt + Fyt      - i Fzt     )
//                   = ( i F23            dagger      )
//                     ( i F03 + F13      - i F23     )
//    A = A0   A1^dag  0    0
//        A1   A2      0    0
//        0    0      A3    A4^dag
//        0    0      A4    A5
//        where  A0 = I + kappa * Csw * i(F01 - F23)
//               A1 =     kappa * Csw * [(iF12-F02) - (iF03+F13)]
//               A2 = I - kappa * Csw * i(F01 - F23)
//               A3 = I + kappa * Csw * i(F01 + F23)
//               A4 =     kappa * Csw * [(iF12-F02) + (iF03+F13)]
//               A5 = I - kappa * Csw * i(F01 + F23)
```

How does this form compare to that used in Chroma and QOP?
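The A0..A5 assembly from the CPS comment can be written down directly. In this sketch each F_munu is reduced to a single complex number in place of a 3x3 colour matrix, so only the spin/block algebra is shown (assembleClover and CloverBlocks are hypothetical names):

```cpp
#include <complex>

using cplx = std::complex<double>;

// The six clover blocks from the CPS comment, with each F_munu reduced
// to one complex number (one colour component) purely for illustration.
struct CloverBlocks { cplx A0, A1, A2, A3, A4, A5; };

CloverBlocks assembleClover(double kappa, double Csw,
                            cplx F01, cplx F02, cplx F03,
                            cplx F12, cplx F13, cplx F23)
{
  const cplx i(0.0, 1.0);
  const double k = kappa * Csw;
  CloverBlocks b;
  b.A0 = 1.0 + k * i * (F01 - F23);
  b.A1 =       k * ((i * F12 - F02) - (i * F03 + F13));
  b.A2 = 1.0 - k * i * (F01 - F23);
  b.A3 = 1.0 + k * i * (F01 + F23);
  b.A4 =       k * ((i * F12 - F02) + (i * F03 + F13));
  b.A5 = 1.0 - k * i * (F01 + F23);
  return b;
}
```

Note that A0 + A2 = 2I and A3 + A5 = 2I by construction, which is a cheap sanity check on any implementation.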

rbabich commented 12 years ago

The QOPQDP implementation is in qopqdp-0.17.6/lib/wilson_dslash_p.c. It was written by Bugra, so I'm assuming that this is the same code as in MILC.

The function f_mu_nu() computes a single (mu, nu) component of F, using a total of 10 temporary color matrices in the process. It gets called a total of 6 times (once for each unique component) by get_clov(), which keeps around 5 temps.

Both functions are well-commented, but since they're written using QDP shifts and (probably as a result) a lot of temps, a direct port would perform terribly.

maddyscientist commented 12 years ago

For Fmunu I think this will be straightforward:

  1. First exchange the ghost zones for the gauge field, as is done for the fat-link and force terms.
  2. Compute the clover term on local sites only, but using the extended-volume gauge field. This makes the indexing very easy.
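A sketch of why the extended-volume field makes indexing easy: once a halo of depth R has been exchanged, a neighbour in any direction is a plain coordinate offset into the enlarged box, with no wrap-around or ghost-buffer special cases. The dimensions, R = 1, and the extendedIndex name are illustrative only:

```cpp
#include <array>

// Illustrative local lattice and halo depth (not QUDA's actual layout).
constexpr int R = 1;                            // halo depth
constexpr std::array<int, 4> X = {4, 4, 4, 8};  // local lattice dims

// Linear index into the extended (X[mu] + 2R)^4 box for local
// coordinates x[] plus a neighbour offset d[] (each in [-R, R]).
// mu = 0 is the fastest-varying direction.
inline int extendedIndex(const std::array<int, 4>& x,
                         const std::array<int, 4>& d)
{
  int idx = 0;
  for (int mu = 3; mu >= 0; --mu) {
    int e = x[mu] + d[mu] + R;         // shift into the extended box
    idx = idx * (X[mu] + 2 * R) + e;   // row-major over extended dims
  }
  return idx;
}
```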

The only issue (which is an immediate one) is that clover and dslash would use different gauge field ghost zone storage (extended versus padded). This highlights another reason why we need to get extended field support for the dslashes.

maddyscientist commented 12 years ago

Moving this to 0.5 since the need for this is less desperate given that Chroma can do these computations on the GPU using qdp/jit.

maddyscientist commented 10 years ago

Note that the clover term creation has now been completed and is multi-GPU capable (this is presently in the quda-0.7 branch). The issue regarding stouting kernels remains open, however.

maddyscientist commented 9 years ago

I'm going to close this issue since we have all the clover functionality, and create a separate issue to track the implementation of stout link creation.