Closed bjoo closed 9 years ago
A thought on the best way to implement the clover term. This is likely to be a very register-bound kernel, since the clover term consists of 72 reals per site. It can't really be split between chiral components, since the require parallel transports are the same for all spin components.
I would suggest that the clover term is computed in two steps:
Regarding Balint's comment above as to whether QUDA would be polluted. This will not be the case at all, there will be no QDP++ types in QUDA, since all we mean to do here is to replace the functionality if these computations with an interface call.
Actually, a more sane approach would be:
The comments in the CPS, actually make for very easy reading of how to implement the clover term:
// Fermion action S = Sum_x_y { phi(x)^bar M_xy phi(y) }
// where M_xy = I - kappa D, for Wilson fermions
// = A - kappa D, for Clover improved Wilson fermions
// where A is local and we adopt the same definition as in
// Xiang-Qian Luo's paper.
//
// A = I - kappa * Csw / 2 * Sum_u<v { [gamma_u, gamma_v] * Fuv }
// where Fuv (here) = i Fuv (Luo), and we choose
// v = 0..3 as x,y,z,t
// gamma_x_y_z = 0 sigma_x_y_z gamma_t = 0 I
// -sigma_x_y_z 0 I 0
// sigma_x = 0 i sigma_y = 0 -1 sigma_z = i 0
// i 0 1 0 0 -i
// [sigma_i, sigma_j] = 2 epsilon_i_j_k sigma_k
// ===> [gamma_i, gamma_j] =-2 epsilon_i_j_k (sigma_k 0)
// (0 sigma_k)
// [gamma_i, gamma_t] = 2 (sigma_i 0)
// (0 -sigma_i)
// A = I + kappa * Csw * (termB - termE 0 )
// (0 termB + termE )
// where termB = epsilon_ijk sigma_k Fij
// = sigma_z Fxy + sigma_x Fyz - sigma_y Fxz
// = ( i Fxy i Fyz + Fxz )
// ( i Fyz - Fxz - i Fxy )
// = ( i F01 i F12 + F02 )
// ( i F12 - F02 - i F01 )
// = ( i F01 ___ dagger )
// ( i F12 - F02 __/ - i F01 )
// termE = sigma_x Fxt + sigma_y Fyt + sigma_z Fzt
// = ( i Fzt i Fxt - Fyt )
// ( i Fxt + Fyt - i Fzt )
// = ( i F23 dagger )
// ( i F03 + F13 - i F23 )
// A = A0 A1^dag 0 0
// A1 A2 0 0
// 0 0 A3 A4^dag
// 0 0 A4 A5
// where A0 = I + kappa * Csw * i(F01 - F23)
// A1 = kappa * Csw * [(iF12-F02) - (iF03+F13)]
// A2 = I - kappa * Csw * i(F01 - F23)
// A3 = I + kappa * Csw * i(F01 + F23)
// A4 = kappa * Csw * [(iF12-F02) + (iF03+F13)]
// A5 = I - kappa * Csw * i(F01 + F23)
How does this form compare to that used in Chroma and QOP?
The QOPQDP implementation is in qopqdp-0.17.6/lib/wilson_dslash_p.c. It was written by Bugra, so I'm assuming that this is the same code as in MILC.
The function f_mu_nu() computes a single (mu, nu) component of F, using a total of 10 temporary color matrices in the process. It gets called a total of 6 times (once for each unique component) by get_clov(), which keeps around 5 temps.
Both functions are well-commented, but since they're written using QDP shifts and (probably as a result) a lot of temps, a direct port would perform terribly.
For Fmunu I think this will be straight forward:
The only issue (which is an immediate one) is that clover and dslash would use different gauge field ghost zone storage (extended versus padded). This highlights another reason why we need to get extended field support for the dslashes.
Moving this to 0.5 since the need for this is less desperate given that Chroma can do these computations on the GPU using qdp/jit.
Note that the clover term creation has now been completed, and is multi-GPU capable, this is presently in the quda-0.7 branch). The issue regarding stouting kernels remains open however.
I'm going to close this issue since we have all the clover functionality, and create a separate issue to track the implementation of stout link creation.
Hi All, Mike asked me to point out what needs to be done for the clover program to work correctly on GPUs. -- Basically, what are the pieces that our 'outside of QDP++' that QDP-JIT cannot handle.
There are two essentially two source files in chroma that need to be re-worked as far as I am aware:
one is
chroma/lib/actions/ferm/linop/clover_term_qdp_w.h (http://git.jlab.org/cgi-bin/gitweb/gitweb.cgi?p=chroma.git;a=blob;f=lib/actions/ferm/linop/clover_term_qdp_w.h;h=8f80c954b77146d271c469dba4704a72953dde8d;hb=HEAD)
the other is
chroma/lib/util/gauge/stout_utils.cc (http://git.jlab.org/cgi-bin/gitweb/gitweb.cgi?p=chroma.git;a=blob;f=lib/util/gauge/stout_utils.cc;h=c605f4e7cb3479d87a2965730bd6c426becf2546;hb=HEAD)
The first one is the actual clover term implementation, the second one implements utilities used in stout smearing
In the clover_term_qdp_w.h we have a) A data structure for the cloverTerm ( template struct PrimitiveClovTriang )
which stores the clover term in 2 blocks. Each block has a real-diagonal part and a complex off diagonal part)
template
struct PrimitiveClovTriang
{
RScalar diag[2][2*Nc];
RComplex offd[2][2_Nc_Nc-Nc];
};
for current QUDA use, we also have another structure
template
struct QUDAPackedClovSite {
R diag1[6];
R offDiag1[15][2];
R diag2[6];
R offDiag2[15][2];
};
which packs a site in the order expected by loadClover() in the QUDA lib. For a QUDA reimplementation of this file
( clover_term_quda.h ?) We could deprecate the PrimitiveClovTriang and work immediately in the QUDAPackedClovSite format.
The class QDPCloverTermT is templated on the lattice fermion type T, and the gauge field type U -- assuming that T is something like: LatticeDiracFermionF3, or LatticeFermion, LatticeFermionD etc, and U is LatticeColorMatrix, LatticeColorMatrixF, LatticeColorMatrixD etc.
Implements the following functions: variants of:
void create(... ) -- create the clover term, fill out the PrimitveClovTriangStruct, etc
void choles() - perform an inversion of the clover term using Cholesky (L^\dagger L) decomposiont void cholesDet() -- compute determinant of the clover term using the Cholesky decomposition void apply() -- apply the clover term to a Fermion void triacntr() -- compute Tr ( gamma_matrix Clover ) -- needed in forces
protected:
void makeClov() -- given the field strength tensor F, and a mass fill out the PrimitiveClovTriang term -- called by create() void chlclovms() -- the actual inversison by cholesky void ldagdlinv() -- inversion of the clover term by L^\dagger D L factorization (alternative to cholesky)
Essentially most of these codes have been threaded already, and so have been separated into an 'argument structure' and a 'siteLoop' function. These are usually enclosed in a separate namespace.
e.g. in the file clover_term_qdp_w.h one has
namespace QDPCloverEnv {
template
struct QDPCloverMakeClovArg {
typedef typename WordType::Type_t REALT;
typedef OScalar< PScalar< PScalar< RScalar > > > RealT;
const RealT& diag_mass;
const U& f0;
const U& f1;
const U& f2;
const U& f3;
const U& f4;
const U& f5;
multi1d< PrimitiveClovTriang < REALT > >& tri;
};
/* This is the extracted site loop for makeClover / template
inline
void makeClovSiteLoop(int lo, int hi, int myId, QDPCloverMakeClovArg a)
{
typedef typename QDPCloverMakeClovArg::RealT RealT;
typedef typename QDPCloverMakeClovArg::REALT REALT;
const RealT& diag_mass = a->diag_mass; const U& f0=a->f0; const U& f1=a->f1; const U& f2=a->f2; const U& f3=a->f3; const U& f4=a->f4; const U& f5=a->f5; multi1d<PrimitiveClovTriang < REALT > >& tri=a->tri;
// SITE LOOP STARTS HERE for(int site = lo; site < hi; ++site) { /# Construct diagonal / ....
} // end site loop } // end function
The function makeClov just dispatches these:
/* This now just sets up and dispatches... */ template<typename T, typename U> void QDPCloverTermT<T,U>::makeClov(const multi1d& f, const RealT& diag_mass) { START_CODE();
if ( Nd != 4 ){ QDPIO::cerr << func << ": expecting Nd==4" << endl; QDP_abort(1); }
if ( Ns != 4 ){ QDPIO::cerr << func << ": expecting Ns==4" << endl; QDP_abort(1); }
// Set up the imputs of the argument structure U f0 = f[0] * getCloverCoeff(0,1); U f1 = f[1] * getCloverCoeff(0,2); U f2 = f[2] * getCloverCoeff(0,3); U f3 = f[3] * getCloverCoeff(1,2); U f4 = f[4] * getCloverCoeff(1,3); U f5 = f[5] * getCloverCoeff(2,3);
const int nodeSites = QDP::Layout::sitesOnNode();
tri.resize(nodeSites); // hold local lattice
// Set up the argument structure itself QDPCloverEnv::QDPCloverMakeClovArg arg = {diag_mass, f0,f1,f2,f3,f4,f5,tri };
// Dispatch to the QMT/OpenMP threading dispatch_to_threads(nodeSites, arg, QDPCloverEnv::makeClovSiteLoop);
END_CODE(); }
In principle this should be straightforward to QUDA-ise, since essentially the Site-loops would need to be turned into kernels (now each one working on just 1 site) -- Down side tho is that these kernels currently use QDP++/Chroma types (like U=LatticeColorMatrix) which would probably pollute QUDA excessively. I still think it is best to leave these kernels in Chroma personally.
That having been said I see the following 'siteLoops' in clover_term_qdp_w.h
QDPCloverMakeClovArg + makeClovSiteLoop + dispatch (line 360-516) LDagDLInvArgs + LDagDLInvSiteLoop + dispatch (line 570- 804) cholesSiteLoop + Dispatch (reusing LDagDLInvArgs) (line 817-991) TriaCntrArgs + triaCntrSiteLoop + dispatch (line 1154-1499) ApplyArgs + applySiteLoop + dispatch (line 1502- 2088) -- but lot of it ifdef-ed out
Then finally there is a site loop that repacks the PrimitiveClovTriang into the QUDAPackedClovSite form If the new versions just used the QUDAPackedClovSite form by default this could be eliminated. Currently this runs from
QUDAPackArgs + qudaPackSiteLoop + dispatch (line 2147-2158)
In terms of stout links there is only one 'site loop that I can see:
GetFsandBsArgs + getFsAndBsSiteLoop + dispatch lib/util/gauge/stout_utils.cc lines: 350- 847)
and that's it for now. There may be other dirty bits. If these proved to be awkward we'd have to work on them on a case by case basis.
Best, B