test PR for ndeg-twisted-clover branch to keep up to date on changes

kostrzewa commented 3 years ago

Just a test PR to our fork of the generic_kernel branch to quickly compare what has changed and to be notified of changes as they are made (if desired).

TODOs

a checkmark means that the non-deg version is wired up completely, has a host_reference implementation (if applicable) and the test passes

[x] DiracTwistedClover::M #6
[x] DiracTwistedClover::TwistClover
[x] DiracTwistedClover::DslashXpay
[x] DiracTwistedClover::MdagM
[x] DiracTwistedCloverPC::TwistCloverInv
[x] DiracTwistedCloverPC::Dslash
[x] DiracTwistedCloverPC::DslashXpay
[x] DiracTwistedCloverPC::M
[x] DiracTwistedCloverPC::MdagM
[x] DiracTwistedCloverPC::prepare
[x] DiracTwistedCloverPC::reconstruct
[x] DiracTwistedCloverPC::prefetch
[x] CloverInvert::apply (adapt to nondeg, see https://github.com/qcdcode/quda/pull/4/commits/7165a1841c6cf56dfd59c62c4b1c331f74e7d3cc)

~~For the PC operator, there's also a bunch of stuff related to the packing kernels which is non-trivial AFAIK.~~ (does not apply to the clover operators)

stuff for later

[ ] Go through all operators and correct flops / bytes

Part of DiracTwistedClover and DiracTwistedCloverPC are the functionalities related to interfacing for the coarse operator. I would propose that in a first round we omit these. However, we have to make sure that the logic disallows a non-deg coarse op to be created...

[ ] check that DiracTwistedClover::createCoarseOp cannot be called for the non-deg operator
[ ] check that DiracTwistedCloverPC::createCoarseOp cannot be called for the non-deg operator

urbach commented 3 years ago

@kostrzewa commented on this pull request.

@@ -7,32 +7,32 @@

namespace quda {

template <typename Float, int nColor, QudaReconstructType reconstruct_, bool dynamicclover>

this template signature now looks different, I think (see TwistedCloverArg)

yes, I already started to change it...

kostrzewa commented 3 years ago

Just to keep you up to date, I'm making progress on the test for the nonPC operator and have reached the stage where I'm adjusting the signature of tmc_ndeg_mat to take into account that the number of flavours as well as the checkerboarding are properties of the ColorSpinorField (such that also the two-flavour operator has only a single input and a single output field, such that the starting indices of the parities and flavours have to be determined at the point of entry).

kostrzewa commented 3 years ago

See https://github.com/qcdcode/helper_scripts/blob/master/quda/run_test_NdegTwistedClover.sh for run script.

kostrzewa commented 3 years ago

Running dslash_test for NdegTwistedClover, we get:

Calculating reference implementation...done.
Tuning...

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x5555757d9550 (gauge_field_order.h:1636)

Thread 1 "dslash_test" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 12184, block (20067,0,0), thread (0,0,0), device 0, sm 0, warp 21, lane 0]
quda::dslashGPU<quda::nDegTwistedClover, quda::packShmem, 2, false, true, (quda::KernelType)5, quda::NdegTwistedCloverArg<float, 3, 4, (QudaReconstructType_s)18> ><<<(20736,2,2),(16,1,1)>>> () at /home/bartek/code/quda.ndeg-twisted-clover/lib/../include/color_spinor.h:1225
1225          y.data[s*Nc + i].y += A(i,j).real() * x.data[s*Nc + j].imag();

So I think now would perhaps be a good time to PR this to the QUDA devs who might immediately be able to tell what's going wrong...

The template arguments with which the kernel is instantiated can be identified via qudaLaunchKernel(dslashGPU<D, P, nParity, dagger, xpay, kernel_type, Arg>, tp, stream, arg);.

urbach commented 3 years ago

Yes, a PR is a good idea! Will you or shall I? -- CU Mobil, www.carsten-urbach.eu

kostrzewa commented 3 years ago

I'll open one in the next few days. Fixing this was actually very instructive. You had the right idea commenting on the in.VolumeCB() at the point of the volume loop launch in the functor. Turns out VolumeCB() is too large by a factor of two on account of our abusing the fifth dimension for flavour. This is why NdegTwistedMass uses in.getDslashConstant().volume_4d_cb instead and we have to do the same here.

urbach commented 3 years ago

Okay, I see. I had put this in comment because I wanted to investigate this at some point. Now you did already! Very good.

Sorry, I'll probably only be able to look at the things you did on Monday. -- CU Mobil, www.carsten-urbach.eu

kostrzewa commented 3 years ago

Sorry, I'll probably only be able to look at the things you did on Monday.

Sure, I just happened to have the motivation and time yesterday and today to go through all of this.

The test is reasonably close to passing now

7134.751081us per kernel call
1337720832 flops per kernel call, 2016 flops per site
GFLOPS = 187.493693
Effective halo bi-directional bandwidth (GB/s) GPU = 0.744023 ( CPU = 0.744031, min = 0.652300 , max = 0.767334 ) for aggregate message size 5308416 bytes
Results: CPU = 7434030.804202, CUDA=7499751.827586, CPU-CUDA = 7499751.827588
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 108293
1 fails = 107960
2 fails = 108306
3 fails = 107985
4 fails = 108229
5 fails = 107980
6 fails = 108280
7 fails = 107959
8 fails = 108286
9 fails = 108015
10 fails = 108350
11 fails = 107935
12 fails = 108293
13 fails = 107960
14 fails = 108306
15 fails = 107985
16 fails = 108229
17 fails = 107980
18 fails = 108280
19 fails = 107959
20 fails = 108287
21 fails = 108015
22 fails = 108350
23 fails = 107935
1.000000e-01 Failures: 150792 / 31850496  = 4.734369e-03
1.000000e-02 Failures: 2074728 / 31850496  = 6.513958e-02
1.000000e-03 Failures: 2595157 / 31850496  = 8.147933e-02
1.000000e-04 Failures: 2648197 / 31850496  = 8.314461e-02
1.000000e-05 Failures: 2653559 / 31850496  = 8.331296e-02
1.000000e-06 Failures: 2654137 / 31850496  = 8.333110e-02
1.000000e-07 Failures: 2829939 / 31850496  = 8.885070e-02
1.000000e-08 Failures: 22727814 / 31850496  = 7.135780e-01
1.000000e-09 Failures: 30902545 / 31850496  = 9.702375e-01
1.000000e-10 Failures: 31755450 / 31850496  = 9.970159e-01
1.000000e-11 Failures: 31840955 / 31850496  = 9.997004e-01
1.000000e-12 Failures: 31849553 / 31850496  = 9.999704e-01
1.000000e-13 Failures: 31850409 / 31850496  = 9.999973e-01
1.000000e-14 Failures: 31850483 / 31850496  = 9.999996e-01
1.000000e-15 Failures: 31850496 / 31850496  = 1.000000e+00
1.000000e-16 Failures: 31850496 / 31850496  = 1.000000e+00
/home/bartek/code/quda.ndeg-twisted-clover/tests/dslash_test.cpp:926: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
CPU and CUDA implementations do not agree
[  FAILED  ] dslash.verify (3964 ms)

urbach commented 3 years ago

I'll review the host imlementation as soo as I can.

kostrzewa commented 3 years ago

I'll review the host imlementation as soo as I can.

Sorry, the "request for review" was really just to get us both subscribed to this issue, please don't feel under time pressure because I happened to work on this yesterday.

urbach commented 3 years ago

I'll review the host imlementation as soo as I can.

Sorry, the "request for review" was really just to get us both subscribed to this issue, please don't feel under time pressure because I happened to work on this yesterday.

all fine!

urbach commented 3 years ago

I have to take the decision whether or not to keep the asymmetric and !xpay versions of the preconditioned kernel. I think I'll first go without now. Any additional thoughts?

urbach commented 3 years ago

on the other hand, we would break the rule I suppose...

kostrzewa commented 3 years ago

I have to take the decision whether or not to keep the asymmetric and !xpay versions of the preconditioned kernel. I think I'll first go without now. Any additional thoughts?

I think we will need the !xpay version in the HMC, right? Although of course one could also rely on the xpay = true version and accumulate a zero zpinor.

Of course, much like in the case of DiracTwistedClover, we can build this out of a special function applying the inverse twisted clover term and call the Wilson Dslash (see lib/dirac_twisted_clover.cpp -> DiracTwistedCloverPC::Dslash)

kostrzewa commented 3 years ago

I've put a bunch of TODOs at the top which might help in scoping out the tasks ahead. Not sure if this is the best way of organising things as this PR will get rather full rather quickly I presume.

kostrzewa commented 3 years ago

@urbach I'm currently merging in the latest changes from feature/generic_kernel and there is a minor change relevant to the shared memory allocation in the preconditioned kernel which will have to be propagated to the ndeg wisted clover preconditioned kernel.

diff --git a/include/kernels/dslash_ndeg_twisted_mass_preconditioned.cuh b/include/kernels/dslash_ndeg_twisted_mass_preconditioned.cuh
index 0891fa2c1..16a255831 100644
--- a/include/kernels/dslash_ndeg_twisted_mass_preconditioned.cuh
+++ b/include/kernels/dslash_ndeg_twisted_mass_preconditioned.cuh
@@ -94,7 +94,7 @@ namespace quda

       if (isComplete<kernel_type>(arg, coord) && active) {
         if (!dagger || Arg::asymmetric) { // apply A^{-1} to D*in
-          SharedMemoryCache<Vector> cache(device::block_dim());
+          SharedMemoryCache<Vector> cache(target::block_dim());
           // to apply the preconditioner we need to put "out" in shared memory so the other flavor can access it
           cache.save(out);
           cache.sync(); // safe to sync in here since other threads will exit

Note also that Kepler support has been dropped.

kostrzewa commented 3 years ago

As discussed, here's the PR for build testing: https://github.com/lattice/quda/pull/1121

urbach commented 3 years ago

Yes, did so already... -- CU Mobil, www.carsten-urbach.eu

kostrzewa commented 3 years ago

@urbach I've added tmc_ndeg_dslash which can be used to test the non-degerate DiracTwistedCloverPC::Dslash. Maybe we can have a chat one of these days to discuss again.

kostrzewa commented 3 years ago

If you want, I can start wiring up the operators DiracTwistedCloverPC::Dslash, DiracTwistedCloverPC::DslashXpay, DiracTwistedCloverPC::M and DiracTwistedCloverPC::MdagM to call the appropriate Apply* functions (and combinations of other functions) based on the flavour and preconditioning type. I just don't want to cause conflicts with your local copy.

urbach commented 3 years ago

If you want, I can start wiring up the operators DiracTwistedCloverPC::Dslash, DiracTwistedCloverPC::DslashXpay, DiracTwistedCloverPC::M and DiracTwistedCloverPC::MdagM to call the appropriate Apply* functions (and combinations of other functions) based on the flavour type. I just don't want to cause conflicts with your local copy.

please go ahead, I can cope with conflicts...

urbach commented 3 years ago

If I understand correctly, then what is currently in dslash_ndeg_twisted_clover_preconditioned is the symmetric E/O preconditioned form, isn't it?

urbach commented 3 years ago

no, okay, no I understand, sorry for the confusion

kostrzewa commented 3 years ago

the issue below has been fixed

old comment

I think we can't do this (which is the call for the nondeg operator):

196   void DiracTwistedCloverPC::Dslash(ColorSpinorField &out, const ColorSpinorField &in, const QudaParity parity) const
197   {
198     checkParitySpinor(in, out);
199     checkSpinorAlias(in, out);
200 
201     if (in.TwistFlavor() != out.TwistFlavor())
202       errorQuda("Twist flavors %d %d don't match", in.TwistFlavor(), out.TwistFlavor());
203     if (in.TwistFlavor() == QUDA_TWIST_NO || in.TwistFlavor() == QUDA_TWIST_INVALID)
204       errorQuda("Twist flavor not set %d", in.TwistFlavor());
205 
206     bool symmetric = (matpcType == QUDA_MATPC_EVEN_EVEN || matpcType == QUDA_MATPC_ODD_ODD) ? true : false;
207     if (dagger && symmetric && !reverse) {
208       printf("TwistCloverInv -> DiracWilson::Dslash\n");
209       bool reset = newTmp(&tmp2, in);
210       TwistCloverInv(*tmp2, in, 1 - parity);
211       DiracWilson::Dslash(out, *tmp2, parity);
212       deleteTmp(&tmp2, reset);
213     } else {

as WilsonApply, which is instantiated by ApplyWilson, which is what is called in DiracWilson::Dslash, refers to in.VolumeCB():

 35     inline WilsonApply(ColorSpinorField &out, const ColorSpinorField &in, const GaugeField &U, double a,
 36                        const ColorSpinorField &x, int parity, bool dagger, const int *comm_override, TimeProfile &profile)
 37     {
 38       constexpr int nDim = 4;
 39       WilsonArg<Float, nColor, nDim, recon> arg(out, in, U, a, x, parity, dagger, comm_override);
 40       Wilson<decltype(arg)> wilson(arg, out, in);
 41 
 42       dslash::DslashPolicyTune<decltype(wilson)> policy(
 43         wilson, const_cast<cudaColorSpinorField *>(static_cast<const cudaColorSpinorField *>(&in)), in.VolumeCB(),
 44         in.GhostFaceCB(), profile);
 45     }                                                                                                                                  
 46   };

which is a 5D volume as we've observed previously. I'm not sure if we'll break anything by moving to volume_4d_cb here and then I don't know if we need to call the Dslash twice, once for each flavour...

kostrzewa commented 3 years ago

Alright, so the Dslash test cannot pass for asymmetric preconditioning because the host reference code is wrong. The second application of the inverse diagonal term is simply not the correct thing to do.

The asymmetric M is implemented by a call to DiracTwistedCloverPC::Dslash, followed by a call to DiracTwistedClover::DslashXpay with appropriate arguments, such that one ends up with M_{oo} \psi_o - \kappa^2 D_{oe} M_{ee}^{-1} D_{eo} \psi_o in the output, as required. As a result, I think I'll be able to wire everything up today with the understanding that the Dslash test will continue to fail for the asymmetric preconditioning types, but MatPC and company should pass.

kostrzewa commented 3 years ago

Things are clearer now after I've wired up all the tests. Unlike for the normal non-deg twisted mass operator, we need DiracTwistedCloverPC::DslashXpay as it is required by the non-daggered symmetric PC operator:

257   void DiracTwistedCloverPC::M(ColorSpinorField &out, const ColorSpinorField &in) const                                                                  
258   {
259     double kappa2 = -kappa*kappa;
260     bool reset = newTmp(&tmp1, in);
261 
262     bool symmetric =(matpcType == QUDA_MATPC_EVEN_EVEN || matpcType == QUDA_MATPC_ODD_ODD) ? true : false;
263     int odd_bit = (matpcType == QUDA_MATPC_ODD_ODD || matpcType == QUDA_MATPC_ODD_ODD_ASYMMETRIC) ? 1 : 0;
264     QudaParity parity[2] = {static_cast<QudaParity>((1 + odd_bit) % 2), static_cast<QudaParity>((0 + odd_bit) % 2)};
265 
266     if (!symmetric) { // asymmetric preconditioning
267       Dslash(*tmp1, in, parity[0]);
268       DiracTwistedClover::DslashXpay(out, *tmp1, parity[1], in, kappa2);
269     } else if (!dagger) { // symmetric preconditioning
270       Dslash(*tmp1, in, parity[0]);
271       DslashXpay(out, *tmp1, parity[1], in, kappa2);
272     } else { // symmetric preconditioning, dagger
273       TwistCloverInv(out, in, parity[1]);
274       reverse = true;
275       Dslash(*tmp1, out, parity[0]);
276       reverse = false;
277       DiracWilson::DslashXpay(out, *tmp1, parity[1], in, kappa2);
278     }
279 
280     deleteTmp(&tmp1, reset);
281   }

kostrzewa commented 3 years ago

At this stage I would ask (if possible) for a review of the tmc_ndeg_matpc for silly mistakes. Beyond that we'll soon have the tests set up for the non-deg-DiracTwistedCloverPC::M (and company) as I've modified the kernel and functor to follow the single-flavour operator more closely (rather than the non-deg twisted mass operator, which is special because it affects how communication is done since the inverse of the diagonal is trivial and can be applied more efficiently than the inverse of the diagonal of the clover operator).

Reviewed this myself and found a bug in the odd-odd-asym.

kostrzewa commented 3 years ago

If possible a review and merge of #11 would be nice. All the tests for PC::M and PC::MdagM pass now, such that "only" source preparation and reconstruction remain to be generalised to two flavours.

Beyond this, a review of the flop and byte counters should be done as well as a check of the multi-shift solver to make sure that it supports the two-flavour case. (probably, as it surely must support DWF, but you never know...)

kostrzewa commented 3 years ago

I think the source preparation and reconstruction also goes through trivially with the latest commit to #11 thanks to the modifications that I made to ApplyWilson.

kostrzewa commented 3 years ago

Testing prepare and reconstruct means running (and possibly adjusting) invert_test with the correct arguments for --matpc-type and --solution-type. Complementarily, one could also proceed to extend the tmLQCD QUDA interface to fully support the non-degenerate operators, thus directly testing against tmLQCD.

kostrzewa commented 3 years ago

ugh, latest changes to the generic_kernel branch completely broke our 2-flavour operator... running https://github.com/qcdcode/helper_scripts/blob/master/quda/dslash_test/run_test_NdegTwistedClover.sh I get only failures:

Mat_dag-0_matpc-none_rval1
Mat_dag-1_matpc-none_rval1
MatDagMat_dag-0_matpc-none_rval1
MatDagMat_dag-1_matpc-none_rval1
Dslash_dag-0_matpc-odd-odd_rval1
Dslash_dag-0_matpc-even-even_rval1
Dslash_dag-0_matpc-odd-odd-asym_rval1
Dslash_dag-0_matpc-even-even-asym_rval1
Dslash_dag-1_matpc-odd-odd_rval1
Dslash_dag-1_matpc-even-even_rval1
Dslash_dag-1_matpc-odd-odd-asym_rval1
Dslash_dag-1_matpc-even-even-asym_rval1
MatPC_dag-0_matpc-odd-odd_rval1
MatPC_dag-0_matpc-even-even_rval1
MatPC_dag-0_matpc-odd-odd-asym_rval1
MatPC_dag-0_matpc-even-even-asym_rval1
MatPC_dag-1_matpc-odd-odd_rval1
MatPC_dag-1_matpc-even-even_rval1
MatPC_dag-1_matpc-odd-odd-asym_rval1
MatPC_dag-1_matpc-even-even-asym_rval1
MatPCDagMatPC_dag-0_matpc-odd-odd_rval1
MatPCDagMatPC_dag-0_matpc-even-even_rval1
MatPCDagMatPC_dag-0_matpc-odd-odd-asym_rval1
MatPCDagMatPC_dag-0_matpc-even-even-asym_rval1
MatPCDagMatPC_dag-1_matpc-odd-odd_rval1
MatPCDagMatPC_dag-1_matpc-even-even_rval1
MatPCDagMatPC_dag-1_matpc-odd-odd-asym_rval1
MatPCDagMatPC_dag-1_matpc-even-even-asym_rval1

urbach commented 3 years ago

argh, super... :(

kostrzewa commented 3 years ago

The issue is with comms, the operators still pass the tests when run with a single MPI task.

kostrzewa commented 3 years ago

https://github.com/lattice/quda/pull/1121

qcdcode / quda