Closed kostrzewa closed 3 years ago
@kostrzewa commented on this pull request.
@@ -7,32 +7,32 @@
namespace quda {
- template <typename Float, int nColor, QudaReconstructType reconstruct_, bool dynamicclover>
this template signature now looks different, I think (see
TwistedCloverArg
)
yes, I already started to change it...
Just to keep you up to date, I'm making progress on the test for the nonPC operator and have reached the stage where I'm adjusting the signature of tmc_ndeg_mat
to take into account that the number of flavours as well as the checkerboarding are properties of the ColorSpinorField
(such that also the two-flavour operator has only a single input and a single output field, such that the starting indices of the parities and flavours have to be determined at the point of entry).
Running dslash_test
for NdegTwistedClover
, we get:
Calculating reference implementation...done.
Tuning...
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x5555757d9550 (gauge_field_order.h:1636)
Thread 1 "dslash_test" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 12184, block (20067,0,0), thread (0,0,0), device 0, sm 0, warp 21, lane 0]
quda::dslashGPU<quda::nDegTwistedClover, quda::packShmem, 2, false, true, (quda::KernelType)5, quda::NdegTwistedCloverArg<float, 3, 4, (QudaReconstructType_s)18> ><<<(20736,2,2),(16,1,1)>>> () at /home/bartek/code/quda.ndeg-twisted-clover/lib/../include/color_spinor.h:1225
1225 y.data[s*Nc + i].y += A(i,j).real() * x.data[s*Nc + j].imag();
So I think now would perhaps be a good time to PR this to the QUDA devs who might immediately be able to tell what's going wrong...
The template arguments with which the kernel is instantiated can be identified via qudaLaunchKernel(dslashGPU<D, P, nParity, dagger, xpay, kernel_type, Arg>, tp, stream, arg);
.
Yes, a PR is a good idea! Will you or shall I? -- CU Mobil, www.carsten-urbach.eu
I'll open one in the next few days. Fixing this was actually very instructive. You had the right idea commenting on the in.VolumeCB()
at the point of the volume loop launch in the functor. Turns out VolumeCB()
is too large by a factor of two on account of our abusing the fifth dimension for flavour. This is why NdegTwistedMass uses in.getDslashConstant().volume_4d_cb
instead and we have to do the same here.
Okay, I see. I had put this in comment because I wanted to investigate this at some point. Now you did already! Very good.
Sorry, I'll probably only be able to look at the things you did on Monday. -- CU Mobil, www.carsten-urbach.eu
Sorry, I'll probably only be able to look at the things you did on Monday.
Sure, I just happened to have the motivation and time yesterday and today to go through all of this.
The test is reasonably close to passing now
7134.751081us per kernel call
1337720832 flops per kernel call, 2016 flops per site
GFLOPS = 187.493693
Effective halo bi-directional bandwidth (GB/s) GPU = 0.744023 ( CPU = 0.744031, min = 0.652300 , max = 0.767334 ) for aggregate message size 5308416 bytes
Results: CPU = 7434030.804202, CUDA=7499751.827586, CPU-CUDA = 7499751.827588
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN ] dslash.verify
0 fails = 108293
1 fails = 107960
2 fails = 108306
3 fails = 107985
4 fails = 108229
5 fails = 107980
6 fails = 108280
7 fails = 107959
8 fails = 108286
9 fails = 108015
10 fails = 108350
11 fails = 107935
12 fails = 108293
13 fails = 107960
14 fails = 108306
15 fails = 107985
16 fails = 108229
17 fails = 107980
18 fails = 108280
19 fails = 107959
20 fails = 108287
21 fails = 108015
22 fails = 108350
23 fails = 107935
1.000000e-01 Failures: 150792 / 31850496 = 4.734369e-03
1.000000e-02 Failures: 2074728 / 31850496 = 6.513958e-02
1.000000e-03 Failures: 2595157 / 31850496 = 8.147933e-02
1.000000e-04 Failures: 2648197 / 31850496 = 8.314461e-02
1.000000e-05 Failures: 2653559 / 31850496 = 8.331296e-02
1.000000e-06 Failures: 2654137 / 31850496 = 8.333110e-02
1.000000e-07 Failures: 2829939 / 31850496 = 8.885070e-02
1.000000e-08 Failures: 22727814 / 31850496 = 7.135780e-01
1.000000e-09 Failures: 30902545 / 31850496 = 9.702375e-01
1.000000e-10 Failures: 31755450 / 31850496 = 9.970159e-01
1.000000e-11 Failures: 31840955 / 31850496 = 9.997004e-01
1.000000e-12 Failures: 31849553 / 31850496 = 9.999704e-01
1.000000e-13 Failures: 31850409 / 31850496 = 9.999973e-01
1.000000e-14 Failures: 31850483 / 31850496 = 9.999996e-01
1.000000e-15 Failures: 31850496 / 31850496 = 1.000000e+00
1.000000e-16 Failures: 31850496 / 31850496 = 1.000000e+00
/home/bartek/code/quda.ndeg-twisted-clover/tests/dslash_test.cpp:926: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
CPU and CUDA implementations do not agree
[ FAILED ] dslash.verify (3964 ms)
I'll review the host imlementation as soo as I can.
I'll review the host imlementation as soo as I can.
Sorry, the "request for review" was really just to get us both subscribed to this issue, please don't feel under time pressure because I happened to work on this yesterday.
I'll review the host imlementation as soo as I can.
Sorry, the "request for review" was really just to get us both subscribed to this issue, please don't feel under time pressure because I happened to work on this yesterday.
all fine!
I have to take the decision whether or not to keep the asymmetric
and !xpay
versions of the preconditioned kernel. I think I'll first go without now. Any additional thoughts?
on the other hand, we would break the rule I suppose...
I have to take the decision whether or not to keep the asymmetric and !xpay versions of the preconditioned kernel. I think I'll first go without now. Any additional thoughts?
I think we will need the !xpay
version in the HMC, right? Although of course one could also rely on the xpay = true
version and accumulate a zero zpinor.
Of course, much like in the case of DiracTwistedClover
, we can build this out of a special function applying the inverse twisted clover term and call the Wilson Dslash
(see lib/dirac_twisted_clover.cpp
-> DiracTwistedCloverPC::Dslash
)
I've put a bunch of TODOs at the top which might help in scoping out the tasks ahead. Not sure if this is the best way of organising things as this PR will get rather full rather quickly I presume.
@urbach I'm currently merging in the latest changes from feature/generic_kernel
and there is a minor change relevant to the shared memory allocation in the preconditioned kernel which will have to be propagated to the ndeg wisted clover preconditioned kernel.
diff --git a/include/kernels/dslash_ndeg_twisted_mass_preconditioned.cuh b/include/kernels/dslash_ndeg_twisted_mass_preconditioned.cuh
index 0891fa2c1..16a255831 100644
--- a/include/kernels/dslash_ndeg_twisted_mass_preconditioned.cuh
+++ b/include/kernels/dslash_ndeg_twisted_mass_preconditioned.cuh
@@ -94,7 +94,7 @@ namespace quda
if (isComplete<kernel_type>(arg, coord) && active) {
if (!dagger || Arg::asymmetric) { // apply A^{-1} to D*in
- SharedMemoryCache<Vector> cache(device::block_dim());
+ SharedMemoryCache<Vector> cache(target::block_dim());
// to apply the preconditioner we need to put "out" in shared memory so the other flavor can access it
cache.save(out);
cache.sync(); // safe to sync in here since other threads will exit
Note also that Kepler support has been dropped.
As discussed, here's the PR for build testing: https://github.com/lattice/quda/pull/1121
Yes, did so already... -- CU Mobil, www.carsten-urbach.eu
@urbach I've added tmc_ndeg_dslash
which can be used to test the non-degerate DiracTwistedCloverPC::Dslash
. Maybe we can have a chat one of these days to discuss again.
If you want, I can start wiring up the operators DiracTwistedCloverPC::Dslash
, DiracTwistedCloverPC::DslashXpay
, DiracTwistedCloverPC::M
and DiracTwistedCloverPC::MdagM
to call the appropriate Apply*
functions (and combinations of other functions) based on the flavour and preconditioning type. I just don't want to cause conflicts with your local copy.
If you want, I can start wiring up the operators
DiracTwistedCloverPC::Dslash
,DiracTwistedCloverPC::DslashXpay
,DiracTwistedCloverPC::M
andDiracTwistedCloverPC::MdagM
to call the appropriateApply*
functions (and combinations of other functions) based on the flavour type. I just don't want to cause conflicts with your local copy.
please go ahead, I can cope with conflicts...
If I understand correctly, then what is currently in dslash_ndeg_twisted_clover_preconditioned
is the symmetric E/O preconditioned form, isn't it?
no, okay, no I understand, sorry for the confusion
the issue below has been fixed
I think we can't do this (which is the call for the nondeg operator):
196 void DiracTwistedCloverPC::Dslash(ColorSpinorField &out, const ColorSpinorField &in, const QudaParity parity) const
197 {
198 checkParitySpinor(in, out);
199 checkSpinorAlias(in, out);
200
201 if (in.TwistFlavor() != out.TwistFlavor())
202 errorQuda("Twist flavors %d %d don't match", in.TwistFlavor(), out.TwistFlavor());
203 if (in.TwistFlavor() == QUDA_TWIST_NO || in.TwistFlavor() == QUDA_TWIST_INVALID)
204 errorQuda("Twist flavor not set %d", in.TwistFlavor());
205
206 bool symmetric = (matpcType == QUDA_MATPC_EVEN_EVEN || matpcType == QUDA_MATPC_ODD_ODD) ? true : false;
207 if (dagger && symmetric && !reverse) {
208 printf("TwistCloverInv -> DiracWilson::Dslash\n");
209 bool reset = newTmp(&tmp2, in);
210 TwistCloverInv(*tmp2, in, 1 - parity);
211 DiracWilson::Dslash(out, *tmp2, parity);
212 deleteTmp(&tmp2, reset);
213 } else {
as WilsonApply
, which is instantiated by ApplyWilson
, which is what is called in DiracWilson::Dslash
, refers to in.VolumeCB()
:
35 inline WilsonApply(ColorSpinorField &out, const ColorSpinorField &in, const GaugeField &U, double a,
36 const ColorSpinorField &x, int parity, bool dagger, const int *comm_override, TimeProfile &profile)
37 {
38 constexpr int nDim = 4;
39 WilsonArg<Float, nColor, nDim, recon> arg(out, in, U, a, x, parity, dagger, comm_override);
40 Wilson<decltype(arg)> wilson(arg, out, in);
41
42 dslash::DslashPolicyTune<decltype(wilson)> policy(
43 wilson, const_cast<cudaColorSpinorField *>(static_cast<const cudaColorSpinorField *>(&in)), in.VolumeCB(),
44 in.GhostFaceCB(), profile);
45 }
46 };
which is a 5D volume as we've observed previously. I'm not sure if we'll break anything by moving to volume_4d_cb
here and then I don't know if we need to call the Dslash twice, once for each flavour...
Alright, so the Dslash
test cannot pass for asymmetric
preconditioning because the host reference code is wrong. The second application of the inverse diagonal term is simply not the correct thing to do.
The asymmetric M
is implemented by a call to DiracTwistedCloverPC::Dslash
, followed by a call to DiracTwistedClover::DslashXpay
with appropriate arguments, such that one ends up with M_{oo} \psi_o - \kappa^2 D_{oe} M_{ee}^{-1} D_{eo} \psi_o
in the output, as required. As a result, I think I'll be able to wire everything up today with the understanding that the Dslash
test will continue to fail for the asymmetric preconditioning types, but MatPC
and company should pass.
Things are clearer now after I've wired up all the tests. Unlike for the normal non-deg twisted mass operator, we need DiracTwistedCloverPC::DslashXpay
as it is required by the non-daggered symmetric PC operator:
257 void DiracTwistedCloverPC::M(ColorSpinorField &out, const ColorSpinorField &in) const
258 {
259 double kappa2 = -kappa*kappa;
260 bool reset = newTmp(&tmp1, in);
261
262 bool symmetric =(matpcType == QUDA_MATPC_EVEN_EVEN || matpcType == QUDA_MATPC_ODD_ODD) ? true : false;
263 int odd_bit = (matpcType == QUDA_MATPC_ODD_ODD || matpcType == QUDA_MATPC_ODD_ODD_ASYMMETRIC) ? 1 : 0;
264 QudaParity parity[2] = {static_cast<QudaParity>((1 + odd_bit) % 2), static_cast<QudaParity>((0 + odd_bit) % 2)};
265
266 if (!symmetric) { // asymmetric preconditioning
267 Dslash(*tmp1, in, parity[0]);
268 DiracTwistedClover::DslashXpay(out, *tmp1, parity[1], in, kappa2);
269 } else if (!dagger) { // symmetric preconditioning
270 Dslash(*tmp1, in, parity[0]);
271 DslashXpay(out, *tmp1, parity[1], in, kappa2);
272 } else { // symmetric preconditioning, dagger
273 TwistCloverInv(out, in, parity[1]);
274 reverse = true;
275 Dslash(*tmp1, out, parity[0]);
276 reverse = false;
277 DiracWilson::DslashXpay(out, *tmp1, parity[1], in, kappa2);
278 }
279
280 deleteTmp(&tmp1, reset);
281 }
At this stage I would ask (if possible) for a review of the tmc_ndeg_matpc
for silly mistakes. Beyond that we'll soon have the tests set up for the non-deg-DiracTwistedCloverPC::M
(and company) as I've modified the kernel and functor to follow the single-flavour operator more closely (rather than the non-deg twisted mass operator, which is special because it affects how communication is done since the inverse of the diagonal is trivial and can be applied more efficiently than the inverse of the diagonal of the clover operator).
Reviewed this myself and found a bug in the odd-odd-asym
.
If possible a review and merge of #11 would be nice. All the tests for PC::M
and PC::MdagM
pass now, such that "only" source preparation and reconstruction remain to be generalised to two flavours.
Beyond this, a review of the flop and byte counters should be done as well as a check of the multi-shift solver to make sure that it supports the two-flavour case. (probably, as it surely must support DWF, but you never know...)
I think the source preparation and reconstruction also goes through trivially with the latest commit to #11 thanks to the modifications that I made to ApplyWilson
.
Testing prepare
and reconstruct
means running (and possibly adjusting) invert_test
with the correct arguments for --matpc-type
and --solution-type
. Complementarily, one could also proceed to extend the tmLQCD QUDA interface to fully support the non-degenerate operators, thus directly testing against tmLQCD.
ugh, latest changes to the generic_kernel branch completely broke our 2-flavour operator... running https://github.com/qcdcode/helper_scripts/blob/master/quda/dslash_test/run_test_NdegTwistedClover.sh I get only failures:
Mat_dag-0_matpc-none_rval1
Mat_dag-1_matpc-none_rval1
MatDagMat_dag-0_matpc-none_rval1
MatDagMat_dag-1_matpc-none_rval1
Dslash_dag-0_matpc-odd-odd_rval1
Dslash_dag-0_matpc-even-even_rval1
Dslash_dag-0_matpc-odd-odd-asym_rval1
Dslash_dag-0_matpc-even-even-asym_rval1
Dslash_dag-1_matpc-odd-odd_rval1
Dslash_dag-1_matpc-even-even_rval1
Dslash_dag-1_matpc-odd-odd-asym_rval1
Dslash_dag-1_matpc-even-even-asym_rval1
MatPC_dag-0_matpc-odd-odd_rval1
MatPC_dag-0_matpc-even-even_rval1
MatPC_dag-0_matpc-odd-odd-asym_rval1
MatPC_dag-0_matpc-even-even-asym_rval1
MatPC_dag-1_matpc-odd-odd_rval1
MatPC_dag-1_matpc-even-even_rval1
MatPC_dag-1_matpc-odd-odd-asym_rval1
MatPC_dag-1_matpc-even-even-asym_rval1
MatPCDagMatPC_dag-0_matpc-odd-odd_rval1
MatPCDagMatPC_dag-0_matpc-even-even_rval1
MatPCDagMatPC_dag-0_matpc-odd-odd-asym_rval1
MatPCDagMatPC_dag-0_matpc-even-even-asym_rval1
MatPCDagMatPC_dag-1_matpc-odd-odd_rval1
MatPCDagMatPC_dag-1_matpc-even-even_rval1
MatPCDagMatPC_dag-1_matpc-odd-odd-asym_rval1
MatPCDagMatPC_dag-1_matpc-even-even-asym_rval1
argh, super... :(
The issue is with comms, the operators still pass the tests when run with a single MPI task.
Just a test PR to our fork of the generic_kernel branch to quickly compare what has changed and to be notified of changes as they are made (if desired).
TODOs
a checkmark means that the non-deg version is wired up completely, has a
host_reference
implementation (if applicable) and the test passesDiracTwistedClover::M
#6DiracTwistedClover::TwistClover
DiracTwistedClover::DslashXpay
DiracTwistedClover::MdagM
DiracTwistedCloverPC::TwistCloverInv
DiracTwistedCloverPC::Dslash
DiracTwistedCloverPC::DslashXpay
DiracTwistedCloverPC::M
DiracTwistedCloverPC::MdagM
DiracTwistedCloverPC::prepare
DiracTwistedCloverPC::reconstruct
DiracTwistedCloverPC::prefetch
CloverInvert::apply
(adapt to nondeg, see https://github.com/qcdcode/quda/pull/4/commits/7165a1841c6cf56dfd59c62c4b1c331f74e7d3cc)For the PC operator, there's also a bunch of stuff related to the packing kernels which is non-trivial AFAIK.(does not apply to the clover operators)stuff for later
Part of
DiracTwistedClover
andDiracTwistedCloverPC
are the functionalities related to interfacing for the coarse operator. I would propose that in a first round we omit these. However, we have to make sure that the logic disallows a non-deg coarse op to be created...DiracTwistedClover::createCoarseOp
cannot be called for the non-deg operatorDiracTwistedCloverPC::createCoarseOp
cannot be called for the non-deg operator