Closed Marcogarofalo closed 2 months ago
I think the CI actions do not trigger the test for the non-degenerate doublet. If I run locally with

```shell
mpirun -n 2 $1 ./tests/clover_force_test \
    --dslash-type twisted-clover \
    --flavor nondeg-doublet \
    --compute-clover 1 \
    --matpc odd-odd-asym \
    --dagger \
    --solution-type mat-pc-dag-mat-pc \
    --dim 2 4 6 8 --prec ${prec} \
    --gridsize 2 1 1 1 \
    --niter 1 \
    --compute-clover-trlog 0 \
    --verbosity debug \
    --enable-testing 0 \
    --determinant-ratio 0 \
    --nsrc 1
```

it passes, but if I set `--enable-testing 1` a different Dirac operator is called. From `make test` the test is launched with `--enable-testing 1`.
I am also trying to measure the performance in a realistic setup.
@Marcogarofalo when you set `enable-testing`, the predefined tests at the bottom of `clover_force_test.cpp` are run. What you need to do is add the non-degenerate case to that list. I would suggest adding a singlet/doublet option to the testing tuple, e.g., an extra bool:

```cpp
using test_t = ::testing::tuple<QudaPrecision, QudaDslashType, bool, int, bool>;
```

and then instantiating both the singlet and doublet options in the test outer product:
```cpp
INSTANTIATE_TEST_SUITE_P(CloverForceTest, CloverForceTest,
                         Combine(Values(QUDA_SINGLE_PRECISION, QUDA_DOUBLE_PRECISION),
                                 Values(QUDA_CLOVER_WILSON_DSLASH, QUDA_TWISTED_CLOVER_DSLASH),
                                 Values(false, true), Values(1, 8), Values(false, true)),
                         gettestname);
```
This will add a nonsense test for the doublet with regular clover fermions: you could deal with this by skipping the test when the clover/doublet combination is encountered, or by splitting the instantiation into separate clover and twisted-clover cases:
```cpp
// regular clover (singlet only); note the two instantiations need distinct names
INSTANTIATE_TEST_SUITE_P(Clover, CloverForceTest,
                         Combine(Values(QUDA_SINGLE_PRECISION, QUDA_DOUBLE_PRECISION),
                                 Values(QUDA_CLOVER_WILSON_DSLASH), Values(false, true),
                                 Values(1, 8), Values(false)),
                         gettestname);

// twisted clover (singlet and doublet)
INSTANTIATE_TEST_SUITE_P(TwistedClover, CloverForceTest,
                         Combine(Values(QUDA_SINGLE_PRECISION, QUDA_DOUBLE_PRECISION),
                                 Values(QUDA_TWISTED_CLOVER_DSLASH), Values(false, true),
                                 Values(1, 8), Values(false, true)),
                         gettestname);
```
Hope this helps
@maddyscientist thank you, I hope I can fix the tests with your suggestions. Some parameters were initialised from the command-line arguments and not from `INSTANTIATE_TEST_SUITE_P`. Now, when the test is called with `enable-testing 1`, the parameters from `INSTANTIATE_TEST_SUITE_P` overwrite their command-line counterparts, at least the ones I checked.
Regarding performance, I tried a 64^3x128 lattice, integrating 3 orders of the rational approximation (`--nsrc 3` in a test) on 4 nodes of JUWELS Booster, and I got (times in seconds):

```
tmLQCD native: 10.2713
tmLQCD + QUDA: 8.560943e-01
```

of which about half, 4.157508e-01, was spent in `computeTMCloverForceQuda`. The rest is spent in tmLQCD preparing inputs and collecting the result; I think it could be reduced by using `resident_solution` and accumulating the momentum on the GPU, but so far it is not a significant amount of the overall HMC cost.
I fixed a few problems that were still present in the tests. There are now three instantiations of the test parameters, each guarded by the condition that the corresponding operator is compiled in. I hope this is OK.
@maddyscientist what could be causing the incorrect timing of `computeTMCloverForceQuda` that we seem to see now that the ND force has been offloaded?
```
computeTMCloverForceQuda Total time = 380.787 secs
  download =   95.652 secs (  25.119%), with   2582 calls at 3.705e+04 us per call
  upload   =   83.164 secs (  21.840%), with   1033 calls at 8.051e+04 us per call
  init     =   14.246 secs (   3.741%), with  26165 calls at 5.445e+02 us per call
  compute  = 8001.109 secs (2101.201%), with 292924 calls at 2.731e+04 us per call
  comms    =   54.024 secs (  14.187%), with   6426 calls at 8.407e+03 us per call
  free     =   21.058 secs (   5.530%), with 236599 calls at 8.901e+01 us per call
  total accounted = 8269.254 secs (2171.620%)
  total missing   = -7888.466 secs (-2071.620%)
WARNING: Accounted time 8269.254 secs in computeTMCloverForceQuda is greater than total time 380.787 secs
```
Compare to the correct timing with just the light force offloaded:
```
computeTMCloverForceQuda Total time = 170.504 secs
  download =  30.422 secs ( 17.842%), with   1033 calls at 2.945e+04 us per call
  upload   =  51.064 secs ( 29.949%), with    631 calls at 8.093e+04 us per call
  init     =   6.695 secs (  3.926%), with  11704 calls at 5.720e+02 us per call
  compute  =  36.642 secs ( 21.491%), with  22085 calls at 1.659e+03 us per call
  comms    =  32.586 secs ( 19.112%), with   2524 calls at 1.291e+04 us per call
  free     =  10.476 secs (  6.144%), with 129818 calls at 8.070e+01 us per call
  total accounted = 167.885 secs ( 98.464%)
  total missing   =   2.619 secs (  1.536%)
```
I think some timers were not stopped when multiple fermions were passed to `computeTMCloverForceQuda`. This never happens in the light sector in tmLQCD.
@maddyscientist I found a similar problem in `loadCloverQuda`, which I hope I solved with a65cafb.
I'm traveling this week, but will get to final testing next week. It looks like almost all issues have been resolved. Thanks for catching the missing timer stop in `loadCloverQuda`, @Marcogarofalo.
This looks good now, and I think it's just about ready to be merged. I fixed a couple of issues and pushed the changes. Namely, whenever we specialize kernels, as with the doublet specialization here, we need to ensure that the different variants have different `tuneKey` values so that they use separate tunings. This ensures that the optimal parameters are used for each specialization and, more importantly, prevents an invalid launch configuration from being used: the block/grid parameters tuned for the singlet may not be valid for the doublet.
Anything left from your end @Marcogarofalo?
@maddyscientist thank you for fixing the `tuneKey` issue. I do not have anything left open from my side.
cscs-ci run
cscs-ci run
I finished a quick visual review. I have no fundamental issues with the PR; this is fantastic work. All of my comments are pedantic, stylistic things, and getting them addressed before merging isn't a hill I'll die on.
Thank you @maddyscientist and @weinbe2 for all the support and help.
These are some changes I made to support the fermionic forces for the clover twisted-mass non-degenerate fermion doublet. So far I have only focused on getting the same result as in tmLQCD, so the changes I made are probably not optimal.