lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Feature/compute clover force quda ndg #1452

Closed: Marcogarofalo closed this 2 months ago

Marcogarofalo commented 3 months ago

These are some changes I made to support the fermionic forces for the clover twisted-mass non-degenerate fermion doublet. So far I have only focused on getting the same result as in tmLQCD, so the changes I made are probably not optimal.

Marcogarofalo commented 3 months ago

I think that the CI actions do not trigger the test for the ndeg doublet. I tried locally to run

mpirun -n 2 $1 ./tests/clover_force_test \
        --dslash-type twisted-clover \
        --flavor nondeg-doublet \
        --compute-clover 1 \
        --matpc odd-odd-asym \
        --dagger \
        --solution-type mat-pc-dag-mat-pc \
        --dim 2 4 6 8 --prec ${prec} \
        --gridsize 2 1 1 1 \
        --niter 1 \
        --compute-clover-trlog 0 \
        --verbosity debug \
        --enable-testing 0 \
        --determinant-ratio 0 \
        --nsrc 1

and it passes, but I have problems if I set --enable-testing 1: another Dirac operator is called. From make test the test is launched with --enable-testing 1.

I am also trying to measure the performance in some realistic setup.

maddyscientist commented 3 months ago

@Marcogarofalo when you set enable-testing, the predefined tests at the bottom of clover_force_test.cpp are run. What you need to do is add the non-degenerate case to this list. What I would suggest is adding a singlet/doublet option to the testing tuple, e.g., an extra bool

using test_t = ::testing::tuple<QudaPrecision, QudaDslashType, bool, int, bool>;

and then instantiating both the singlet and doublet options in the test outer product:

INSTANTIATE_TEST_SUITE_P(CloverForceTest, CloverForceTest,
                         Combine(Values(QUDA_SINGLE_PRECISION, QUDA_DOUBLE_PRECISION),
                                 Values(QUDA_CLOVER_WILSON_DSLASH, QUDA_TWISTED_CLOVER_DSLASH), Values(false, true),
                                 Values(1, 8), Values(false, true)),
                         gettestname);

This will add a nonsense test for the doublet with regular clover fermions. You could deal with this either by skipping the test when the clover/doublet combination is encountered (see the sketch after the instantiations below), or by splitting the instantiation into separate clover and twisted-clover cases:

// regular clover (singlet only)
INSTANTIATE_TEST_SUITE_P(CloverForceTest, CloverForceTest,
                         Combine(Values(QUDA_SINGLE_PRECISION, QUDA_DOUBLE_PRECISION),
                                 Values(QUDA_CLOVER_WILSON_DSLASH), Values(false, true),
                                 Values(1, 8), Values(false)),
                         gettestname);

// twisted clover (singlet and doublet)
INSTANTIATE_TEST_SUITE_P(CloverForceTest, CloverForceTest,
                         Combine(Values(QUDA_SINGLE_PRECISION, QUDA_DOUBLE_PRECISION),
                                 Values(QUDA_TWISTED_CLOVER_DSLASH), Values(false, true),
                                 Values(1, 8), Values(false, true)),
                         gettestname);
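
For the skip-based alternative, a minimal sketch of the test body, assuming the tuple order above (the accessor indices and skip condition are illustrative, not the code as committed):

TEST_P(CloverForceTest, verify)
{
  // pull the parameters for this instantiation out of the tuple
  QudaDslashType dslash_type = ::testing::get<1>(GetParam());
  bool doublet = ::testing::get<4>(GetParam());

  // the doublet force is only meaningful for twisted clover fermions,
  // so skip the combination the outer product generates for regular clover
  if (doublet && dslash_type == QUDA_CLOVER_WILSON_DSLASH)
    GTEST_SKIP() << "doublet is not defined for Wilson-clover";

  // ... run the force computation and compare against the reference
}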

Hope this helps

Marcogarofalo commented 3 months ago

@maddyscientist thank you, I hope I can fix the tests with your suggestions. Some parameters were initialised from the command-line arguments rather than from INSTANTIATE_TEST_SUITE_P. Now, when the test is called with --enable-testing 1, the parameters from INSTANTIATE_TEST_SUITE_P overwrite their command-line counterparts, at least the ones I checked.
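
A minimal sketch of that override in the fixture (the members here stand in for the file-scope globals that the CLI parser sets in the real clover_force_test.cpp; names are illustrative):

class CloverForceTest : public ::testing::TestWithParam<test_t>
{
protected:
  QudaPrecision prec;         // stand-ins for the test's globals,
  QudaDslashType dslash_type; // normally set from --prec / --dslash-type

  void SetUp() override
  {
    // values from INSTANTIATE_TEST_SUITE_P take precedence over any
    // --prec / --dslash-type given on the command line
    prec = ::testing::get<0>(GetParam());
    dslash_type = ::testing::get<1>(GetParam());
  }
};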

Regarding performance, I tried a 64^3x128 lattice, integrating 3 orders of the rational approximation (--nsrc 3 in the test) on 4 nodes of JUWELS Booster, and got:

tmLQCD native: 10.2713
tmLQCD + QUDA: 8.560943e-01

of which about half (4.157508e-01) was spent in computeTMCloverForceQuda. The rest is spent in tmLQCD preparing the input and collecting the result; I think that can be reduced by using resident_solution and accumulating the momentum on the GPU but, so far, it is not a significant fraction of the overall HMC cost.

Marcogarofalo commented 3 months ago

I fixed a few problems that were still present in the tests. Now there are three instantiations of the test parameters, each guarded by the condition that the corresponding operator is compiled in; I hope this is ok.

kostrzewa commented 2 months ago

@maddyscientist What could be causing the incorrect timing of computeTMCloverForceQuda that we seem to see now that the ND force has been offloaded?

   computeTMCloverForceQuda Total time =   380.787 secs
                 download     =    95.652 secs ( 25.119%),       with     2582 calls at 3.705e+04 us per call
                   upload     =    83.164 secs ( 21.840%),       with     1033 calls at 8.051e+04 us per call
                     init     =    14.246 secs (  3.741%),       with    26165 calls at 5.445e+02 us per call
                  compute     =  8001.109 secs (2101.201%),      with   292924 calls at 2.731e+04 us per call
                    comms     =    54.024 secs ( 14.187%),       with     6426 calls at 8.407e+03 us per call
                     free     =    21.058 secs (  5.530%),       with   236599 calls at 8.901e+01 us per call
        total accounted       =  8269.254 secs (2171.620%)
        total missing         = -7888.466 secs (-2071.620%)
WARNING: Accounted time  8269.254 secs in computeTMCloverForceQuda is greater than total time   380.787 secs

Compared to the correct timing with just the light force offloaded:

   computeTMCloverForceQuda Total time =   170.504 secs
                 download     =    30.422 secs ( 17.842%),       with     1033 calls at 2.945e+04 us per call
                   upload     =    51.064 secs ( 29.949%),       with      631 calls at 8.093e+04 us per call
                     init     =     6.695 secs (  3.926%),       with    11704 calls at 5.720e+02 us per call
                  compute     =    36.642 secs ( 21.491%),       with    22085 calls at 1.659e+03 us per call
                    comms     =    32.586 secs ( 19.112%),       with     2524 calls at 1.291e+04 us per call
                     free     =    10.476 secs (  6.144%),       with   129818 calls at 8.070e+01 us per call
        total accounted       =   167.885 secs ( 98.464%)
        total missing         =     2.619 secs (  1.536%)

Marcogarofalo commented 2 months ago

I think some timers were not stopped when multiple fermions were passed to computeTMCloverForceQuda; this case never arises in the light sector in tmLQCD.

@maddyscientist I found a similar problem in loadCloverQuda, which I hope I have solved with a65cafb.
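
For reference, this failure mode is an unbalanced start/stop when looping over flavors. A generic sketch of the pattern (a toy Timer, not QUDA's actual profiling API; the assertions make the imbalance fail loudly):

#include <cassert>
#include <chrono>

struct Timer {
  std::chrono::steady_clock::time_point t0;
  double total = 0.0; // accumulated seconds
  bool running = false;
  void start() { assert(!running); running = true; t0 = std::chrono::steady_clock::now(); }
  void stop()
  {
    assert(running); running = false;
    total += std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
  }
};

void compute_force(int /*flavor*/) {} // stub for the per-flavor work

void buggy(Timer &timer, int nvector)
{
  // for nvector > 1, start() fires again on a running timer and only the
  // final interval is ever closed, corrupting the per-category accounting
  for (int i = 0; i < nvector; i++) {
    timer.start();
    compute_force(i);
  }
  timer.stop();
}

void fixed(Timer &timer, int nvector)
{
  // every start() is paired with a stop() inside the loop
  for (int i = 0; i < nvector; i++) {
    timer.start();
    compute_force(i);
    timer.stop();
  }
}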

maddyscientist commented 2 months ago

I'm traveling this week, but will get to final testing next week. Looks like almost all issues have been resolved.

Thanks for catching the missing timer stop in loadCloverQuda, @Marcogarofalo.

maddyscientist commented 2 months ago

This looks good now, and I think it's just about ready to be merged. I fixed a couple of issues and pushed the changes. Namely, whenever we specialize kernels, as with the doublet specialization here, the different kernel variants need distinct tuneKey values so that they are tuned separately. This ensures that the optimal parameters are used for each specialization and, more importantly, prevents an invalid launch configuration from being used, e.g., the block/grid parameters for the singlet may not be valid for the doublet.
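
Roughly, the pattern is to fold the specialization into the aux string that feeds the tuning key (a sketch following QUDA's aux-string convention; the class and string contents here are illustrative, not the committed code):

#include <cstring>
#include <string>
#include <typeinfo>

// the doublet flag is appended to aux, so singlet and doublet variants
// are tuned and cached under distinct keys in the tunecache
struct CloverForceKernelSketch {
  char aux[128];
  CloverForceKernelSketch(bool doublet)
  {
    strcpy(aux, "clover_force");
    if (doublet) strcat(aux, ",doublet");
  }
  // stand-in for Tunable::tuneKey(), which in QUDA combines the volume
  // string, the kernel's type name, and the aux string
  std::string tuneKey() const { return std::string(typeid(*this).name()) + "," + aux; }
};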

Anything left from your end @Marcogarofalo?

Marcogarofalo commented 2 months ago

@maddyscientist thank you for fixing the issue with tuneKey. I do not have anything left open from my side.

maddyscientist commented 2 months ago

cscs-ci run

maddyscientist commented 2 months ago

cscs-ci run

weinbe2 commented 2 months ago

I finished a quick visual review; I have no fundamental issues with the PR, this is fantastic work. All of my comments are pedantic, stylistic things. Getting these addressed before merging isn't a hill I'll die on.

Marcogarofalo commented 2 months ago

Thank you @maddyscientist and @weinbe2 for all the support and help.