lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

apparent "number of iterations regression" in double-half CG for (nd)tmclover #1433

Closed: kostrzewa closed this issue 5 months ago

kostrzewa commented 5 months ago

While testing force offloading I seem to have stumbled upon an issue with double-half CG: there is a substantial increase in the number of iterations between QUDA commit 02391b124, which we have been using for production quite a lot, and b87195b38, which I've been using to test the fermionic force offloading. I observe this issue both for tmclover and ndtmclover. I will need a couple of comments to describe it fully, so please bear with me.

Note that I had the actual offloading of the force disabled in all of these tests.

Let me start with an inversion in the force calculation for a tmclover determinant ratio with non-zero rho, so this is for a substantial shift in the spectrum via rho.
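
For orientation, this kind of solve goes through QUDA's standard mixed-precision CG via the C interface in quda.h. Below is a minimal sketch of the precision-related QudaInvertParam fields involved; the wrapper function name, the solve/matpc choices and all numerical values are placeholders for illustration, not the exact settings tmLQCD uses for the runs quoted below.

// Sketch of a double-half CG solve through the QUDA C interface (quda.h).
// Gauge and clover fields are assumed to have been loaded already via
// loadGaugeQuda() / loadCloverQuda(); most QudaInvertParam fields are omitted.
#include <quda.h>

void solve_double_half(void *x, void *b)  // x, b: host spinor fields (placeholders)
{
  QudaInvertParam inv_param = newQudaInvertParam();

  inv_param.inv_type         = QUDA_CG_INVERTER;
  inv_param.dslash_type      = QUDA_TWISTED_CLOVER_DSLASH;
  inv_param.solve_type       = QUDA_NORMOP_PC_SOLVE;   // CG on the e/o-preconditioned normal operator
  inv_param.matpc_type       = QUDA_MATPC_EVEN_EVEN;   // placeholder preconditioning choice

  inv_param.cuda_prec        = QUDA_DOUBLE_PRECISION;  // outer / reliable-update precision
  inv_param.cuda_prec_sloppy = QUDA_HALF_PRECISION;    // "double-half": half-precision sloppy solve

  inv_param.tol      = 1e-11;   // placeholder target residual
  inv_param.maxiter  = 25000;   // placeholder
  inv_param.pipeline = 0;       // see the resolution at the end of this thread

  invertQuda(x, b, &inv_param);
}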

02391b124

# QUDA: CG: Convergence at 1456 iterations, L2 relative residual: iterated = 4.633279e-11, true = 4.633279e-11 (requested = 4.652313e-11)
# TM_QUDA: Time for invertQuda 6.169621e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio1:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 3.122351e-03 s level: 4 proc_id: 0 /HMC/cloverdetratio1:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_fromQuda
# TM_QUDA: QpQm solve done: 1456 iter / 0.603779 secs = 6268.97 Gflops

b87195b38

# QUDA: CG: Convergence at 1467 iterations, L2 relative residual: iterated = 4.647628e-11, true = 4.647628e-11 (requested = 4.652313e-11)
# TM_QUDA: Time for invertQuda 6.904932e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio1:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.871629e-03 s level: 4 proc_id: 0 /HMC/cloverdetratio1:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_fromQuda
# TM_QUDA: QpQm solve done: 1467 iter / 0.690324 secs = 5626.56 Gflops

Seems perfectly acceptable.

Moving to a determinant ratio with rho = 0.0 (everything else being the same):

02391b124

# QUDA: CG: Convergence at 7673 iterations, L2 relative residual: iterated = 4.594649e-11, true = 4.594649e-11 (requested = 4.650678e-11)
# TM_QUDA: Time for invertQuda 3.155736e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.945675e-03 s level: 4 proc_id: 0 /HMC/cloverdetratio2:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_fromQuda
# TM_QUDA: QpQm solve done: 7673 iter / 3.14216 secs = 6309.95 Gflops

b87195b38

# QUDA: CG: Convergence at 18814 iterations, L2 relative residual: iterated = 4.647872e-11, true = 4.647872e-11 (requested = 4.650678e-11)
# TM_QUDA: Time for invertQuda 8.945462e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.953089e-03 s level: 4 proc_id: 0 /HMC/cloverdetratio2:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_fromQuda
# TM_QUDA: QpQm solve done: 18814 iter / 8.94527 secs = 5550.27 Gflops

Oh dear. This is on a quad-A100 system (single node) on a 24c48 lattice.

kostrzewa commented 5 months ago

UPDATE: I think this is unrelated to the above as it appears to be a problem with P2P and the "unofficial" ROCm 5.6.1 on LUMI-G. Working with https://github.com/lattice/quda/commit/02391b124e38addc6cd70c722bf93b344d9b8052 and ROCm 5.6.1 with P2P enabled leads to the same issue in double-half refinement for our ND monomials.


Here's what happens in a large-scale run on LUMI-G on a 128c256 lattice, in the heatbath step for an ND tmclover monomial using the partial fractions with the 4 largest shifts in our RHMC.

I'm using a single-precision solve with double-half refinement here.
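
For context, this path goes through invertMultiShiftQuda(): the multi-shift solve runs at the sloppy precision and each shift is then polished by a separate CG at the refinement precision. A minimal sketch of the precision- and shift-related fields, with placeholder values and a hypothetical wrapper function, looks roughly like this:

// Sketch of a single-precision multi-shift CG with double-half per-shift
// refinement through the QUDA C interface. Shift values and tolerances are
// placeholders; the non-degenerate-doublet and clover settings are omitted.
#include <quda.h>

void solve_mshift_refined(void **x, void *b, const double *shifts, int nshifts)
{
  QudaInvertParam inv_param = newQudaInvertParam();

  inv_param.inv_type                    = QUDA_CG_INVERTER;
  inv_param.dslash_type                 = QUDA_TWISTED_CLOVER_DSLASH;
  inv_param.cuda_prec                   = QUDA_DOUBLE_PRECISION;
  inv_param.cuda_prec_sloppy            = QUDA_SINGLE_PRECISION;  // precision of the multi-shift solve
  inv_param.cuda_prec_refinement_sloppy = QUDA_HALF_PRECISION;    // "double-half" per-shift refinement

  inv_param.num_offset = nshifts;                 // e.g. the 4 largest partial-fraction shifts
  for (int i = 0; i < nshifts; i++) {
    inv_param.offset[i]     = shifts[i];          // placeholder shift values
    inv_param.tol_offset[i] = 1e-11;              // per-shift target residual, as in the logs
  }

  invertMultiShiftQuda(x, b, &inv_param);         // x: one host solution field per shift
}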

https://github.com/lattice/quda/commit/02391b124e38addc6cd70c722bf93b344d9b8052

# TM_QUDA: mu = 0.087911000000, epsilon = 0.086224000000 kappa = 0.137972174000, csw = 1.611200000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 4.900265e-02 s level: 5 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/reorder_spinor_eo_toQuda
MultiShiftCG: Converged after 23 iterations
MultiShiftCG:  shift=0, 23 iterations, relative residual: iterated = 1.378145e-05
MultiShiftCG:  shift=1, 23 iterations, relative residual: iterated = 5.730120e-10
MultiShiftCG:  shift=2, 11 iterations, relative residual: iterated = 8.851856e-10
MultiShiftCG:  shift=3, 5 iterations, relative residual: iterated = 8.974634e-10
# QUDA: Refining shift 0: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 34 iterations, L2 relative residual: iterated = 7.411007e-12, true = 7.411007e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 1: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 10 iterations, L2 relative residual: iterated = 7.232123e-12, true = 7.232123e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 2: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 5 iterations, L2 relative residual: iterated = 3.434280e-12, true = 3.434280e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 3: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 2 iterations, L2 relative residual: iterated = 5.669652e-12, true = 5.669652e-12 (requested = 1.000000e-11)
# TM_QUDA: Time for invertMultiShiftQuda 3.061032e+00 s level: 5 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/invertMultiShiftQuda
[...]
# TM_QUDA: QpQm solve done: 74 iter / 2.53316 secs = 162961 Gflops

As you can see, this converges very quickly.

Instead, here is what I get with a newer QUDA commit (unfortunately not https://github.com/lattice/quda/commit/b87195b38416648885c1d602dd6395b9a60ee269 but 273d4fe8d):

273d4fe8d

# TM_QUDA: mu = 0.087911000000, epsilon = 0.086224000000 kappa = 0.137972174000, csw = 1.611200000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 4.764375e-02 s level: 5 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/reorder_spinor_eo_toQuda
MultiShiftCG: Saving 356 sets of cached parameters to /scratch/project_465000726/bartek/tests/hmc_128c256_tests/jobscript/p2p3_gdr1_gnu_env_23_09_rocm_561/tunecache.tsv
MultiShiftCG: Converged after 23 iterations
MultiShiftCG:  shift=0, 23 iterations, relative residual: iterated = 1.404610e-05
MultiShiftCG:  shift=1, 23 iterations, relative residual: iterated = 5.944510e-10
MultiShiftCG:  shift=2, 11 iterations, relative residual: iterated = 8.809417e-10
MultiShiftCG:  shift=3, 5 iterations, relative residual: iterated = 8.976188e-10
# QUDA: Refining shift 0: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: WARNING: Unexpected regression when tuning candidates for N4quda17NdegTwistedCloverINS_20NdegTwistedCloverArgIsLi3ELi4EL21QudaReconstructType_s18EEEEE: (0.000602298 > 1.1 * 0.000527044)
# QUDA: WARNING: Exceeded maximum iterations 5000
# QUDA: CG: Convergence at 5000 iterations, L2 relative residual: iterated = 2.957992e-09, true = 2.957992e-09 (requested = 1.000000e-11)
# QUDA: Refining shift 1: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 10 iterations, L2 relative residual: iterated = 7.381786e-12, true = 7.381786e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 2: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 5 iterations, L2 relative residual: iterated = 3.318602e-12, true = 3.318602e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 3: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 2 iterations, L2 relative residual: iterated = 5.672100e-12, true = 5.672100e-12 (requested = 1.000000e-11)
# QUDA: Saving 540 sets of cached parameters to /scratch/project_465000726/bartek/tests/hmc_128c256_tests/jobscript/p2p3_gdr1_gnu_env_23_09_rocm_561/tunecache.tsv
# TM_QUDA: Time for invertMultiShiftQuda 7.038127e+01 s level: 5 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/invertMultiShiftQuda
[...]
# TM_QUDA: QpQm solve done: 5040 iter / 70.3788 secs = 359312 Gflops

For the smallest of these shifts (shift 0) the refinement does not even converge, even though in absolute terms this shift is still really very large.

kostrzewa commented 5 months ago

The very annoying thing is that I can't really reproduce this behaviour everywhere and in all situations.

On the same machine (quad-A100, single node) as in https://github.com/lattice/quda/issues/1433#issue-2097950351, on a 24c48 lattice running an "nf=1+1" non-degenerate twisted clover RHMC with three monomials split over three time scales and employing single-precision multi-shift with subsequent double-half refinement, I get essentially compatible behaviour between https://github.com/lattice/quda/commit/02391b124e38addc6cd70c722bf93b344d9b8052 (left) and https://github.com/lattice/quda/commit/b87195b38416648885c1d602dd6395b9a60ee269 (right):

# TM_QUDA: QpQm solve done: 62 iter / 0.061139 secs = 7031.55 Gflops |  # TM_QUDA: QpQm solve done: 62 iter / 69.6087 secs = 1263.4 Gflops   (TUNING on RHS)
# TM_QUDA: QpQm solve done: 433 iter / 0.304325 secs = 7979.13 Gflop |  # TM_QUDA: QpQm solve done: 433 iter / 0.41612 secs = 5967.94 Gflops
# TM_QUDA: QpQm solve done: 2036 iter / 1.4395 secs = 7688.16 Gflops |  # TM_QUDA: QpQm solve done: 2032 iter / 1.63329 secs = 6879.78 Gflop
# TM_QUDA: QpQm solve done: 3085 iter / 2.22045 secs = 7647.43 Gflop |  # TM_QUDA: QpQm solve done: 3087 iter / 2.72416 secs = 6364.03 Gflop
# TM_QUDA: QpQm solve done: 3978 iter / 2.80133 secs = 7792.77 Gflop |  # TM_QUDA: QpQm solve done: 3976 iter / 3.23655 secs = 6864.81 Gflop
# TM_QUDA: QpQm solve done: 3069 iter / 2.20632 secs = 7656.9 Gflops |  # TM_QUDA: QpQm solve done: 3072 iter / 2.58824 secs = 6657.26 Gflop
# TM_QUDA: QpQm solve done: 3958 iter / 2.78638 secs = 7795.68 Gflop |  # TM_QUDA: QpQm solve done: 3961 iter / 3.22644 secs = 6862.15 Gflop
# TM_QUDA: QpQm solve done: 1095 iter / 0.929441 secs = 6451.43 Gflo |  # TM_QUDA: QpQm solve done: 1097 iter / 1.06853 secs = 5725.28 Gflop
# TM_QUDA: QpQm solve done: 311 iter / 0.234165 secs = 7388.53 Gflop |  # TM_QUDA: QpQm solve done: 312 iter / 0.317038 secs = 5613.07 Gflop
# TM_QUDA: QpQm solve done: 48 iter / 0.042635 secs = 7324.08 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.123382 secs = 2751.01 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.047692 secs = 6547.48 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.134458 secs = 2524.4 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.04713 secs = 6625.55 Gflops  |  # TM_QUDA: QpQm solve done: 48 iter / 0.129718 secs = 2616.64 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.047155 secs = 6622.04 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.129624 secs = 2618.54 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.047122 secs = 6626.68 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.129388 secs = 2623.31 Gflops
# TM_QUDA: QpQm solve done: 312 iter / 0.247586 secs = 7009.56 Gflop |  # TM_QUDA: QpQm solve done: 312 iter / 0.334704 secs = 5316.81 Gflop
# TM_QUDA: QpQm solve done: 48 iter / 0.042963 secs = 7268.17 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.123824 secs = 2741.19 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.047345 secs = 6595.46 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.129276 secs = 2625.59 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.047191 secs = 6616.99 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.127019 secs = 2672.24 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.047401 secs = 6587.67 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.127782 secs = 2656.29 Gflops
# TM_QUDA: QpQm solve done: 311 iter / 0.25027 secs = 6913.07 Gflops |  # TM_QUDA: QpQm solve done: 312 iter / 0.327226 secs = 5438.31 Gflop
# TM_QUDA: QpQm solve done: 48 iter / 0.042919 secs = 7275.62 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.123362 secs = 2751.46 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.047261 secs = 6607.19 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.129535 secs = 2620.34 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.04713 secs = 6625.55 Gflops  |  # TM_QUDA: QpQm solve done: 48 iter / 0.129128 secs = 2628.6 Gflops
# TM_QUDA: QpQm solve done: 48 iter / 0.047052 secs = 6636.54 Gflops |  # TM_QUDA: QpQm solve done: 48 iter / 0.129753 secs = 2615.94 Gflops

Here the iteration counts reflect the sum of the iterations performed by the initial single-precision multi-shift solve and the subsequent double-half refinements; different lines correspond to different monomials at various points along the trajectory. The two runs use the same starting configuration and the same random numbers for the momenta, which is also reflected in the histories: dH is compatible up to and including around the 4th decimal place and the plaquette up to the 10th decimal place or so after O(20) trajectories.

I did, however, notice the significant difference in timing and measured performance in the last couple of lines, where only very few iterations were performed.

The two builds are identically configured. I can check later whether the profile_*.tsv files show any major differences at the kernel level.
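
Something along these lines could serve as a first pass at that comparison. This is only a rough sketch: it hypothetically assumes the kernel name sits in the first tab-separated column and the accumulated time in a column index passed on the command line, so it would need to be adapted to the actual layout of the profile files.

// Rough sketch for comparing two QUDA profile TSV dumps at the kernel level.
// ASSUMPTIONS (hypothetical): column 0 holds the kernel name and column
// `time_col` an accumulated time; adapt to the real file layout.
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

static std::map<std::string, double> read_profile(const std::string &path, int time_col) {
  std::map<std::string, double> t;
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    std::stringstream ss(line);
    std::vector<std::string> cols;
    std::string col;
    while (std::getline(ss, col, '\t')) cols.push_back(col);
    if ((int)cols.size() <= time_col) continue;      // skip short / malformed lines
    try {
      t[cols[0]] += std::stod(cols[time_col]);
    } catch (...) { /* header or non-numeric entry, ignore */ }
  }
  return t;
}

int main(int argc, char **argv) {
  if (argc < 4) { std::cerr << "usage: " << argv[0] << " old.tsv new.tsv time_col\n"; return 1; }
  const int time_col = std::stoi(argv[3]);
  auto a = read_profile(argv[1], time_col);
  auto b = read_profile(argv[2], time_col);
  // print kernels whose accumulated time differs by more than 20% between the two builds
  for (const auto &[kernel, ta] : a) {
    auto it = b.find(kernel);
    if (it == b.end()) continue;
    const double tb = it->second;
    if (tb > 1.2 * ta || ta > 1.2 * tb)
      std::cout << kernel << "\t" << ta << "\t" << tb << "\n";
  }
  return 0;
}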

kostrzewa commented 5 months ago

Okay, I've figured out that the culprit was actually a difference between our input files for the two test cases in https://github.com/lattice/quda/issues/1433#issue-2097950351 above. The "working" run had inv_param.pipeline = 0 set, while the one with poor convergence had inv_param.pipeline > 0.
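
For reference, this is the pipeline member of QudaInvertParam, so the difference between the two input files boils down to something like the following sketch (the non-zero value here is a placeholder, not the one from our input file):

// Sketch: the setting that differed between the two test cases. A value of 0
// selects the standard CG update path, while a non-zero value enables QUDA's
// pipelined solver path (fewer global sums), which is the one that showed the
// poor convergence above.
inv_param.pipeline = 0;    // "working" input file
// inv_param.pipeline = 8; // placeholder non-zero value, as in the problematic input file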

I'll have to re-evaluate whether we need to set this on a per-monomial basis. Any recommendations, @maddyscientist?