mild failures in some DW ctests

jcosborn commented 10 months ago

invert_test_mobius_eofa_sym_single and invert_test_domain_wall_double from the develop branch each fail a few tests (4 each) on A100 and GeForce because the host-computed residual norm exceeds the tolerance.

An example from invert_test_mobius_eofa_sym_single is:

[ RUN      ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark
Solution = mat_pc, Solve = normop_pc, Solver = cg, Sloppy precision = single
WARNING: CG: Restarting without reliable updates for heavy-quark residual (total #inc 1)
CG: Convergence at 72 iterations, L2 relative residual: iterated = 6.671841e-07, true = 6.671841e-07 (requested = 1.000000e-06), heavy-quark residual = 7.242592e-07 (requested = 1.000000e-06)
Done: 72 iter / 0.017796 secs = 24.9551 Gflops
Residuals: (L2 relative) tol 1.000000e-06, QUDA = 6.671841e-07, host = 1.350112e-06; (heavy-quark) tol 1.000000e-06, QUDA = 7.242592e-07
/home/osborn/lqcd/src/quda-git/tests/invert_test_gtest.hpp:144: Failure
Expected: (rsd[0]) <= (tol), actual: 1.35011e-06 vs 1e-06
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5) (21 ms)

An example from invert_test_domain_wall_double is:

[ RUN      ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_l2_heavy_quark
Solution = mat_pc, Solve = normop_pc, Solver = cg, Sloppy precision = double
WARNING: CG: Restarting without reliable updates for heavy-quark residual (total #inc 1)
CG: Convergence at 71 iterations, L2 relative residual: iterated = 6.754308e-13, true = 6.754308e-13 (requested = 1.000000e-12), heavy-quark residual = 7.197039e-13 (requested = 1.000000e-12)
Done: 71 iter / 0.016504 secs = 23.8195 Gflops
Residuals: (L2 relative) tol 1.000000e-12, QUDA = 6.754308e-13, host = 1.091590e-12; (heavy-quark) tol 1.000000e-12, QUDA = 7.197039e-13
/home/osborn/lqcd/src/quda-git/tests/invert_test_gtest.hpp:144: Failure
Expected: (rsd[0]) <= (tol), actual: 1.09159e-12 vs 1e-12
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 1, (-2147483648, -2147483648, -2147483648), 5) (19 ms)

I also see some cases where the QUDA residual is larger than the tolerance, but the test still passes. An example from invert_test_mobius_eofa_sym_single is:

[ RUN      ] EvenOdd/InvertTest.verify/cgnr_mat_pc_direct_pc_single_l2
Solution = mat_pc, Solve = direct_pc, Solver = cgnr, Sloppy precision = single
CG: Convergence at 70 iterations, L2 relative residual: iterated = 9.426984e-07, true = 9.426984e-07 (requested = 1.000000e-06)
CGNR: Convergence at 70 iterations, L2 relative residual: iterated = 1.831966e-06, true = 1.831966e-06 (requested = 1.000000e-06)
Done: 70 iter / 0.017316 secs = 25.0976 Gflops
Residuals: (L2 relative) tol 1.000000e-06, QUDA = 1.831966e-06, host = 1.831628e-06; (heavy-quark) tol 0.000000e+00, QUDA = 1.025106e-06
[       OK ] EvenOdd/InvertTest.verify/cgnr_mat_pc_direct_pc_single_l2 (20 ms)

hummingtree commented 10 months ago

@jcosborn I tried but cannot reproduce the error you are seeing.

For the first and second cases, I believe the error comes from a discrepancy between the CPU and GPU computations (the reduction over the 5th dimension). I will file a PR to relax the tolerance by a factor of the square root of Ls.
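A minimal sketch of what the proposed relaxation would mean for the two failing checks quoted in this issue. The function names are hypothetical and are not part of QUDA's test code, and `LS = 4` is a placeholder, since the logs do not state the fifth-dimension extent used by these tests.

```python
import math


def relaxed_tolerance(tol: float, ls: int) -> float:
    """Tolerance loosened by a factor of sqrt(Ls), as proposed in this thread."""
    return tol * math.sqrt(ls)


def host_check_passes(host_residual: float, tol: float, ls: int) -> bool:
    """True if the host residual is within the relaxed tolerance."""
    return host_residual <= relaxed_tolerance(tol, ls)


# Host residuals and tolerances copied from the two failing logs in this issue.
# LS is an assumed value for illustration only.
LS = 4
print(host_check_passes(1.350112e-06, 1.0e-06, LS))  # single-precision case
print(host_check_passes(1.091590e-12, 1.0e-12, LS))  # double-precision case
```

Both reported host residuals exceed the tolerance by less than a factor of 1.4, so under this sketch they would pass for any Ls of 2 or more.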
For the third (non-error) case you mentioned, it is intended that no error is reported: for CGNR the true error and the iterative error are meant to be different.
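To illustrate why the two numbers can differ, here is a small sketch under the textbook view of CGNR: CG is applied to the normal equations A†A x = A†b, so the quantity the inner solver drives down is the normal-equation residual, while the residual of the original system b - A x can be larger. The 2x2 real matrix and the helper names are made up for illustration; whether QUDA defines its reported residuals exactly this way is an assumption.

```python
import math


def matvec(a, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    """Transpose of a real matrix (plays the role of the adjoint here)."""
    return [list(col) for col in zip(*a)]


def norm(v):
    """Euclidean (L2) norm."""
    return math.sqrt(sum(vi * vi for vi in v))


def relative_residuals(a, b, x):
    """Return (normal-equation residual, original-system residual), both relative."""
    r = [bi - axi for bi, axi in zip(b, matvec(a, x))]
    at = transpose(a)
    normal = norm(matvec(at, r)) / norm(matvec(at, b))
    original = norm(r) / norm(b)
    return normal, original


# Toy system: the exact solution is [1, 10]; x is a deliberately inexact iterate.
A = [[1.0, 0.0], [0.0, 0.1]]
b = [1.0, 1.0]
x = [1.0, 9.0]

normal, original = relative_residuals(A, b, x)
print(f"normal-equation residual: {normal:.3e}")
print(f"original-system residual: {original:.3e}")
```

In this toy case the normal-equation residual is roughly seven times smaller than the original-system residual, so a solver that stops on the former can report convergence while the latter is still above the requested tolerance.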

jcosborn commented 10 months ago

Thanks for looking into it. I saw this on a few different systems with different setups, so I thought it would be easier to reproduce. For completeness, here's the configure output for the A100 build. I can try again with different SDKs to see if that changes it.

-- QUDA 1.1.0 (0c016af70) **
-- cmake version: 3.27.2
-- Source location: /home/osborn/lqcd/src/quda-git
-- Build location: /home/osborn/lqcd/build/quda-git
-- Build type: RELEASE
-- QUDA target: CUDA
-- The CXX compiler identification is NVHPC 23.7.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /soft/compilers/nvhpc/Linux_x86_64/23.7/compilers/bin/nvc++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The C compiler identification is NVHPC 23.7.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /soft/compilers/nvhpc/Linux_x86_64/23.7/compilers/bin/nvc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CPM: Adding package Eigen@3.4.0 (3.4.0)
-- QUDA_mdw_fused_Ls=4,8,12,16,20
-- QUDA_MULTIGRID_NVEC_LIST=6,24,32
-- QUDA_MULTIGRID_MRHS_LIST=16
-- Found CUDAToolkit: /soft/compilers/nvhpc/Linux_x86_64/23.7/cuda/12.2/include (found version "12.2.91")
-- Building QUDA for GPU ARCH sm_80
-- The CUDA compiler identification is NVIDIA 12.2.91
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /soft/compilers/nvhpc/Linux_x86_64/23.7/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CUDA Compiler is/soft/compilers/nvhpc/Linux_x86_64/23.7/compilers/bin/nvcc
-- Compiler ID is NVIDIA
-- CUDA Build Type: NVCC
-- Heterogeneous atomics supported: ON
-- QUDA_MULTIGRID_MRHS_LIST=16
-- Performing Test QUDA_LINKER_COMPRESS
-- Performing Test QUDA_LINKER_COMPRESS - Success
-- Performing Test QUDA_COMPRESS_DEBUG
-- Performing Test QUDA_COMPRESS_DEBUG - Failed

jcosborn commented 10 months ago

I get the same test failures with CUDA 11.6.2 + GCC 11.1 and CUDA 12.0.0 + GCC 12.2.

mild failures in some DW ctests #1410