lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

`site unroll not supported for nSpin = 2 nColor = 32` in coarse-grid-deflated MG #1378

Open kostrzewa opened 1 year ago

kostrzewa commented 1 year ago

This is an issue we encountered quite some time ago but haven't had time to report until now.

When running coarse-grid-deflated MG from within tmLQCD using a relatively "recent" commit of QUDA's develop branch (32bb266c), I encounter:

MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 19, host nid006334, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=4x4x2x4,aux=GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)

I know that switching to much older commits "solves" this, so that's something we can explore if necessary (I don't know how compatible the current version of our interface is with these older QUDA versions).

I'm testing with higher verbosity to see what's going on, but perhaps you already have in mind a change from the past couple of months that could have caused this?

kostrzewa commented 1 year ago

Note that everything works fine when I disable coarse-grid deflation.

The failure seems to occur during the launch of the eigensolver:

[...]
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Computing Yhat field......
MG level 1 (GPU): ....done computing Yhat field
MG level 2 (GPU): Using randStateMRG32k3a
MG level 2 (GPU): Tuned block=(128,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.00 GB/s for N4quda7RNGInitE
 with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host nid005036, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
kostrzewa commented 1 year ago

Some more context:

# QUDA: QUDA Inverter Parameters:
# QUDA: struct_size = -2147483648
# QUDA: dslash_type = 10
# QUDA: inv_type = 2
# QUDA: kappa = 0.139427
# QUDA: mu = -0.00072
# QUDA: twist_flavor = 1
# QUDA: tm_rho = 0
# QUDA: tol = 1e-10
# QUDA: residual_type = 1
# QUDA: maxiter = 250
# QUDA: reliable_delta = 0.01
# QUDA: reliable_delta_refinement = 0.0001
# QUDA: use_alternative_reliable = 0
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: solution_accumulator_pipeline = 1
# QUDA: max_res_increase = 10
# QUDA: max_res_increase_total = 40
# QUDA: max_hq_res_increase = 1
# QUDA: max_hq_res_restart_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 24
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: split_grid[d] = 1
# QUDA: split_grid[d] = 1
# QUDA: split_grid[d] = 1
# QUDA: split_grid[d] = 1
# QUDA: num_src_per_sub_partition = 1
# QUDA: compute_action = 0
# QUDA: compute_true_res = 1
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: cuda_prec_refinement_sloppy = 8
# QUDA: cuda_prec_precondition = 2
# QUDA: cuda_prec_eigensolver = 2
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 2
# QUDA: gamma_basis = 2
# QUDA: dirac_order = 1
# QUDA: gcrNkrylov = 24
# QUDA: madwf_param_load = 0
# QUDA: madwf_param_save = 0
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: struct_size = -2147483648
# QUDA: clover_location = 2
# QUDA: clover_cpu_prec = 8
# QUDA: clover_cuda_prec = 8
# QUDA: clover_cuda_prec_sloppy = 4
# QUDA: clover_cuda_prec_refinement_sloppy = 8
# QUDA: clover_cuda_prec_precondition = 2
# QUDA: clover_cuda_prec_eigensolver = 2
# QUDA: compute_clover_trlog = 1
# QUDA: compute_clover = 1
# QUDA: compute_clover_inverse = 1
# QUDA: return_clover = 0
# QUDA: return_clover_inverse = 0
# QUDA: clover_rho = 0
# QUDA: clover_coeff = 0.235631
# QUDA: clover_csw = 0
# QUDA: clover_order = 9
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: cuda_prec_ritz = 4
# QUDA: n_ev = 8
# QUDA: max_search_dim = 64
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 1
# QUDA: eigcg_max_restarts = 4
# QUDA: max_restart_num = 3
# QUDA: tol_restart = 5e-05
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
# QUDA: chrono_use_resident = 0
# QUDA: chrono_make_resident = 0
# QUDA: chrono_replace_last = 0
# QUDA: chrono_max_dim = 0
# QUDA: chrono_index = 0
# QUDA: chrono_precision = 4
# QUDA: extlib_type = 1
# QUDA: native_blas_lapack = 1
# QUDA: use_mobius_fused_kernel = 1

and the setup process seems to work fine:

[...]
MG level 0 (GPU): CG: Convergence at 316 iterations, L2 relative residual: iterated = 4.996240e-07, true = 4.996240e-07 (requested = 5.000000e-07)
MG level 0 (GPU): Computing Y field......
MG level 0 (GPU): ....done computing Y field
MG level 0 (GPU): Computing Yhat field......
MG level 0 (GPU): ....done computing Yhat field
MG level 1 (GPU): WARNING: Exceeded maximum iterations 1500
MG level 1 (GPU): CG: Convergence at 1500 iterations, L2 relative residual: iterated = 1.924570e-06, true = 1.935714e-06 (requested = 5.000000e-07)
[...]
MG level 1 (GPU): WARNING: Exceeded maximum iterations 1500
MG level 1 (GPU): CG: Convergence at 1500 iterations, L2 relative residual: iterated = 2.006127e-06, true = 2.013764e-06 (requested = 5.000000e-07)
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Computing Yhat field......
MG level 1 (GPU): ....done computing Yhat field
MG level 2 (GPU): Using randStateMRG32k3a
MG level 2 (GPU): Tuned block=(64,1,1), grid=(2,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.00 GB/s for N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host nid005910, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=4x4x2x4,aux=GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)
Local seed is 144041342  proc_id = 2

where I used CUDA_LAUNCH_BLOCKING=1.

kostrzewa commented 1 year ago

DEBUG_VERBOSE on level 2:

MG level 2 (GPU): Using randStateMRG32k3a
MG level 2 (GPU): Allocated array of random numbers with size: 0.01 MB
MG level 2 (GPU): PreTune N4quda7RNGInitE
MG level 2 (GPU): Tuning N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1 at vol=8x4x2x4
MG level 2 (GPU): About to call tunable.apply block=(64,1,1) grid=(2,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(64,1,1), grid=(2,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(128,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(128,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(192,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(192,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(256,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(256,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(320,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(320,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(384,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(384,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(448,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(448,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(512,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(512,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(576,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(576,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(640,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(640,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(704,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(704,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(768,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(768,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(832,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(832,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(896,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(896,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(960,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(960,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(1024,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(1024,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(64,2,1) grid=(2,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(64,2,1), grid=(2,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(128,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(128,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(192,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(192,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(256,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(256,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(320,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(320,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(384,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(384,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(448,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(448,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(512,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(512,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): Candidate tuning finished for N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1. Best time 0.000016 and now continuing with 62 iterations.
MG level 2 (GPU): About to call tunable.apply block=(192,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(192,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(256,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(256,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(64,1,1) grid=(2,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(64,1,1), grid=(2,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(128,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(128,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(64,2,1) grid=(2,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(64,2,1), grid=(2,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(128,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(128,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): Tuned block=(64,1,1), grid=(2,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.00 GB/s for N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): PostTune N4quda7RNGInitE
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host nid005327, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=4x4x2x4,aux=GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)
Local seed is 144041342  proc_id = 2
kostrzewa commented 1 year ago

Increasing the verbosity step by step, in particular on level 1, reveals more detail about where this is failing (likely because buffers are flushed more frequently):

MG level 2 (GPU): PostTune N4quda7RNGInitE
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 1 (GPU): Creating coarse solver wrapper
MG level 1 (GPU): Creating a CA-GCR solver
MG level 1 (GPU): Tuned block=(64,1,1), grid=(2,2,1), shared_bytes=6401, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 4.93 GB/s for N4quda11SpinorNoiseIfLi2ELi32EEE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,uniform
MG level 2 (GPU): Tuned block=(64,1,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 23.71 GB/s for hipMemsetAsync with zero,color_spinor_field.cpp,436
MG level 2 (GPU): Tuned block=(64,1,4), grid=(16,1,16), shared_bytes=8001, aux=(8,1,1,1) giving 548.10 Gflop/s, 292.32 GB/s for N4quda12DslashCoarseIfssLi2ELi32ELb0ELb1ELb0ELNS_10DslashTypeE2EEE with policy_kernel,GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,comm=0111,full,halo=00111111,n_rhs=1
MG level 2 (GPU): Tuned block=(64,1,4), grid=(16,1,16), shared_bytes=4572, aux=(8,1,1,1) giving 559.01 Gflop/s, 298.14 GB/s for N4quda12DslashCoarseIfssLi2ELi32ELb0ELb1ELb0ELNS_10DslashTypeE2EEE with policy_kernel,GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,comm=0111,full,halo=00222222,n_rhs=1
MG level 2 (GPU): Tuned block=(64,1,1), grid=(1,1,1), shared_bytes=0, aux=(0,0,0,0) giving 545.83 Gflop/s, 291.11 GB/s for N4quda22DslashCoarsePolicyTuneINS_18DslashCoarseLaunchILb0ELi32EEEEE with policy,clover,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,gauge_prec=2,halo_prec=2,comm=0111,topo=1244,p2p=0,gdr=1,nvshmem=0,pol=11110011111,full,n_rhs=1
MG level 2 (GPU): Tuned block=(16,16,1), grid=(16,1,1), shared_bytes=4096, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.97 GB/s for N4quda9GhostPackIfsL16QudaFieldOrder_s2ELi2ELi32EEE with GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,halo_prec=2,comm=0111,topo=1244,dest=00111111,nFace=1,spins_per_thread=2,colors_per_thread=2,shmem=0,batched
MG level 2 (GPU): Tuned block=(64,1,4), grid=(8,1,32), shared_bytes=16384, aux=(4,2,1,1) giving 1003.82 Gflop/s, 519.81 GB/s for N4quda12DslashCoarseIfssLi2ELi32ELb1ELb0ELb0ELNS_10DslashTypeE2EEE with policy_kernel,GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,comm=0111,full,halo=00111111,n_rhs=1
MG level 2 (GPU): Tuned block=(16,16,1), grid=(16,1,1), shared_bytes=4096, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.97 GB/s for N4quda9GhostPackIfsL16QudaFieldOrder_s2ELi2ELi32EEE with GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,halo_prec=2,comm=0111,topo=1244,dest=00222222,nFace=1,spins_per_thread=2,colors_per_thread=2,shmem=0,batched
MG level 2 (GPU): Tuned block=(64,1,4), grid=(8,1,16), shared_bytes=8001, aux=(4,1,1,1) giving 269.55 Gflop/s, 139.58 GB/s for N4quda12DslashCoarseIfssLi2ELi32ELb1ELb0ELb0ELNS_10DslashTypeE2EEE with policy_kernel,GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,comm=0111,full,halo=00222222,n_rhs=1
MG level 2 (GPU): Tuned block=(64,1,1), grid=(1,1,1), shared_bytes=0, aux=(1,0,0,0) giving 73.15 Gflop/s, 37.88 GB/s for N4quda22DslashCoarsePolicyTuneINS_18DslashCoarseLaunchILb0ELi32EEEEE with policy,dslash,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,gauge_prec=2,halo_prec=2,comm=0111,topo=1244,p2p=0,gdr=1,nvshmem=0,pol=11110011111,full,n_rhs=1
MG level 2 (GPU): Tuned block=(192,1,1), grid=(220,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 9.40 Gflop/s, 37.62 GB/s for N4quda4blas7axpbyz_IfEE with GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): Tuned block=(256,1,1), grid=(2,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 1.90 Gflop/s, 3.79 GB/s for N4quda4blas5Norm2IdfEE with GPU-offline,nParity=1,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): Creating TR Lanczos eigensolver
MG level 2 (GPU): Tuned block=(64,1,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 15.64 GB/s for hipMemsetAsync with zero,color_spinor_field.cpp,436
MG level 2 (GPU): Running eigensolver in half precision
MG level 2 (GPU): Using randStateMRG32k3a
MG level 2 (GPU): Tuned block=(128,1,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.00 GB/s for N4quda7RNGInitE with GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host nid006249, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=4x4x2x4,aux=GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)

At least from the last few lines here, it appears that the issue occurs in the eigensolver. I'm now running with a global QUDA_DEBUG_VERBOSE to see exactly what is happening.

kostrzewa commented 1 year ago

After adding a manual debug statement I've figured out that the issue comes from here:

https://github.com/lattice/quda/blob/68d7d2004a3a5abec87daed53392edc1ce060593/lib/eigensolve_quda.cpp#L152-L175

In particular, the issue seems to be with blas::norm2(kSpace[b]).

prepareInitialGuess(kSpace) is called from

https://github.com/lattice/quda/blob/68d7d2004a3a5abec87daed53392edc1ce060593/lib/eig_trlm.cpp#L40-L58

as far as I can tell.

I'm wondering if the check in

https://github.com/lattice/quda/blob/68d7d2004a3a5abec87daed53392edc1ce060593/lib/reduce_quda.cu#L72-L77

is warranted if the eigensolver (and hence blas::norm2) is to be used on the coarse operator. On the other hand, the check has been in place for a long time, including when coarse-grid-deflated MG was working, IIRC:

927d04d1a0 (Dean Howarth   2020-05-28 05:35:20 -0700  72)       void apply(const qudaStream_t &stream)
fe7252cba2 (Mathias Wagner 2019-04-04 23:23:12 +0200  73)       {
073f2d93cf (maddyscientist 2020-07-01 08:26:03 -0700  74)         constexpr bool site_unroll_check = !std::is_same<store_t, y_store_t>::value || isFixed<store_t>::value || decltype(r)::site_unroll;
073f2d93cf (maddyscientist 2020-07-01 08:26:03 -0700  75)         if (site_unroll_check && (x.Ncolor() != 3 || x.Nspin() == 2))
cb485e74b4 (maddyscientist 2020-06-17 13:49:25 -0700  76)           errorQuda("site unroll not supported for nSpin = %d nColor = %d", x.Nspin(), x.Ncolor());
cb485e74b4 (maddyscientist 2020-06-17 13:49:25 -0700  77) 

and there are no changes in multigrid.cpp which would suggest that anything relevant was modified...
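
To make the failure mode concrete, here is a stand-alone distillation of that check (my own toy reconstruction: the isFixed trait is stubbed with std::is_integral, while the predicate itself follows the blame lines above):

```cpp
#include <cstdio>
#include <type_traits>

// Stand-in for QUDA's isFixed trait: fixed-point storage types (short/char,
// i.e. half/quarter precision) count as "fixed" and force site unrolling.
template <typename T> struct isFixed : std::is_integral<T> {};

int main()
{
  using store_t = short;   // half-precision storage, as in the error message
  using y_store_t = short;
  constexpr bool reducer_site_unroll = false; // Norm2 itself does not demand unrolling

  // The condition from the blame lines above (reduce_quda.cu:74-75):
  constexpr bool site_unroll_check =
    !std::is_same<store_t, y_store_t>::value || isFixed<store_t>::value || reducer_site_unroll;

  const int Ncolor = 32, Nspin = 2; // the coarse-grid field from the log

  // Fixed-point storage switches the check on, and a coarse field satisfies
  // both Ncolor != 3 and Nspin == 2, so errorQuda is reached.
  if (site_unroll_check && (Ncolor != 3 || Nspin == 2))
    printf("site unroll not supported for nSpin = %d nColor = %d\n", Nspin, Ncolor);
  return 0;
}
```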

maddyscientist commented 1 year ago

Hi @kostrzewa. This looks like a precision issue to me: I don't think we should ever be using half precision on the coarse grids here. Can you enable QUDA_BACKWARDS=ON so I can see exactly where this is being called?

FWIW: the "site unrolling" refers to the fact that the entire site (all spin and color for a given site in spacetime) is handled by a single thread.
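
As a toy illustration of why fixed-point storage and site unrolling go together (a sketch of mine, not QUDA code): QUDA's half precision stores short components together with a per-site float scale, so a site-unrolled reduction hands each thread one site and reconstructs every spin/color component from that shared scale, and the per-thread work grows with Nspin × Ncolor:

```cpp
#include <cstdio>
#include <vector>

// Toy model of a fixed-point ("half" precision) fermion field: per lattice
// site, 2*Nspin*Ncolor short components plus one float scale (QUDA keeps a
// per-site norm alongside the shorts). A site-unrolled norm2 hands each
// (notional) thread one site, which reconstructs all of its components from
// the shared scale: cheap for Nspin=4, Ncolor=3 (24 reals per site), but a
// coarse field with Nspin=2, Ncolor=32 (128 reals per site) would need far
// more per-thread work and registers, hence no such kernels are instantiated.
double norm2_site_unrolled(const std::vector<short> &v, const std::vector<float> &scale,
                           int nSpin, int nColor)
{
  const int reals_per_site = 2 * nSpin * nColor;
  const int volume = static_cast<int>(scale.size());
  double sum = 0.0;
  for (int site = 0; site < volume; site++) {    // one "thread" per site
    const double s = scale[site] / 32767.0;      // shared fixed-point scale
    for (int c = 0; c < reals_per_site; c++) {   // unrolled over the whole site
      const double x = v[site * reals_per_site + c] * s;
      sum += x * x;
    }
  }
  return sum;
}

int main()
{
  const int nSpin = 2, nColor = 32, volume = 4; // coarse-grid shape from the log
  std::vector<short> v(volume * 2 * nSpin * nColor, 16384);
  std::vector<float> scale(volume, 1.0f);
  printf("norm2 = %f\n", norm2_site_unrolled(v, scale, nSpin, nColor));
  return 0;
}
```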

kostrzewa commented 1 year ago

This issue looks like a precision one I think: I don't think we should ever be using half precision on the coarse grids here.

Thanks for this hint! Setting all [clover_]cuda_prec_precondition and [clover_]cuda_prec_eigensolver to single precision does indeed resolve the problem. This seems to be somewhat inconsistent with https://github.com/lattice/quda/wiki/Twisted-clover-deflated-multigrid#improvement-2-using-coarse-level-deflation, however, where --prec-precondition half is passed, while *_prec_eigensolver does not seem to be set explicitly at all. I was also under the (apparently false) impression that it would make the most sense to run the coarse eigensolver in half precision as the level of convergence required is rather low (residual 1e-4 or so).

I'm aware of course that the Wiki page will be three years old in two weeks, so it might well have grown inconsistent. For example, n-conv is also not set, while it appears to be required now.
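
For concreteness, the workaround amounts to the following against the QUDA C interface (a sketch; field names as in quda.h and in the parameter dump above, with the surrounding tmLQCD plumbing omitted):

```cpp
#include <quda.h>

// Sketch of the workaround described above (field names as in quda.h and in
// the parameter dump earlier in this thread): run the preconditioner and
// eigensolver precisions in single rather than half.
void use_single_precision_coarse(QudaInvertParam &inv_param)
{
  inv_param.cuda_prec_precondition        = QUDA_SINGLE_PRECISION;
  inv_param.cuda_prec_eigensolver         = QUDA_SINGLE_PRECISION;
  inv_param.clover_cuda_prec_precondition = QUDA_SINGLE_PRECISION;
  inv_param.clover_cuda_prec_eigensolver  = QUDA_SINGLE_PRECISION;
}
```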

Can you enable QUDA_BACKWARDS=ON so I can see exactly where this being called?

Will do and report back, hopefully soon.

kostrzewa commented 1 year ago

FWIW: the "site unrolling" refers to the fact that the entire site (all spin and color for a given site in spacetime) is handled by a single thread.

Thanks. How come this is being done on the coarsest grid?

kostrzewa commented 1 year ago

Very useful, I will certainly use backward-cpp in the future!

#16   Object "libquda.so", at 0x14fa3aef8e8d, in newMultigridQuda
#15   Object "libquda.so", at 0x14fa3aef6995, in quda::multigrid_solver::multigrid_solver(QudaMultigridParam_s&, quda::TimeProfile&)
#14   Object "libquda.so", at 0x14fa3ae6f83c, in quda::MG::MG(quda::MGParam&, quda::TimeProfile&)
#13   Object "libquda.so", at 0x14fa3ae73c94, in quda::MG::reset(bool)
#12   Object "libquda.so", at 0x14fa3ae6f83c, in quda::MG::MG(quda::MGParam&, quda::TimeProfile&)
#11   Object "libquda.so", at 0x14fa3ae740c8, in quda::MG::reset(bool)
#10   Object "libquda.so", at 0x14fa3ae78de8, in quda::MG::createCoarseSolver()
#9    Object "libquda.so", at 0x14fa3ae7c323, in quda::PreconditionedSolver::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&)
#8    Object "libquda.so", at 0x14fa3aeb43ad, in quda::CAGCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&)
#7    Object "libquda.so", at 0x14fa3ae41bd8, in quda::TRLM::operator()(std::vector<quda::ColorSpinorField, std::allocator<quda::ColorSpinorField> >&, std::vector<std::complex<double>, std::allocator<std::complex<double> > >&)
#6    Object "libquda.so", at 0x14fa3ae5c5de, in quda::EigenSolver::prepareInitialGuess(std::vector<quda::ColorSpinorField, std::allocator<quda::ColorSpinorField> >&)
#5    Object "libquda.so", at 0x14fa38aea640, in quda::blas::norm2(quda::ColorSpinorField const&)
#4    Object "libquda.so", at 0x14fa38afd472, in void quda::blas::instantiate<quda::blas::Norm2, quda::blas::Reduce, false, double, quda::ColorSpinorField const, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, double&>(double const&, double const&, double const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, double&)
#3    Object "libquda.so", at 0x14fa38b0127f, in quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::Reduce<quda::ColorSpinorField const, quda::ColorSpinorField const, quda::ColorSpinorField const, quda::ColorSpinorField const, quda::ColorSpinorField const>(double const&, double const&, double const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, double&)
#2    Object "libquda.so", at 0x14fa38b01b70, in quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(quda::qudaStream_t const&)
#1    Object "libquda.so", at 0x14fa3af33b46, in errorQuda_(char const*, char const*, int, ...)
#0    Object "libquda.so", at 0x14fa3af675e4, in quda::comm_abort(int)
maddyscientist commented 1 year ago

This issue looks like a precision one I think: I don't think we should ever be using half precision on the coarse grids here.

Thanks for this hint! Setting all [clover_]cuda_prec_precondition and [clover_]cuda_prec_eigensolver to single precision does indeed resolve the problem. This seems to be somewhat inconsistent with https://github.com/lattice/quda/wiki/Twisted-clover-deflated-multigrid#improvement-2-using-coarse-level-deflation, however, where --prec-precondition half is passed, while *_prec_eigensolver does not seem to be set explicitly at all. I was also under the (apparently false) impression that it would make the most sense to run the coarse eigensolver in half precision as the level of convergence required is rather low (residual 1e-4 or so).

I'm aware of course that the Wiki page will be three years old in two weeks, so it might well have grown inconsistent. For example, n-conv is also not set, while it appears to be required now.

This just looks like the wiki pages having grown stale: the eigensolver precision option was added after they were written. So we have five precisions to worry about now:

- cuda_prec: the outer solver precision
- cuda_prec_sloppy: the sloppy (inner) solver precision
- cuda_prec_refinement_sloppy: the sloppy precision used during refinement
- cuda_prec_precondition: the preconditioner precision
- cuda_prec_eigensolver: the eigensolver precision

In general one would want to use a double / single / half / half / single (respectively) solver. The coarse eigensolvers must use single precision, since we don't support half precision for the coarse-grid fermion fields (the need to "unroll" the site vector would make for a combinatoric nightmare at compile time and would also reduce parallelism, which would kill performance).
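
In code, the recommendation maps onto the QudaInvertParam fields like this (a sketch, not a complete parameter setup):

```cpp
#include <quda.h>

// The general double / single / half / half / single recommendation, spelled
// out against the QudaInvertParam fields listed above.
void set_recommended_precisions(QudaInvertParam &inv_param)
{
  inv_param.cuda_prec                   = QUDA_DOUBLE_PRECISION; // outer solver
  inv_param.cuda_prec_sloppy            = QUDA_SINGLE_PRECISION; // sloppy inner solver
  inv_param.cuda_prec_refinement_sloppy = QUDA_HALF_PRECISION;   // sloppy refinement
  inv_param.cuda_prec_precondition      = QUDA_HALF_PRECISION;   // preconditioner (fine grid)
  inv_param.cuda_prec_eigensolver       = QUDA_SINGLE_PRECISION; // must be single on coarse grids
}
```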

I will update the wiki to fix this deficit, and apologies for this incongruity between the wiki and the code.

Glad you find the QUDA_BACKWARDS option helpful. I've updated the debugging page to note this, as it escaped documentation.