While I remember: I support this being a new major version number, but mayhaps it should be a 2.0-rc (release candidate), and before we go full 2.0 I can dust off this PR: https://github.com/lattice/quda/pull/1283. It can leverage the multi-RHS work and it breaks the interface, so I think it would be good to bring it along for the 2.0 ride.
Absolutely. It would be excellent to dust this sucker off.
Noting also that there's a difference between updating the version number in the header and tagging a version number. I was just proposing that we do the former, not the latter. Regardless, we can do something > 1.2.x and < 2.0.0 for this PR.
Noting that all solvers have been made MRHS aware now, with the exception of the legacy GMRES-DR and EigCG solvers, which are overdue for a complete cleanup that is outside the scope of this PR.
This PR is now functionally complete, and all tests are passing. This is ready for final review (@weinbe2 @hummingtree @mathiaswagner @bjoo).
I have tested the batch CG solver with a modified version of MILC that properly utilizes the multi-source MILC interface function. This is available here: https://github.com/lattice/milc_qcd/tree/feature/quda-block-solver-interface ; the current commit is https://github.com/lattice/milc_qcd/commit/f0404fe841712b63837711e2252d08d1491e0502 . This PR works perfectly fine with the current develop version of MILC.
I will note that this has only been tested with vanilla CG. I have not yet plumbed in multi-RHS support for the MG solver; I consider that within the scope of a second QUDA PR.
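For context, here's a minimal sketch (my own, not code from this PR or MILC) of how a host application might drive the batched solve through the multi-source interface. The three-argument `invertMultiSrcQuda` form and the `num_src` field are assumptions on my part; the per-source `true_res` / `true_res_hq` arrays follow the description further down this thread:

```cpp
// Hedged sketch: drive a batched solve through invertMultiSrcQuda.
// Assumes gauge/clover fields are already loaded and inv_param is
// otherwise configured as for a single-RHS invertQuda call.
#include <quda.h>
#include <cstdio>

void solve_batch(void *x[], void *b[], int n_src, QudaInvertParam &inv_param)
{
  inv_param.num_src = n_src;            // number of right-hand sides in the batch
  invertMultiSrcQuda(x, b, &inv_param); // one fused multi-RHS solve

  // with this PR, true_res and true_res_hq are per-source arrays
  for (int i = 0; i < n_src; i++)
    printf("rhs %d: true residual %e (heavy-quark %e)\n", i,
           inv_param.true_res[i], inv_param.true_res_hq[i]);
}
```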
When using coarsest-level deflation (perhaps just with staggered operators?) it looks like we need to change the default values corresponding to the flag `--mg-eig-evals-batch-size 2 [###]`. I uncovered this with coarsest-level deflation for staggered operators, coarse Nc = 64 or 96, on sm_80. I hit the following error when trying to converge 16 eigenvalues:
```
[...]
MG level 2 (GPU): RitzValue[0015]: (+2.6378149309623524e-03, +0.0000000000000000e+00) residual 1.4461873963747640e-05
MG level 2 (GPU): ERROR: nVec = 8 not instantiated
(rank 0, host ipp1-1780.nvidia.com, block_transpose.cu:116 in void quda::launch_span_nVec(v_t&, quda::cvector_ref<O>&, quda::IntList<nVec, N ...>) [with v_t = quda::ColorSpinorField; b_t = const quda::ColorSpinorField; vFloat = float; bFloat = float; int nSpin = 2; int nColor = 64; int nVec = 16; int ...N = {}; quda::cvector_ref<O> = const quda::vector_ref<const quda::ColorSpinorField>]())
MG level 2 (GPU): last kernel called was (name=cudaMemsetAsync,volume=bytes=8192,aux=zero,color_spinor_field.cpp,406)
```
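For what it's worth, the error comes from a compile-time dispatch over the instantiated `nVec` values; the eigensolver apparently requests a batch of 8 here, which isn't in the compiled 16/32 list. A simplified illustration of the pattern (not QUDA's actual code) is:

```cpp
// Simplified illustration of IntList-style dispatch: the runtime nVec
// is matched against the compile-time list, and anything outside the
// list aborts with "nVec = %d not instantiated".
#include <cstdio>
#include <cstdlib>

template <int... Ns> struct IntList {};

template <int N, int... Rest>
void launch_span_nVec(int nVec, IntList<N, Rest...>)
{
  if (nVec == N) {
    printf("launching kernel instantiated for nVec = %d\n", N); // templated launch in QUDA
  } else if constexpr (sizeof...(Rest) > 0) {
    launch_span_nVec(nVec, IntList<Rest...>()); // try the next compiled value
  } else {
    fprintf(stderr, "ERROR: nVec = %d not instantiated\n", nVec);
    abort();
  }
}

int main()
{
  IntList<16, 32> compiled;       // e.g. built with MG MRHS support for 16 and 32
  launch_span_nVec(16, compiled); // matches a compiled instantiation
  launch_span_nVec(8, compiled);  // no match -> the error seen above
}
```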
I compiled with MG MRHS support for 16 and 32; I can't think of anywhere that I've imposed a multiple of 8. The solution was explicitly setting `--mg-eig-evals-batch-size 2 16` (where 2 is the lowest level).
My (relatively reduced) command is:
```
mpirun -np 1 ./staggered_invert_test \
--prec double --prec-sloppy single --prec-null half --prec-precondition half \
--mass 0.1 --recon 13 --recon-sloppy 9 --recon-precondition 9 \
--dim 8 8 8 8 --gridsize 1 1 1 1 \
--dslash-type staggered --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
--verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
--inv-multigrid true --mg-levels 3 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
--mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
--mg-block-size 1 4 4 4 4 --mg-nvec 1 64 \
--mg-setup-tol 1 1e-5 --mg-setup-inv 1 cgnr \
--nsrc 1 --niter 25 \
--mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true \
--mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true \
--mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct --mg-nu-pre 0 0 --mg-nu-post 0 4 \
--mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
--mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc --mg-nu-pre 2 0 --mg-nu-post 2 4 \
--mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
--mg-coarse-solver 2 ca-gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 --mg-coarse-solver-ca-basis-size 2 16 \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose \
--mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true \
--mg-nvec 2 16 --mg-eig-n-ev 2 16 --mg-eig-n-kr 2 128 --mg-eig-tol 2 1e-1 \
--mg-eig-use-poly-acc 2 false --mg-eig-poly-deg 2 100 --mg-eig-amin 2 1e-1 \
--mg-eig-max-restarts 2 1000
```
Neither toggling `--mg-setup-use-mma 2 false` nor `--mg-dslash-use-mma 2 false` works around this.
I can't quite think of a good way to address this (yet), but I'm also not clear on the details in the weeds. Maybe you know exactly where the fix is, @maddyscientist?
Ok, I understand this issue. There are two things at play here:

- `--mg-dslash-use-mma i` acts on the `i + 1` level, so you should set `--mg-dslash-use-mma 1 false`, somewhat counterintuitively. This was likely an oversight from when the MMA dslash was added. I can fix this.
- Perhaps it would also be a good idea to fall back to the non-MMA dslash if the requested size isn't available? That would make things more bulletproof, perhaps with a warning on first call (see the sketch below)?
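Roughly what that fallback could look like (a hypothetical sketch of the idea, not QUDA's actual dispatch code; all names here are illustrative):

```cpp
// Hypothetical fallback: if the requested nVec has no MMA instantiation,
// warn once and drop back to the non-MMA dslash instead of aborting.
#include <cstdio>

bool mma_instantiated(int n_vec) { return n_vec == 16 || n_vec == 32; } // stand-in for the compiled-in list
void apply_dslash_mma(int n_vec) { printf("MMA dslash, nVec = %d\n", n_vec); }
void apply_dslash_generic(int n_vec) { printf("generic dslash, nVec = %d\n", n_vec); }

void apply_dslash(int n_vec, bool use_mma)
{
  if (use_mma && !mma_instantiated(n_vec)) {
    static bool warned = false; // warn on first call only
    if (!warned) {
      fprintf(stderr, "WARNING: no MMA dslash instantiated for nVec = %d, falling back to non-MMA path\n", n_vec);
      warned = true;
    }
    use_mma = false;
  }
  use_mma ? apply_dslash_mma(n_vec) : apply_dslash_generic(n_vec);
}

int main()
{
  apply_dslash(8, true);  // warns and falls back
  apply_dslash(8, true);  // silent fallback on subsequent calls
  apply_dslash(16, true); // MMA path
}
```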
This PR is a biggie:

- `QudaMultigridParam::n_vec_batch`
- `invertMultiSrcQuda` interface
- `QudaInvertParam::true_res` and `QudaInvertParam::true_res_hq` are now arrays
- `Dirac::prepare` and `Dirac::reconstruct` functions are now MRHS optimized
- conversion from `cvector<T>` to `T` is now explicit instead of implicit
- `DslashCoarse` is now robust to underflow
- `MPI_THREAD_FUNNELED` (see the note below)
Things left to do