While I remember: I support this being a new major version number, but mayhaps it should be a 2.0-rc (release candidate), and before we go full 2.0 I can dust off this PR: https://github.com/lattice/quda/pull/1283. It can leverage the multi-RHS work and it breaks the interface, so I think it would be good to bring it along for the 2.0 ride.
Absolutely. It would be excellent to dust this sucker off.
Noting also that there's a difference between updating the version number in the header and tagging a version number. I was just proposing that we do the former, not the latter. Regardless, we can do something > 1.2.x and < 2.0.0 for this PR.
Noting that all solvers have been made MRHS aware now, with the exception of the legacy GMRES-DR and EigCG solvers, which are overdue for a complete cleanup that is outside the scope of this PR.
This PR is now functionally complete, and all tests are passing. This is ready for final review (@weinbe2 @hummingtree @mathiaswagner @bjoo).
I have tested the batch CG solver with a modified version of MILC that properly utilizes the multi-source MILC interface function. This is available here: https://github.com/lattice/milc_qcd/tree/feature/quda-block-solver-interface ; the current commit is https://github.com/lattice/milc_qcd/commit/f0404fe841712b63837711e2252d08d1491e0502 . This PR works perfectly fine with the current develop version of MILC.
I will note that this has only been tested with vanilla CG. I have not yet plumbed in multi-RHS support for the MG solver; I consider that within the scope of a second QUDA PR.
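For context, here's a minimal sketch (my own, not code from this PR or MILC) of how a host application might drive the batched solve through the multi-source interface. The three-argument `invertMultiSrcQuda` form and the `num_src` field are assumptions on my part; the per-source `true_res` / `true_res_hq` arrays follow the description further down this thread:

```cpp
// Hedged sketch: drive a batched solve through invertMultiSrcQuda.
// Assumes gauge/clover fields are already loaded and inv_param is
// otherwise configured as for a single-RHS invertQuda call.
#include <quda.h>
#include <cstdio>

void solve_batch(void *x[], void *b[], int n_src, QudaInvertParam &inv_param)
{
  inv_param.num_src = n_src;            // number of right-hand sides in the batch
  invertMultiSrcQuda(x, b, &inv_param); // one fused multi-RHS solve

  // with this PR, true_res and true_res_hq are per-source arrays
  for (int i = 0; i < n_src; i++)
    printf("rhs %d: true residual %e (heavy-quark %e)\n", i,
           inv_param.true_res[i], inv_param.true_res_hq[i]);
}
```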
When using coarsest-level deflation (perhaps just with staggered operators?) it looks like we need to change the default values corresponding to the flag `--mg-eig-evals-batch-size 2 [###]`. I uncovered this with coarsest-level deflation for staggered operators, coarse Nc = 64 or 96, on sm_80. I hit the following error when trying to converge 16 eigenvalues:
```
[...]
MG level 2 (GPU): RitzValue[0015]: (+2.6378149309623524e-03, +0.0000000000000000e+00) residual 1.4461873963747640e-05
MG level 2 (GPU): ERROR: nVec = 8 not instantiated
(rank 0, host ipp1-1780.nvidia.com, block_transpose.cu:116 in void quda::launch_span_nVec(v_t&, quda::cvector_ref<O>&, quda::IntList<nVec, N ...>) [with v_t = quda::ColorSpinorField; b_t = const quda::ColorSpinorField; vFloat = float; bFloat = float; int nSpin = 2; int nColor = 64; int nVec = 16; int ...N = {}; quda::cvector_ref<O> = const quda::vector_ref<const quda::ColorSpinorField>]())
MG level 2 (GPU): last kernel called was (name=cudaMemsetAsync,volume=bytes=8192,aux=zero,color_spinor_field.cpp,406)
```
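For what it's worth, the error comes from a compile-time dispatch over the instantiated `nVec` values; the eigensolver apparently requests a batch of 8 here, which isn't in the compiled 16/32 list. A simplified illustration of the pattern (not QUDA's actual code) is:

```cpp
// Simplified illustration of IntList-style dispatch: the runtime nVec
// is matched against the compile-time list, and anything outside the
// list aborts with "nVec = %d not instantiated".
#include <cstdio>
#include <cstdlib>

template <int... Ns> struct IntList {};

template <int N, int... Rest>
void launch_span_nVec(int nVec, IntList<N, Rest...>)
{
  if (nVec == N) {
    printf("launching kernel instantiated for nVec = %d\n", N); // templated launch in QUDA
  } else if constexpr (sizeof...(Rest) > 0) {
    launch_span_nVec(nVec, IntList<Rest...>()); // try the next compiled value
  } else {
    fprintf(stderr, "ERROR: nVec = %d not instantiated\n", nVec);
    abort();
  }
}

int main()
{
  IntList<16, 32> compiled;       // e.g. built with MG MRHS support for 16 and 32
  launch_span_nVec(16, compiled); // matches a compiled instantiation
  launch_span_nVec(8, compiled);  // no match -> the error seen above
}
```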
I compiled with MG MRHS support for 16 and 32; I can't think of anywhere that I've imposed a multiple of 8. The solution was explicitly setting `--mg-eig-evals-batch-size 2 16` (where 2 is the lowest level).
My (relatively reduced) command is:
```
mpirun -np 1 ./staggered_invert_test \
--prec double --prec-sloppy single --prec-null half --prec-precondition half \
--mass 0.1 --recon 13 --recon-sloppy 9 --recon-precondition 9 \
--dim 8 8 8 8 --gridsize 1 1 1 1 \
--dslash-type staggered --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
--verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
--inv-multigrid true --mg-levels 3 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
--mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
--mg-block-size 1 4 4 4 4 --mg-nvec 1 64 \
--mg-setup-tol 1 1e-5 --mg-setup-inv 1 cgnr \
--nsrc 1 --niter 25 \
--mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true \
--mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true \
--mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct --mg-nu-pre 0 0 --mg-nu-post 0 4 \
--mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
--mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc --mg-nu-pre 2 0 --mg-nu-post 2 4 \
--mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
--mg-coarse-solver 2 ca-gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 --mg-coarse-solver-ca-basis-size 2 16 \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose \
--mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true \
--mg-nvec 2 16 --mg-eig-n-ev 2 16 --mg-eig-n-kr 2 128 --mg-eig-tol 2 1e-1 \
--mg-eig-use-poly-acc 2 false --mg-eig-poly-deg 2 100 --mg-eig-amin 2 1e-1 \
--mg-eig-max-restarts 2 1000
```
Neither toggling `--mg-setup-use-mma 2 false` nor `--mg-dslash-use-mma 2 false` works around this.
I can't quite think of a good way to address this (yet), but I'm also not clear on the details in the weeds. Maybe you know exactly where the fix is, @maddyscientist?
Ok, I understand this issue. There are two things at play here:

- `--mg-dslash-use-mma i` acts on the `i + 1` level, so you should set `--mg-dslash-use-mma 1 false`, somewhat counterintuitively. This was likely an oversight from when the MMA dslash was added. I can fix this.
- Perhaps it would also be a good idea to fall back to the non-MMA dslash if the requested size isn't available? That would make things more bulletproof, perhaps with a warning on first call (see the sketch below)?
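Roughly what that fallback could look like (a hypothetical sketch of the idea, not QUDA's actual dispatch code; all names here are illustrative):

```cpp
// Hypothetical fallback: if the requested nVec has no MMA instantiation,
// warn once and drop back to the non-MMA dslash instead of aborting.
#include <cstdio>

bool mma_instantiated(int n_vec) { return n_vec == 16 || n_vec == 32; } // stand-in for the compiled-in list
void apply_dslash_mma(int n_vec) { printf("MMA dslash, nVec = %d\n", n_vec); }
void apply_dslash_generic(int n_vec) { printf("generic dslash, nVec = %d\n", n_vec); }

void apply_dslash(int n_vec, bool use_mma)
{
  if (use_mma && !mma_instantiated(n_vec)) {
    static bool warned = false; // warn on first call only
    if (!warned) {
      fprintf(stderr, "WARNING: no MMA dslash instantiated for nVec = %d, falling back to non-MMA path\n", n_vec);
      warned = true;
    }
    use_mma = false;
  }
  use_mma ? apply_dslash_mma(n_vec) : apply_dslash_generic(n_vec);
}

int main()
{
  apply_dslash(8, true);  // warns and falls back
  apply_dslash(8, true);  // silent fallback on subsequent calls
  apply_dslash(16, true); // MMA path
}
```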
This PR is a biggie:

- `QudaMultigridParam::n_vec_batch`
- `invertMultiSrcQuda` interface
- `QudaInvertParam::true_res` and `QudaInvertParam::true_res_hq` are now arrays
- `Dirac::prepare` and `Dirac::reconstruct` functions are now MRHS optimized
- conversion from `cvector<T>` to `T` is now explicit instead of implicit
- `DslashCoarse` is now robust to underflow
- `MPI_THREAD_FUNNELED` (see the note below)
Things left to do