lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Feature/mrhs solvers #1489

Closed: maddyscientist closed this 1 month ago

maddyscientist commented 2 months ago

This PR is a biggie:

Things left to do

weinbe2 commented 2 months ago

While I remember: I support this being a new major version number, but mayhaps it should be a 2.0-rc (release candidate) and before we go full 2.0 I can dust off this PR: https://github.com/lattice/quda/pull/1283 ; it can leverage the multirhs work and it breaks the interface, so I think it would be good to bring it along for the 2.0 ride.

maddyscientist commented 2 months ago

> While I remember: I support this being a new major version number, but mayhaps it should be a 2.0-rc (release candidate) and before we go full 2.0 I can dust off this PR: #1283 ; it can leverage the multirhs work and it breaks the interface, so I think it would be good to bring it along for the 2.0 ride.

Absolutely. It would be excellent to dust this sucker off.

Noting also that there's a difference between updating the version number in the header and tagging a version number. I was just proposing that we do the former, not the latter. Regardless, we can do something > 1.2.x and < 2.0.0 for this PR.

maddyscientist commented 1 month ago

Noting that all solvers have now been made MRHS-aware, with the exception of the legacy GMRES-DR and EigCG solvers, which are overdue for a complete cleanup that is outside the scope of this PR.
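
For context, "MRHS aware" here means a solver interface that accepts a whole batch of right-hand sides per call, so each operator application can act on the full batch, rather than solving one vector at a time. A minimal sketch of the idea, using simplified stand-in types rather than QUDA's actual ColorSpinorField/cvector_ref machinery:

#include <cstdio>
#include <vector>

struct ColorSpinorField { int id; };  // stand-in for a real lattice field

// Single-RHS interface: the operator is applied to one vector per call.
void solve(ColorSpinorField &x, const ColorSpinorField &b)
{
  x = b;  // stand-in for a real Krylov solve
  printf("solved system %d alone\n", b.id);
}

// MRHS-aware interface: all right-hand sides in one call, so each operator
// application can act on the whole batch (better GPU utilization).
void solve(std::vector<ColorSpinorField> &x, const std::vector<ColorSpinorField> &b)
{
  for (size_t i = 0; i < b.size(); i++) x[i] = b[i];  // batched stand-in
  printf("solved %zu systems as one batch\n", b.size());
}

int main()
{
  std::vector<ColorSpinorField> b{{0}, {1}, {2}, {3}}, x(4);
  solve(x, b);  // one batched call instead of four single-RHS calls
}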

maddyscientist commented 1 month ago

This PR is now functionally complete, and all tests are passing. This is ready for final review (@weinbe2 @hummingtree @mathiaswagner @bjoo).

weinbe2 commented 1 month ago

I have tested the batch CG solver with a modified version of MILC that properly utilizes the multisource MILC interface function. This is available here: https://github.com/lattice/milc_qcd/tree/feature/quda-block-solver-interface (current commit: https://github.com/lattice/milc_qcd/commit/f0404fe841712b63837711e2252d08d1491e0502). This PR works perfectly fine with the current develop version of MILC.

I will note that this has only been tested with vanilla CG. I have not yet plumbed multi-RHS support into the MG solver; I consider that within the scope of a second QUDA PR.
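
For reference, a hedged sketch of how a host code like MILC might drive the batched solve through a multisource interface function. invertMultiSrcQuda and the num_src field do exist in quda.h, but treat the exact signature below as an assumption; the MILC branch linked above is the authoritative usage.

#include <quda.h>

// Hypothetical wrapper: sols/srcs are arrays of nsrc host field pointers.
void solve_batch(void **sols, void **srcs, int nsrc, QudaInvertParam &inv_param)
{
  inv_param.num_src = nsrc;                   // number of right-hand sides (assumed field name)
  invertMultiSrcQuda(sols, srcs, &inv_param); // one call solves all nsrc systems (assumed signature)
}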

weinbe2 commented 1 month ago

When using coarsest-level deflation (perhaps just with staggered operators?) it looks like we need to change the default values corresponding to the flag --mg-eig-evals-batch-size 2 [###]. I uncovered this with coarsest-level deflation for staggered operators, coarse Nc = 64 or 96, on sm_80, and hit the following error when trying to converge 16 eigenvalues:

[...]
MG level 2 (GPU): RitzValue[0015]: (+2.6378149309623524e-03, +0.0000000000000000e+00) residual 1.4461873963747640e-05
MG level 2 (GPU): ERROR: nVec = 8 not instantiated
 (rank 0, host ipp1-1780.nvidia.com, block_transpose.cu:116 in void quda::launch_span_nVec(v_t&, quda::cvector_ref<O>&, quda::IntList<nVec, N ...>) [with v_t = quda::ColorSpinorField; b_t = const quda::ColorSpinorField; vFloat = float; bFloat = float; int nSpin = 2; int nColor = 64; int nVec = 16; int ...N = {}; quda::cvector_ref<O> = const quda::vector_ref<const quda::ColorSpinorField>]())
MG level 2 (GPU):        last kernel called was (name=cudaMemsetAsync,volume=bytes=8192,aux=zero,color_spinor_field.cpp,406)

I compiled with MG MRHS support for 16 and 32; I can't think of anywhere that I've imposed a multiple of 8. The solution was explicitly setting --mg-eig-evals-batch-size 2 16 (where 2 is the lowest level).
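
For what it's worth, the failure mode looks like the classic compile-time size dispatch: a recursive IntList peels off one instantiated nVec at a time and errors out if the runtime request matches none of them. A minimal self-contained sketch (hypothetical, not QUDA's actual block_transpose.cu code) that reproduces the behavior when only 16 and 32 are compiled:

#include <cstdio>
#include <cstdlib>

template <int... Ns> struct IntList {};

// Try each compiled size in turn; error out if the request matches none.
template <int nVec, int... N>
void launch_span_nVec(int request, IntList<nVec, N...>)
{
  if (request == nVec) {
    printf("launching kernel instantiated for nVec = %d\n", nVec);
  } else if constexpr (sizeof...(N) > 0) {
    launch_span_nVec(request, IntList<N...>());  // recurse on the remaining sizes
  } else {
    fprintf(stderr, "ERROR: nVec = %d not instantiated\n", request);
    exit(1);
  }
}

int main()
{
  launch_span_nVec(16, IntList<16, 32>());  // matches a compiled size: fine
  launch_span_nVec(8,  IntList<16, 32>());  // batch of 8: nothing to match
}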

My (relatively reduced) command is:

mpirun -np 1 ./staggered_invert_test \
  --prec double --prec-sloppy single --prec-null half --prec-precondition half \
  --mass 0.1 --recon 13 --recon-sloppy 9 --recon-precondition 9 \
  --dim 8 8 8 8 --gridsize 1 1 1 1 \
  --dslash-type staggered --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 3 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 \
  --mg-setup-tol 1 1e-5 --mg-setup-inv 1 cgnr \
  --nsrc 1 --niter 25 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true \
  --mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct  --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc  --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 ca-gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 --mg-coarse-solver-ca-basis-size 2 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose \
  --mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true \
  --mg-nvec 2 16 --mg-eig-n-ev 2 16 --mg-eig-n-kr 2 128 --mg-eig-tol 2 1e-1 \
  --mg-eig-use-poly-acc 2 false --mg-eig-poly-deg 2 100 --mg-eig-amin 2 1e-1 \
  --mg-eig-max-restarts 2 1000

Neither toggling --mg-setup-use-mma 2 false nor --mg-dslash-use-mma 2 false works around this.

I can't quite think of a good way to address this (yet), but I'm also not clear on the details in the weeds. Maybe you know exactly where the fix is, @maddyscientist?

maddyscientist commented 1 month ago

> When using coarsest-level deflation (perhaps just with staggered operators?) it looks like we need to change the default values corresponding to the flag --mg-eig-evals-batch-size 2 [###]. [...] Maybe you know exactly where the fix is, @maddyscientist?

OK, I understand this issue. There are two things at play here:

Perhaps it would also be a good idea to fall back to the non-MMA dslash if the requested size isn't available? That would make things more bulletproof, perhaps with a warning on the first call?
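
A minimal sketch of what that fallback could look like, with a one-time warning; all names here (hasMMAInstance, applyDslashMMA, applyDslash) are hypothetical stand-ins, not QUDA's actual functions:

#include <cstdio>

// Hypothetical: report whether an MMA kernel was compiled for this width.
bool hasMMAInstance(int nVec) { return nVec == 16 || nVec == 32; }
void applyDslashMMA(int nVec) { printf("MMA dslash, nVec = %d\n", nVec); }
void applyDslash(int nVec) { printf("non-MMA dslash, nVec = %d\n", nVec); }

void dispatchDslash(int nVec)
{
  if (hasMMAInstance(nVec)) {
    applyDslashMMA(nVec);
  } else {
    static bool warned = false;  // warn on the first fallback only
    if (!warned) {
      fprintf(stderr, "WARNING: no MMA instantiation for nVec = %d, "
                      "falling back to non-MMA dslash\n", nVec);
      warned = true;
    }
    applyDslash(nVec);
  }
}

int main()
{
  dispatchDslash(16);  // MMA path
  dispatchDslash(8);   // warns once, then uses the generic path
}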