lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Multi-RHS support for Prolongator and Restrictor #1434

Closed maddyscientist closed 5 months ago

maddyscientist commented 5 months ago

This PR adds support for multi-RHS to both the prolongator and restrictor. This will be an essential building block for multi-RHS multigrid solvers:

weinbe2 commented 5 months ago

This passes a visual review as the code stands. A more detailed review of the generated code shows some stack spillage at staggered-size Nc. @maddyscientist is going to look into it and will fix it if that is easy; if not, we'll get this merged in regardless and I'll file an issue to revisit it.

weinbe2 commented 5 months ago

Compile:

cmake -DCMAKE_BUILD_TYPE=RELEASE -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_GPU_ARCH=sm_80 -DQUDA_DOWNLOAD_USQCD=ON -DQUDA_QIO=ON -DQUDA_QMP=ON \
  -DQUDA_MULTIGRID=ON -DQUDA_MULTIGRID_NVEC_LIST="24,64,96" ../quda

Generate a well-behaved 16^4 gauge field:

mpirun -np 1 ./heatbath_test --dim 16 16 16 16 --save-gauge l16t16b7p0 \
  --heatbath-beta 7.0 --heatbath-coldstart true --heatbath-num-steps 10 --heatbath-warmup-steps 1000

Run a 3 <-> 64 <-> 96 test (the --mg-nvec values on levels 0, 1, and 2), which exercises the recursion:

mpirun -np 1 ./staggered_invert_test \
  --prec double --prec-sloppy single --prec-null half --prec-precondition half \
  --mass 0.01 --recon 13 --recon-sloppy 9 --recon-precondition 9 \
  --dim 16 16 16 16 --gridsize 1 1 1 1 --load-gauge l16t16b7p0 \
  --dslash-type asqtad --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 \
  --mg-block-size 2 2 2 2 2 --mg-nvec 2 96 \
  --mg-setup-tol 1 1e-5 --mg-setup-tol 2 1e-5 --mg-setup-inv 1 cgnr --mg-setup-inv 2 cgnr \
  --nsrc 1 --niter 25 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true --mg-setup-use-mma 3 true \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
  --mg-coarse-solver 3 ca-gcr --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 0.25 --mg-coarse-solver-maxiter 3 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose
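
For reference, the lattice extents on each level of the command above can be worked out from `--dim` and the `--mg-block-size` values. The sketch below assumes that each coarse extent is simply the parent extent divided by the block size in that dimension; the helper name `level_dims` is hypothetical and not part of QUDA.

```python
def level_dims(fine_dims, block_sizes):
    """Return the lattice extents on every level, finest first.

    Assumption: each coarsening divides the parent extent by the block
    size in that dimension, and the division must be exact.
    """
    dims = [list(fine_dims)]
    for block in block_sizes:
        parent = dims[-1]
        for extent, b in zip(parent, block):
            if extent % b != 0:
                raise ValueError(f"block size {b} does not divide extent {extent}")
        dims.append([extent // b for extent, b in zip(parent, block)])
    return dims


# Values copied from the command: --dim and --mg-block-size for levels 0, 1, 2.
fine = (16, 16, 16, 16)
blocks = [(1, 1, 1, 1), (4, 4, 4, 4), (2, 2, 2, 2)]
nvec = [3, 64, 96]  # --mg-nvec for the transfers 0->1, 1->2, 2->3

levels = level_dims(fine, blocks)
for level, extents in enumerate(levels):
    print(f"level {level}: {extents}")
for level, n in enumerate(nvec):
    print(f"transfer {level} -> {level + 1}: nvec = {n}")
```

Under that assumption the four levels are 16^4, 16^4, 4^4, and 2^4, so the block size of 1 on level 0 leaves the lattice extents unchanged.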

To test the recursion in the multi-RHS coarse operator, append these lines to run block TRLM with a block size of 48, which should split into {32, 16}:

  --mg-eig 3 true --mg-eig-type 3 blktrlm --mg-eig-use-dagger 3 false --mg-eig-use-normop 3 true \
  --mg-nvec 3 48 --mg-eig-n-ev 3 96 --mg-eig-n-kr 3 192 --mg-eig-tol 3 1e-4 --mg-eig-use-poly-acc 3 false \
  --mg-eig-block-size 3 48 --mg-eig-spectrum 3 SR \
  --mg-eig-max-restarts 3 1000
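
One way to see why 48 would come out as {32, 16} is a greedy split that repeatedly peels off the largest power of two that fits. This is only a sketch of that idea: the cap of 32 and the power-of-two rule are assumptions chosen to reproduce the split quoted above, not taken from QUDA's source, and `split_rhs` is a hypothetical helper.

```python
def split_rhs(n_rhs, max_chunk=32):
    """Greedily split n_rhs into power-of-two chunks no larger than max_chunk.

    Both the power-of-two rule and the default cap are illustrative
    assumptions, not QUDA's actual policy.
    """
    if n_rhs < 0 or max_chunk < 1:
        raise ValueError("n_rhs must be >= 0 and max_chunk must be >= 1")
    chunks = []
    remaining = n_rhs
    while remaining > 0:
        # Largest power of two that fits under both the cap and what is left.
        chunk = 1 << (min(remaining, max_chunk).bit_length() - 1)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


print(split_rhs(48))  # [32, 16]
```

With these assumptions a block size that is already a power of two at or below the cap is left as a single chunk, so a size such as 48 is needed to force more than one chunk.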