Fuses the outer-product call and the subsequent path tracing into a single offload call, qudaHisqForce.
The new QUDA HISQ force routine is around 25% faster than the previous implementation, uses less communication, and significantly reduces GPU memory use. For example, one can now run a 36^3x72 evolution on a single 32 GiB V100 GPU.
Reduces CPU memory overhead in MILC, since there are no outer-product temporaries.
Reduces CPU/GPU memory copies, since there are no outer-product temporaries. When running on a quad-GPU x86 machine, less than 5% of the total runtime is now occupied by CPU-GPU memory copies.
Added support for more Makefile settings through environment variables: MILC can now be built with QUDA support purely by setting environment variables, without editing the Makefile. Added an example file, ks_imp_rhmc/compile_su3_rhmd_hisq_quda.sh, that demonstrates this.
Keeping the multi-shift solver vectors on the GPU after the multi-shift solver is complete is left as a future optimization exercise.
This pull request adds support for the QUDA HISQ force rewrite in https://github.com/lattice/quda/pull/682.
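
For a sense of scale on the 36^3x72 figure, here is a back-of-the-envelope estimate of the size of a single gauge field. It is only a sketch under stated assumptions: uncompressed 3x3 complex links, four links per site, double precision. The helper name `gauge_field_bytes` is hypothetical, and the numbers are not measurements of the actual QUDA footprint, which depends on precision, link reconstruction, and the number of temporaries the force computation holds.

```python
# Rough size of one uncompressed SU(3) gauge field (illustrative assumptions only).
def gauge_field_bytes(dims, bytes_per_real=8):
    sites = 1
    for d in dims:
        sites *= d
    reals_per_link = 3 * 3 * 2  # 3x3 complex matrix, real and imaginary parts
    links_per_site = 4          # one link per space-time direction
    return sites * links_per_site * reals_per_link * bytes_per_real

GIB = 2**30
size = gauge_field_bytes((36, 36, 36, 72))
print(f"{size / GIB:.2f} GiB per field")
print(f"{(32 * GIB) // size} such fields fit in 32 GiB")
```

Under these assumptions, each field-sized temporary costs a little under 2 GiB at this volume, so every temporary eliminated frees a noticeable share of a 32 GiB device.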
@mathiaswagner @stevengottlieb @detar