Open · edbennett opened 2 years ago
Further digging shows that at least part of this is user error: I was compiling with `--disable-accelerator-cshift`, which unsurprisingly pushes the `cshift` done within the staples onto the host, meaning a lot of data transfer. I still see 35% of CUDA time in memory, with 65% in kernels; I will dig further to try and see where that remaining time is.
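For reference, a minimal sketch of the reconfigure step, assuming Grid's usual Autoconf flow and a CUDA build; everything other than the accelerator-cshift flag is an illustrative placeholder for whatever the original build used, not a verified recipe:

```shell
# Sketch only: re-run configure with accelerator cshift enabled
# (i.e. dropping --disable-accelerator-cshift). The accelerator,
# unified-memory, and compiler options are placeholders.
./configure --enable-accelerator-cshift \
            --enable-accelerator=cuda \
            --enable-unified=no \
            CXX=nvcc
make -j 8
```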
Some more notes: after the CG, there are two distinct phases visible in the profile (which can be lined up with the log file); one of these is `Grid::SchurDifferentiableOperator`. This is particularly long for the RHMC, but is also present for the non-rational plain HMC case.

Still to do:

- Dig more to see if I've missed another `configure` option that will avoid point 1 above.
- Try and see why the GPU isn't saturated in point 2 above, possibly using `ncu`.
Looking more closely, the wait time appears to be in calls to `setCheckerboard` from `MpcDeriv` and `MpcDagDeriv`. There is an alternative function, `acceleratorSetCheckerboard`, defined in `Lattice_transfer.h` along with `setCheckerboard`, but `git grep` indicates it is never used anywhere in the repository. Could this be used instead here, with a `configure` parameter similar to the `--enable-accelerator-cshift` mentioned above? Or is the function not working, or unsuitable for other reasons?
> Test the adjoint RHMC

As expected, this behaves the same in adjoint RHMC as in adjoint HMC.
> Try and see why the GPU isn't saturated in point 2 above, possibly using `ncu`.

Still to do: on the first attempt, my laptop couldn't open the `ncu` output as it's too big, so I need to work out how to filter it.
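One possible way to filter (a sketch, not tested here; the kernel-name regex and binary name are placeholders) is to restrict what `ncu` collects in the first place, or to pull a textual summary out of the existing report rather than opening it in the GUI:

```shell
# Collect only a handful of launches of kernels of interest,
# with the smaller "basic" metric set, to keep the report small:
ncu --kernel-name regex:Deriv --launch-count 10 --set basic \
    -o deriv_profile ./my_hmc_binary

# Export an existing oversized report as CSV instead of opening it:
ncu --import deriv_profile.ncu-rep --page raw --csv > deriv_profile.csv
```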
Since benchmarks show we can get 1.7 TFLOP/s for the Wilson kernel on one A100 but only about 230 GFLOP/s on an AMD Rome node, it would seem reasonable to expect that the HMC should run faster on the former than the latter. However, this isn't what I currently see in production. The time in the CG inversion does go down, as this is done on the GPU, but the time in the momentum update goes up. Profiling this in NVIDIA Nsight Systems shows that the GPU is very well utilised in the CG inversion, but then there is a long period of heavy traffic to and from the device during the momentum update (30% host-device, 70% device-host). (I see 10–15% of CUDA usage being in kernels, and 85–90% in memory.)
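The kernel/memory breakdown quoted here can be reproduced from the command line as well as the GUI; a sketch, assuming a recent `nsys` (the stats report names have changed between versions, and the binary name is a placeholder):

```shell
# Record a timeline of the HMC run:
nsys profile -o hmc_profile ./my_hmc_binary

# Summarise kernel time vs. memcpy time (and transfer direction):
nsys stats --report cuda_gpu_kern_sum --report cuda_gpu_mem_time_sum \
    hmc_profile.nsys-rep
```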
The tests I've run have been on a 24.24.24.48 lattice, on a single A100 for the GPU tests, and a single 128-core CPU node for the CPU tests, both on Tursa. I've tested SU(3) fundamental, SU(2) fundamental, and SU(2) adjoint, and both RHMC and HMC, and see similar behaviours in all (although I haven't fully profiled every combination).
Is this currently expected behaviour? Is it possible I've made some trivial error in my configuration of Grid, or in how I am running it? (The script I'm using closely follows the one in `systems/Tursa`.) Or do I need to recalibrate my expectations? (E.g. perhaps DWF has a much greater ratio of CG to momentum updates, so shows a much larger speedup in HMC?)