paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

HMC on A100 spends large amounts of time in memory copy #378

Open edbennett opened 2 years ago

edbennett commented 2 years ago

Since benchmarks show we can get 1.7 TFLOP/s for the Wilson kernel on one A100 but only about 230 GFLOP/s on an AMD Rome node, it would seem reasonable to expect the HMC to run faster on the former than on the latter. However, this isn't what I currently see in production. The time in the CG inversion does go down, as this is done on the GPU, but the time in the momentum update goes up. Profiling this in NVIDIA Nsight Systems shows that the GPU is well utilised in the CG inversion, but there is then a long period of heavy traffic to and from the device during the momentum update (30% host-to-device, 70% device-to-host). (Overall I see 10–15% of CUDA time in kernels and 85–90% in memory copies.)
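For anyone wanting to reproduce this, something like the command lines below should give the same kind of breakdown. This is only a sketch: the binary name and launch wrapper are placeholders, and the nsys stats report names vary between Nsight Systems versions.

```bash
# Profile one trajectory and capture the CUDA kernel / memcpy timeline.
# ./HMC_binary stands in for the actual HMC executable.
nsys profile -o hmc_profile --trace=cuda,nvtx \
    mpirun -np 1 ./HMC_binary --grid 24.24.24.48 --accelerator-threads 8

# Summarise GPU time: kernel time vs. memcpy time, split by direction.
nsys stats --report cuda_gpu_kern_sum,cuda_gpu_mem_time_sum hmc_profile.nsys-rep
```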

The tests I've run have been on a 24.24.24.48 lattice, on a single A100 for the GPU tests, and a single 128-core CPU node for the CPU tests, both on Tursa. I've tested SU(3) fundamental, SU(2) fundamental, and SU(2) adjoint, and both RHMC and HMC, and see similar behaviours in all (although I haven't fully profiled every combination).

Is this currently expected behaviour? Is it possible I've made some trivial error in my configuration of Grid, or how I am running it? (The script I'm using closely follows the one in systems/Tursa.) Or do I need to recalibrate my expectations? (E.g. perhaps DWF has a much greater ratio of CG to momentum updates, so shows a much larger speedup in HMC?)

edbennett commented 2 years ago

Further digging shows that at least part of this is user error: I was compiling with --disable-accelerator-cshift, which unsurprisingly pushes the cshift done within the staples onto the host, resulting in a lot of host-device data transfer. With the accelerated cshift enabled, I still see 35% of CUDA time in memory, with 65% in kernels; I will dig further to see where that remaining memory time is spent.
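For completeness, a sketch of a GPU build line with the accelerated cshift enabled is below. Only --enable-accelerator-cshift is the flag at issue here; the other flags are just illustrative of a typical CUDA build and may not match the actual systems/Tursa script.

```bash
# Illustrative CUDA configure line; only --enable-accelerator-cshift is the
# flag at issue here. The remaining flags sketch a typical GPU build and may
# differ from the real Tursa configuration.
../configure \
    --enable-comms=mpi \
    --enable-simd=GPU \
    --enable-accelerator=cuda \
    --enable-accelerator-cshift \
    --enable-shm=nvlink \
    CXX=nvcc MPICXX=mpicxx
```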

edbennett commented 2 years ago

Some more notes: After the CG, there are two distinct phases visible in the profile (which can be lined up with the log file):

  1. The force calculation on the fermion field. This shows a lot of device->host copying (51%, with 20% host->device and only 29% in kernels), which appears to be within Grid::SchurDifferentiableOperator. This is particularly long for the RHMC, but is also present for the non-rational plain HMC case.
  2. The force calculations on the gauge field for successive momentum updates. These do run on the GPU, but do not come close to fully occupying it. In the fundamental RHMC case there is almost no communication between the host and device here (13%; 87% compute), but for the adjoint HMC case the communication is still substantial (38%; 62% compute), albeit not as high as in the fermion force calculation. I need to check how this behaves for adjoint RHMC; I suspect it will be similar to adjoint HMC. Removing this bottleneck may not speed things up by much, however, as the fundamental RHMC case doesn't saturate the GPU much more than the adjoint HMC case does.
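As an aside, I matched these phases to the log by timestamps; a cleaner way would be to wrap each phase in an NVTX range so it shows up as a named interval in the Nsight Systems timeline. A minimal sketch is below (I haven't checked whether Grid already has hooks for this; the range names and surrounding function are made up):

```cpp
// Sketch: annotate the two momentum-update phases with NVTX ranges so they
// appear as labelled intervals in Nsight Systems. Link with -lnvToolsExt.
// Range names and the surrounding function are illustrative only.
#include <nvToolsExt.h>

void momentum_update_step() {
  nvtxRangePushA("fermion_force");   // phase 1: fermion force (SchurDifferentiableOperator)
  // ... fermion force calculation ...
  nvtxRangePop();

  nvtxRangePushA("gauge_force");     // phase 2: gauge force / staples
  // ... gauge force calculation ...
  nvtxRangePop();
}
```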

Still to do:

  * Dig more to see if I've missed another configure option that will avoid point 1 above
  * Test the adjoint RHMC
  * Try and see why the GPU isn't saturated in point 2 above, possibly using ncu

edbennett commented 2 years ago

> Dig more to see if I've missed another configure option that will avoid point 1 above

Looking more closely, the wait time appears to be in calls to setCheckerboard from MpcDeriv and MpcDagDeriv. There is an alternative function, acceleratorSetCheckerboard, defined in Lattice_transfer.h along with setCheckerboard, but git grep indicates it is never used anywhere in the repository. Could this be used instead here, with a configure parameter similar to --enable-accelerator-cshift mentioned above? Or is the function not working, or unsuitable for other reasons?
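To be concrete about the kind of change I mean, a sketch is below. The macro name is invented, the variable names are placeholders, and I'm assuming acceleratorSetCheckerboard is a drop-in replacement taking the same arguments as setCheckerboard.

```cpp
// Purely a sketch of the proposed switch, in the spirit of
// --enable-accelerator-cshift. GRID_ACCELERATOR_SETCHECKERBOARD is an
// invented configure-time macro; 'full' and 'half' stand in for the
// lattices passed at the call sites in MpcDeriv / MpcDagDeriv.
#ifdef GRID_ACCELERATOR_SETCHECKERBOARD
  acceleratorSetCheckerboard(full, half);  // device-side variant from Lattice_transfer.h
#else
  setCheckerboard(full, half);             // existing call that shows up as wait time
#endif
```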

> Test the adjoint RHMC

As expected this behaves the same in adjoint RHMC as in adjoint HMC.

> Try and see why the GPU isn't saturated in point 2 above, possibly using ncu

Still to do: on the first attempt, my laptop couldn't open the ncu output as it was too big, so I need to work out how to filter it.
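In case it's useful, ncu can be restricted to a small window of kernel launches, which should keep the report down to a size my laptop can open. Something along these lines, where the kernel-name regex, skip/count values and binary name are all placeholders:

```bash
# Profile only a short window of launches to keep the report small.
# The kernel-name regex, launch-skip/count values and binary are placeholders.
ncu -o gauge_force_profile \
    -k regex:placeholder_kernel \
    --launch-skip 1000 --launch-count 20 \
    ./HMC_binary --grid 24.24.24.48
```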