paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

Populate the Cshift_table in the GPU #421

Closed (mkappas closed this 1 year ago)

mkappas commented 1 year ago

The Cshift_table vector of integer pairs is allocated in Unified Memory and is used within the Copy_plane and Copy_plane_permute functions. Although the array is read by the LambdaApply GPU kernel (provided that the ACCELERATOR_CSHIFT macro is defined), it is populated in RAM and therefore migrated between host and device every time the kernel accesses it. Since the kernel inside Copy_plane and Copy_plane_permute is called many times, this generates a large number of unnecessary Unified DtoH and HtoD memory operations that hurt performance. We therefore added a kernel, populate_Cshift_table, that fills the Cshift_table directly on the GPU, which completely eliminates the Unified Memory operations and accelerates both of the aforementioned functions; a sketch of the idea follows below.
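To make the idea concrete, here is a minimal sketch, not the exact code in this PR: the index arithmetic and the names e1, e2, stride, lo, ro follow the host-side table-building loops in Cshift_common.h, and accelerator_for is Grid's portable kernel-launch macro.

```cpp
// Before: the table is filled on the host, so the first device access
// inside LambdaApply triggers unified-memory page migrations.
for (int n = 0; n < e1; n++) {
  for (int b = 0; b < e2; b++) {
    int o = n * stride;
    Cshift_table[ent + n * e2 + b] = std::pair<int, int>(lo + o + b, ro + o + b);
  }
}

// After (sketch): the same index arithmetic runs in a device kernel,
// so the table is written directly in VRAM and never migrates.
auto table = &Cshift_table[ent];
accelerator_for(i, e1 * e2, 1, {
  int n = i / e2;
  int b = i % e2;
  int o = n * stride;
  table[i] = std::pair<int, int>(lo + o + b, ro + o + b);
});
```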

To test this kernel we configured the code with these options: --enable-comms=none --enable-simd=GPU --enable-gen-simd-width=64 --enable-accelerator=cuda --enable-Nc=2 --enable-accelerator-cshift --disable-unified --disable-gparity --enable-debug=yes CXX=nvcc CXXFLAGS='-gencode arch=compute_75,code=sm_75 -std=c++14'

and ran the Test_WilsonFlow with the --accelerator-threads 32 runtime option.

We tested the Wilson Flow on a system with an Nvidia 2060 6GB, from which the results presented below are taken, and also on an A100 80GB, with similar results. We had no access to AMD GPUs to test the HIP version, but because the code uses Grid's accelerator_barrier() instead of CUDA-specific synchronization routines, we are confident it will also compile and work under HIP.
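As a hedged illustration of that portability point (simplified; Grid's actual macros also include error checking), the synchronization stays back-end agnostic:

```cpp
// Sketch: Grid's portable launch-and-synchronize idiom. Under CUDA,
// accelerator_barrier() boils down to cudaDeviceSynchronize(); under
// HIP, to hipDeviceSynchronize(). No CUDA-specific calls are needed
// in the Cshift code path.
accelerator_forNB(i, num, 1, {
  /* device-side work, e.g. filling one Cshift_table entry */
});
accelerator_barrier();  // portable across the CUDA and HIP back ends
```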

| Metric | Before | After |
| --- | --- | --- |
| Kernels, % of total | 53.9 | 15.8 |
| Kernels, total time (ns) | 2,127,568,303 | 343,405,655 |
| Kernel invocations | 16,164 | 32,328 |
| Unified operations, total time (ns) | 149,683,169 | 171,972 |
| Unified operations, count | 89,918 | 57 |
| Unified memory transfers (MB) | 770.58 | 1.44 |

The first three rows cover the LambdaApply kernels from Copy_plane and Copy_plane_permute, plus the populate_Cshift_table kernel in the "After" column (which is why the kernel invocations double: every accelerator_for is now preceded by a call to populate_Cshift_table). The last three rows, on unified memory operations, include all operations captured during the Wilson Flow test.

The following screenshots from Nsight Systems show the first 24 invocations of LambdaApply from Copy_plane. Before the change, this section took 11.4 ms and contains hundreds of unified memory operations (page faults and speculative prefetches); after the change it takes 1.05 ms with no unified memory operations at all. Between each invocation of LambdaApply there is a call to populate_Cshift_table.

[Screenshots: Nsight Systems timelines, before and after]

Note: Currently, all Nvidia devices have a warp size of 32 threads and all AMD devices a wavefront size of 64. Thus, --accelerator-threads 32 should be the minimum for CUDA and --accelerator-threads 64 for HIP. This value affects the block size of the LambdaApply and populate_Cshift_table kernels; with values below the proposed minimum, blocks contain fewer threads than a warp, wasting resources. A small worked example follows below.
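Purely illustrative arithmetic (assuming, as a simplification, that the block size scales directly with the --accelerator-threads value; Grid's actual launch geometry also folds in the SIMD width):

```cpp
// Illustrative only, not Grid's launch code: if a block's x-dimension
// equals the --accelerator-threads value T, then on Nvidia hardware
// (warp = 32) each block still occupies ceil(T/32) full warps, so
// T < 32 leaves warp lanes idle.
#include <cstdio>

int main() {
  const int warp = 32;              // Nvidia warp size (64 on AMD)
  for (int T : {8, 16, 32, 64}) {   // candidate --accelerator-threads values
    int warps = (T + warp - 1) / warp;
    double util = 100.0 * T / (warps * warp);
    std::printf("T=%2d -> %d warp(s), %.0f%% of warp lanes active\n",
                T, warps, util);
  }
}
```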

paboyle commented 1 year ago

THANKS !!! This looks like a great speed up.