The Cshift_table vector of pairs of integers is allocated in Unified Memory and is used within the Copy_plane and Copy_plane_permute functions. Although the array is read by the LambdaApply GPU kernel (provided that the ACCELERATOR_CSHIFT macro is defined), it is populated in RAM and transferred to VRAM every time the kernel accesses it. Since the kernel within Copy_plane and Copy_plane_permute is called many times, this creates a large number of unnecessary Unified Memory DtoH and HtoD operations that hurt performance. We therefore added a kernel, populate_Cshift_table, that fills Cshift_table directly on the GPU; this completely eliminates the Unified Memory operations and accelerates both of the aforementioned functions.
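As a rough, self-contained illustration of the idea (not the actual Grid patch), the sketch below computes every (source, destination) index pair directly on the device, so the table is never written on the host at all. The struct, kernel name, and parameters are placeholders; the index arithmetic is modelled on the nested fill loop that Copy_plane uses on the host.

```cpp
// Hypothetical sketch: fill the shift table on the device so no host writes
// (and therefore no Unified Memory page migrations) are needed.
#include <cuda_runtime.h>

struct IndexPair { int from; int to; };   // stand-in for the pair<int,int> entries

__global__ void populate_table_sketch(IndexPair *table, int lo, int ro,
                                       int e1, int e2, int stride)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= e1 * e2) return;
  int n = idx / e2;          // outer index of the host fill loop
  int b = idx % e2;          // inner index
  int o = n * stride + b;    // offset within the plane
  table[idx].from = lo + o;  // written straight into device memory
  table[idx].to   = ro + o;
}

// Usage (illustrative): one thread per table entry, 32 threads per block.
//   int N = e1 * e2;
//   populate_table_sketch<<<(N + 31) / 32, 32>>>(d_table, lo, ro, e1, e2, stride);
```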
To test this kernel we configured the code with these options:
--enable-comms=none --enable-simd=GPU --enable-gen-simd-width=64 --enable-accelerator=cuda --enable-Nc=2 --enable-accelerator-cshift --disable-unified --disable-gparity --enable-debug=yes CXX=nvcc CXXFLAGS='-gencode arch=compute_75,code=sm_75 -std=c++14'
and ran the Test_WilsonFlow with the --accelerator-threads 32 runtime option.
We tested the Wilson Flow on a system with an Nvidia 2060 6GB, from which the results presented below were taken, and also on an A100 80GB, with similar results. We had no access to AMD GPUs to test the HIP version; however, since the code uses Grid's accelerator_barrier() instead of CUDA-specific synchronization routines, we are confident that it will also compile and work with HIP.
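For reference, here is a minimal sketch of the portable form, assuming Grid's accelerator_for/accelerator_barrier macros and Vector container; the function name and arguments are illustrative, not the exact patch. The fill runs through accelerator_for and the wait is accelerator_barrier(), which Grid expands to the synchronization call of the backend (CUDA or HIP) selected at configure time.

```cpp
#include <Grid/Grid.h>
using namespace Grid;

// Sketch only: fill a device-visible table with accelerator_for and wait with
// the portable accelerator_barrier() instead of cudaDeviceSynchronize().
void fill_table_portable_sketch(Vector<std::pair<int,int>> &table,
                                int lo, int ro, int e2, int stride)
{
  std::pair<int,int> *t = &table[0];
  int N = table.size();
  accelerator_for(i, N, 1, {
    int n = i / e2, b = i % e2, o = n * stride + b;
    t[i].first  = lo + o;   // written on the device, no host-side population
    t[i].second = ro + o;
  });
  accelerator_barrier();     // portable wait: compiles for both CUDA and HIP builds
}
```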
| | Before | After |
| --- | --- | --- |
| Kernels % of total | 53.9 | 15.8 |
| Kernels total time (ns) | 2,127,568,303 | 343,405,655 |
| Kernel invocations | 16,164 | 32,328 |
| Unified operations total time (ns) | 149,683,169 | 171,972 |
| Unified operations count | 89,918 | 57 |
| Unified memory transfers (MB) | 770.58 | 1.44 |
The first three rows include the LambdaApply kernels from Copy_plane and Copy_plane_permute plus, for the "After" column, the populate_Cshift_table kernel (which is why the kernel invocations double in the "After" column: for every accelerator_for we also launch populate_Cshift_table). The last three rows, concerning the Unified Memory operations, include all operations captured during the Wilson Flow test.
In the following screenshots from Nsight Systems we can see the first 24 invocations of LambdaApply from Copy_plane. Previously this part took 11.4 ms, with hundreds of Unified Memory operations (page faults and speculative prefetches); after the changes it takes 1.05 ms with no Unified Memory operations. Between each invocation of LambdaApply there is now a call to populate_Cshift_table.
Note: Currently, all Nvidia devices have a warp size of 32 threads and all AMD devices 64. Thus, --accelerator-threads 32 should be the minimum for CUDA and --accelerator-threads 64 for HIP. This value affects the block size of the LambdaApply and populate_Cshift_table kernels; with values smaller than those proposed, we end up with blocks that contain fewer threads than the warp size, wasting resources, as the toy example below illustrates.
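A small, self-contained CUDA illustration of the point (not Grid code; the kernel and launch parameters are made up for this example): any block smaller than the warp size still occupies a full warp, so the unused lanes are simply idle.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}   // placeholder kernel, only the launch geometry matters

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("warp size: %d\n", prop.warpSize);  // 32 on current Nvidia GPUs, 64 on AMD

  noop<<<1024, 8>>>();    // blockDim.x = 8 < warp size: 24 of 32 lanes per warp are wasted
  noop<<<1024, 32>>>();   // blockDim.x = 32: each block fills exactly one warp
  cudaDeviceSynchronize();
  return 0;
}
```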