The double precision performance of fft-sycl lags behind fft-cuda, achieving only 70% of CUDA's performance. Profiling with nsys (nsys nvprof --print-gpu-trace) reveals that the SYCL code utilizes more registers (255 registers/thread) than its CUDA counterpart (72 registers/thread), leading to register spills as observed with -Xcuda-ptxas --verbose.
Interestingly, the single precision version of the code shows comparable performance between SYCL and CUDA. Although SYCL uses (83 registers/thread) and cuda uses (48 registers/thread), there is no register spill for both.
Seeking recommendations to enhance the performance of the double precision SYCL implementation would be greatly appreciated.
Describe the bug
The double precision performance of fft-sycl lags behind fft-cuda, achieving only 70% of CUDA's performance. Profiling with nsys (
nsys nvprof --print-gpu-trace
) reveals that the SYCL code utilizes more registers (255 registers/thread) than its CUDA counterpart (72 registers/thread), leading to register spills as observed with -Xcuda-ptxas --verbose.Interestingly, the single precision version of the code shows comparable performance between SYCL and CUDA. Although SYCL uses (83 registers/thread) and cuda uses (48 registers/thread), there is no register spill for both.
Seeking recommendations to enhance the performance of the double precision SYCL implementation would be greatly appreciated.
To reproduce
To reproduce
Clone https://github.com/zjin-lcf/HeCBench
For sycl (double precission):
cd HeCBench/src/fft-sycl
make CUDA=yes CUDA_ARCH=sm_80 GCC_TOOLCHAIN=""
Run code:
./main 3 1000
For cuda (double precission):
cd HeCBench/src/fft-cuda
make ARCH=sm_80
./main 3 1000
To build the single precision versions, ensure the
SINGLE_PRECISION
preprocessor directive is defined by editingmain.cpp
orMakefile
.Here is the result of my run on NVIDIA A100:
For Double Precision
For Single Precision
Environment
Platforms: 3 Platform [#1]: Version : OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3 Name : Intel(R) FPGA Emulation Platform for OpenCL(TM) Vendor : Intel(R) Corporation Devices : 1 Device [#0]: Type : fpga Version : OpenCL 1.2 Name : Intel(R) FPGA Emulation Device Vendor : Intel(R) Corporation Driver : 2024.17.5.0.08_160000.xmain-hotfix Num SubDevices : 0 Num SubSubDevices : 0 Aspects : accelerator fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_atomic_host_allocations usm_atomic_shared_allocations ext_oneapi_srgb ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_intel_fpga_task_sequence ext_oneapi_private_alloca info::device::sub_group_sizes: 4 8 16 32 64 Architecture: unknown Platform [#2]: Version : OpenCL 3.0 LINUX Name : Intel(R) OpenCL Vendor : Intel(R) Corporation Devices : 1 Device [#1]: Type : cpu Version : OpenCL 3.0 (Build 0) Name : Intel(R) Xeon(R) CPU @ 2.20GHz Vendor : Intel(R) Corporation Driver : 2024.17.5.0.08_160000.xmain-hotfix Num SubDevices : 0 Num SubSubDevices : 0 Aspects : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_private_alloca info::device::sub_group_sizes: 4 8 16 32 64 Architecture: x86_64 Platform [#3]: Version : CUDA 12.4 Name : NVIDIA CUDA BACKEND Vendor : NVIDIA Corporation Devices : 1 Device [#0]: Type : gpu Version : 8.0 Name : NVIDIA A100-SXM4-40GB Vendor : NVIDIA Corporation Driver : CUDA 12.4 UUID : 9031191727913913712018114216122172201180135 Num SubDevices : 0 Num SubSubDevices : 0 Aspects : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_widthImages are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime. ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag info::device::sub_group_sizes: 32 Architecture: nvidia_gpu_sm_80 default_selector() : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4] accelerator_selector() : fpga, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.5.0.08_160000.xmain-hotfix] cpu_selector() : cpu, Intel(R) OpenCL, Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix] gpu_selector() : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4] custom_selector(gpu) : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4] custom_selector(cpu) : cpu, Intel(R) OpenCL, Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix] custom_selector(acc) : fpga, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.5.0.08_160000.xmain-hotfix]