SYCL shows lower performance than CUDA in the double-precision FFT benchmark sample

Describe the bug

The double-precision performance of fft-sycl lags behind fft-cuda, reaching only about 70% of CUDA's performance. Profiling with nsys (nsys nvprof --print-gpu-trace) shows that the SYCL kernel uses far more registers (255 registers/thread) than its CUDA counterpart (72 registers/thread). Compiling with -Xcuda-ptxas --verbose confirms that the SYCL build spills registers.

Interestingly, the single-precision version of the code shows comparable performance between SYCL and CUDA. SYCL uses 83 registers/thread and CUDA uses 48 registers/thread, but neither build spills registers.

Any recommendations for improving the performance of the double-precision SYCL implementation would be greatly appreciated.

To reproduce

Clone https://github.com/zjin-lcf/HeCBench

For SYCL (double precision):
cd HeCBench/src/fft-sycl
make CUDA=yes CUDA_ARCH=sm_80 GCC_TOOLCHAIN=""
Run code: ./main 3 1000

For CUDA (double precision):

cd HeCBench/src/fft-cuda
make ARCH=sm_80
Run code: ./main 3 1000

To build the single-precision versions, define the SINGLE_PRECISION preprocessor macro by editing main.cpp or the Makefile.

Here are the results of my runs on an NVIDIA A100:

For Double Precision

SYCL:

~/HeCBench/src/fft-sycl$ ./main 3 1000
used_bytes=268435456, n_cmplx=16777216
FFT PASS
iFFT PASS
Average kernel execution time 0.00117844 (s)

CUDA:

~/HeCBench/src/fft-cuda$ ./main 3 1000
used_bytes=268435456, n_cmplx=1.67772e+07
FFT PASS
iFFT PASS
Average kernel execution time 0.000832436 (s)

For Single Precision

SYCL:

~/HeCBench/src/fft-sycl$ ./main 3 1000
used_bytes=268435456, n_cmplx=33554432
FFT PASS
iFFT PASS
Average kernel execution time 0.000819632 (s)

CUDA:

~/HeCBench/src/fft-cuda$ ./main 3 1000
used_bytes=268435456, n_cmplx=3.35544e+07
FFT PASS
iFFT PASS
Average kernel execution time 0.000798072 (s)
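
As a sanity check on the "70%" figure, the ratios can be recomputed from the average kernel times reported above. This is only a minimal Python sketch: the helper name relative_performance is mine and not part of HeCBench, and it assumes performance is inversely proportional to the average kernel execution time.

```python
def relative_performance(reference_s, candidate_s):
    """Return the candidate's throughput as a fraction of the reference.

    Assumes throughput is inversely proportional to kernel time, so a
    candidate that takes longer than the reference scores below 1.0.
    """
    return reference_s / candidate_s


# Average kernel execution times in seconds, copied from the runs above.
timings = {
    "double": {"sycl": 0.00117844, "cuda": 0.000832436},
    "single": {"sycl": 0.000819632, "cuda": 0.000798072},
}

# SYCL performance relative to CUDA for each precision.
ratios = {
    precision: relative_performance(t["cuda"], t["sycl"])
    for precision, t in timings.items()
}

for precision, ratio in ratios.items():
    print(f"{precision} precision: SYCL reaches {ratio:.1%} of CUDA performance")
```

With these timings the double-precision ratio comes out at roughly 0.71 and the single-precision ratio at roughly 0.97, which matches the description above.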

Environment

OS: Ubuntu 23.10
NVIDIA A100

clang++ --version:

$ clang++ --version
clang version 19.0.0git (https://github.com/intel/llvm 666cf66258363ba1c416d054cab38c85c04fe389)
Target: x86_64-unknown-linux-gnu
Thread model: posix
Build config: +assertions

sycl-ls --verbose:


$ sycl-ls --verbose
[opencl:fpga][opencl:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu][opencl:1] Intel(R) OpenCL,            Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]

Platforms: 3
Platform [#1]:
    Version  : OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
    Name     : Intel(R) FPGA Emulation Platform for OpenCL(TM)
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type              : fpga
        Version           : OpenCL 1.2
        Name              : Intel(R) FPGA Emulation Device
        Vendor            : Intel(R) Corporation
        Driver            : 2024.17.5.0.08_160000.xmain-hotfix
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : accelerator fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_atomic_host_allocations usm_atomic_shared_allocations ext_oneapi_srgb ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_intel_fpga_task_sequence ext_oneapi_private_alloca
        info::device::sub_group_sizes: 4 8 16 32 64
        Architecture: unknown
Platform [#2]:
    Version  : OpenCL 3.0 LINUX
    Name     : Intel(R) OpenCL
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#1]:
        Type              : cpu
        Version           : OpenCL 3.0 (Build 0)
        Name              : Intel(R) Xeon(R) CPU @ 2.20GHz
        Vendor            : Intel(R) Corporation
        Driver            : 2024.17.5.0.08_160000.xmain-hotfix
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_private_alloca
        info::device::sub_group_sizes: 4 8 16 32 64
        Architecture: x86_64
Platform [#3]:
    Version  : CUDA 12.4
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type              : gpu
        Version           : 8.0
        Name              : NVIDIA A100-SXM4-40GB
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.4
        UUID              : 9031191727913913712018114216122172201180135
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width
Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_80
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
accelerator_selector()  : fpga, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.5.0.08_160000.xmain-hotfix]
cpu_selector()          : cpu, Intel(R) OpenCL, Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
custom_selector(cpu)    : cpu, Intel(R) OpenCL, Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
custom_selector(acc)    : fpga, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.5.0.08_160000.xmain-hotfix]



### Additional context

_No response_
