exclusive scan on GPU - Githubissues

oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html

Apache License 2.0

724 stars 113 forks

exclusive scan on GPU #1882

Closed: jinz2014 closed this issue 1 month ago

jinz2014 commented 1 month ago

Hello. You may already be aware of the performance difference between CUB and oneDPL on an NVIDIA GPU (sm_90).

https://github.com/zjin-lcf/HeCBench/tree/master/src/scan3-cuda
https://github.com/zjin-lcf/HeCBench/tree/master/src/scan3-sycl

./main 1000 33554432
Executing kernel for 1000 iterations
-------------------------------------------
Average execution time of oneDPL exclusive scan: 2704.45 (us)
Average execution time of CUDA Thrust exclusive scan: 534.145 (us)
Average execution time of CUDA CUB exclusive scan: 527.786 (us)

mmichel11 commented 1 month ago

Hi, @jinz2014. Thank you very much for reporting this performance issue to us.

I ran your benchmarks on an sm_90 device and see that Thrust is ~1.9x faster than oneDPL and CUB is ~2.0x faster. This is using oneDPL commit 3dd8eab944e31776ffb4eacc2bcd864b40c33132, icpx 2024.2.1 with the Codeplay plugin, and CUDA 12.5.

Could you provide some details on the environment, compiler, oneDPL version, etc. that you used when collecting your numbers, so I can try to reproduce the >5x gap you are seeing on my side?

jinz2014 commented 1 month ago

Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) with Codeplay plugin
oneDPL is 2022.6
CUDA 12.4
GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8

Thanks.

jinz2014 commented 1 month ago

Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) with Codeplay plugin
oneDPL is 2022.6
ROCm 6.2.0
AMD gfx90a GPU

./main 1000 33554432

Executing kernel for 1000 iterations
-------------------------------------------
Average execution time of oneDPL exclusive scan: 934.141 (us)
Average execution time of HIP Thrust exclusive scan: 243.626 (us)
Average execution time of HIP CUB exclusive scan: 220.479 (us)

mmichel11 commented 1 month ago

Thank you for the additional information and the AMD performance data. I was able to reproduce the AMD performance differences with oneDPL 2022.6. On sm_90, however, my results with oneDPL 2022.6.0 still differ somewhat from yours: I see that Thrust is ~2.28x faster and CUB is ~2.40x faster than oneDPL. This is using CUDA 12.4 and oneAPI 2024.2.1 with the Codeplay plugin.

There is quite a bit of room to tune our current reduce-then-scan approach on these architectures. However, reaching performance parity likely requires a single-pass scan with decoupled lookback, which minimizes global memory accesses. That algorithm requires forward-progress guarantees between work-groups, which cannot be directly queried in SYCL. We are exploring some SYCL extensions to see whether this can be achieved within our library in the long term.
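
To illustrate why the reduce-then-scan approach mentioned above costs more memory traffic, here is a minimal sequential C++ sketch of the general technique. It is not oneDPL's implementation: the function name `reduce_then_scan_exclusive` and the `block_size` parameter are hypothetical, and each "block" stands in for a work-group. The input is read twice (once to reduce, once to scan), whereas a single-pass scan reads it once by having each block publish its aggregate and later blocks wait on their predecessors, which is where the forward-progress requirement comes from.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative CPU sketch of a blocked "reduce-then-scan" exclusive scan.
// The name and block_size are hypothetical; block_size is assumed to be > 0.
std::vector<long long> reduce_then_scan_exclusive(const std::vector<int>& in,
                                                  long long init,
                                                  std::size_t block_size = 4) {
    const std::size_t n = in.size();
    std::vector<long long> out(n);
    if (n == 0) return out;
    const std::size_t num_blocks = (n + block_size - 1) / block_size;

    // Pass 1 (reduce): first full read of the input, one partial sum per block.
    std::vector<long long> block_sum(num_blocks, 0);
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * block_size);
        for (std::size_t i = b * block_size; i < end; ++i) block_sum[b] += in[i];
    }

    // Exclusive scan over the small array of block sums gives each block's carry-in.
    std::vector<long long> carry(num_blocks);
    long long running = init;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        carry[b] = running;
        running += block_sum[b];
    }

    // Pass 2 (scan): second full read of the input, plus one write per element.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        long long acc = carry[b];
        const std::size_t end = std::min(n, (b + 1) * block_size);
        for (std::size_t i = b * block_size; i < end; ++i) {
            out[i] = acc;  // exclusive: element i is not included in out[i]
            acc += in[i];
        }
    }
    return out;
}
```

In this sketch the two outer block loops are independent across blocks, so on a GPU they could run without any ordering between work-groups; the price is roughly 2N reads plus N writes, compared with about N reads plus N writes for a single-pass scan.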

jinz2014 commented 1 month ago

Thank you for your answers.