Hi,
Thank you for the data. We've recently been working to improve reduce_by_segment performance; the changes are available in https://github.com/oneapi-src/oneDPL/pull/608. Once it is reviewed and merged to main, would you please take a look to see whether it addresses the difference you're seeing?
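For reference, a minimal sketch of how reduce_by_segment is typically invoked (the data, sizes, and buffer names below are illustrative assumptions, not the benchmark code from this thread):

```cpp
// Minimal sketch of oneapi::dpl::reduce_by_segment with sycl::buffer inputs.
// All names and values here are illustrative assumptions.
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<int>   keys   = {0, 0, 1, 1, 1, 2};  // segment ids
    std::vector<float> values = {1, 2, 3, 4, 5, 6};  // values to reduce
    std::vector<int>   keys_out(keys.size());
    std::vector<float> values_out(values.size());

    {
        sycl::buffer<int>   keys_buf(keys.data(), keys.size());
        sycl::buffer<float> vals_buf(values.data(), values.size());
        sycl::buffer<int>   keys_out_buf(keys_out.data(), keys_out.size());
        sycl::buffer<float> vals_out_buf(values_out.data(), values_out.size());

        // Reduces runs of consecutive equal keys:
        // output keys {0, 1, 2}, output sums {3, 12, 6}.
        oneapi::dpl::reduce_by_segment(
            oneapi::dpl::execution::dpcpp_default,
            oneapi::dpl::begin(keys_buf), oneapi::dpl::end(keys_buf),
            oneapi::dpl::begin(vals_buf),
            oneapi::dpl::begin(keys_out_buf),
            oneapi::dpl::begin(vals_out_buf));
    }  // buffers synchronize back to the host vectors here
    return 0;
}
```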
@zjin-lcf Do you remember whether the oneDPL runs were on a V100? I was seeing a build issue with the CUDA backend (for complex types, reported here) and just wanted to check with you.
Yes, on a V100. However, I didn't evaluate the performance of the functions for complex types. Is there a reproducer for https://github.com/intel/llvm/issues/8281?
I was hitting the intel/llvm#8281 issue when building oneDPL for CUDA backend. Not from a test-case.
Did you clone the oneDPL repo and then specify "clang++ -I ./oneDPL/include -I ./oneTBB/include ..."?
I was actually referring to building the oneDPL repo itself and running the unit tests inside the repo, without the TBB backend (serial only):
```sh
cmake .. -DCMAKE_CXX_COMPILER=clang++ -DONEDPL_BACKEND=dpcpp_only \
  -DCMAKE_BUILD_TYPE=Release -DONEDPL_USE_UNNAMED_LAMBDA=ON \
  -DCMAKE_INSTALL_PREFIX=$PWD/../install_oneapi_PrgEnvgnu \
  -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -fmath-errno -ffast-math"
```
@timmiesmith
Just an update after rerunning the example. The throughput reaches ~10.2 G/s.
@zjin-lcf would you please let me know which commit of oneDPL you're using? #862 was recently merged to improve reduce_by_segment performance. This is still an algorithm we're working to improve, and I want to confirm that the improvement you're seeing comes from the recent PR merge.
In the oneDPL directory, "git log" shows:

```
commit c697fac0b51ce2a36f3824bb9063dfaf6aee88ac (HEAD -> main, origin/release/2022.2, origin/main, origin/HEAD)
Author: Dan Hoeflinger <109972525+danhoeflinger@users.noreply.github.com>
Date:   Tue Jun 6 14:02:14 2023 -0400
```
Thanks.
Thank you. This does include the recent reduce_by_segment performance improvements.
I tried calling the oneDPL (USM-based) and Thrust functions for segment reduction. The pointers (d_keys, d_in, d_keys_out, and d_out) point to device memory.
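The snippet itself is not shown in this thread; a minimal reconstruction of the two calls being compared might look like the following (the element count n, int keys, float values, and plus-reduction are assumptions; d_keys, d_in, d_keys_out, and d_out are the device pointers named above):

```cpp
// Hedged reconstruction, not the exact benchmark code from this issue.
// oneDPL version (built with clang++ -fsycl ... for the CUDA backend):
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <sycl/sycl.hpp>

void onedpl_segment_reduce(sycl::queue &q, size_t n,
                           const int *d_keys, const float *d_in,
                           int *d_keys_out, float *d_out) {
    auto policy = oneapi::dpl::execution::make_device_policy(q);
    // USM device pointers can be passed directly as iterators.
    oneapi::dpl::reduce_by_segment(policy, d_keys, d_keys + n, d_in,
                                   d_keys_out, d_out);
    q.wait();  // include completion in the timed region
}
```

```cpp
// Thrust version (separate translation unit, built with nvcc):
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

void thrust_segment_reduce(size_t n,
                           const int *d_keys, const float *d_in,
                           int *d_keys_out, float *d_out) {
    thrust::reduce_by_key(thrust::device, d_keys, d_keys + n, d_in,
                          d_keys_out, d_out);
    cudaDeviceSynchronize();  // include completion in the timed region
}
```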
I measured the execution time of the above code snippet. The performance results on an NVIDIA V100 GPU show a significant difference. If this is not what you observed, please let me know. Thank you.