oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0

compare reduce_by_segment using oneDPL and Thrust #632

Open zjin-lcf opened 2 years ago

zjin-lcf commented 2 years ago

I tried calling the oneDPL and Thrust functions for segmented reduction using USM. The pointers (d_keys, d_in, d_keys_out, and d_out) point to device memory.

    for (int i = 0; i < repeat; i++)
      oneapi::dpl::reduce_by_segment(policy, d_keys, d_keys + num_elements, d_in,
                            d_keys_out, d_out);

    for (int i = 0; i < repeat; i++)
      thrust::reduce_by_key(thrust::device, d_keys, d_keys + num_elements, d_in,
                            d_keys_out, d_out);

I measure the execution time of the code snippets above. The performance results on an NVIDIA V100 GPU show a significant difference. If this is not what you observed, please let me know. Thank you.
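
For reference, here is a minimal sketch of the kind of oneDPL setup assumed for this measurement: USM device allocations, a device policy built on a sycl::queue, and a host-side timer around the call. The sizes, fill values, and variable names beyond those shown above are illustrative only, not the exact benchmark code.

    // Assumed setup sketch, not the exact benchmark: USM device allocations,
    // a device policy on a sycl::queue, and a host-side timer. The fill
    // values are placeholders; the real benchmark generates segmented keys.
    #include <oneapi/dpl/execution>
    #include <oneapi/dpl/algorithm>   // oneDPL extension API (reduce_by_segment)
    #include <sycl/sycl.hpp>
    #include <chrono>
    #include <cstdio>

    int main() {
      const size_t num_elements = 1 << 24;      // smaller than the reported runs
      sycl::queue q{sycl::gpu_selector_v};
      auto policy = oneapi::dpl::execution::make_device_policy(q);

      int *d_keys     = sycl::malloc_device<int>(num_elements, q);
      int *d_in       = sycl::malloc_device<int>(num_elements, q);
      int *d_keys_out = sycl::malloc_device<int>(num_elements, q);
      int *d_out      = sycl::malloc_device<int>(num_elements, q);
      q.fill(d_keys, 0, num_elements).wait();   // placeholder key pattern
      q.fill(d_in, 1, num_elements).wait();

      auto t0 = std::chrono::steady_clock::now();
      oneapi::dpl::reduce_by_segment(policy, d_keys, d_keys + num_elements, d_in,
                                     d_keys_out, d_out);
      q.wait();
      auto t1 = std::chrono::steady_clock::now();
      std::printf("oneDPL time: %f s\n",
                  std::chrono::duration<double>(t1 - t0).count());

      sycl::free(d_keys, q);     sycl::free(d_in, q);
      sycl::free(d_keys_out, q); sycl::free(d_out, q);
      return 0;
    }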

oneDPL
num_elements = 268435456
num_segments = 16777216 segment_size = 16 Throughput = 2.070432 (G/s)
num_segments = 8388608 segment_size = 32 Throughput = 2.084561 (G/s)
num_segments = 4194304 segment_size = 64 Throughput = 2.091851 (G/s)
num_segments = 2097152 segment_size = 128 Throughput = 2.094333 (G/s)
num_segments = 1048576 segment_size = 256 Throughput = 2.089707 (G/s)
num_segments = 524288 segment_size = 512 Throughput = 2.094500 (G/s)
num_segments = 262144 segment_size = 1024 Throughput = 2.097956 (G/s)
num_segments = 131072 segment_size = 2048 Throughput = 2.099221 (G/s)
num_segments = 65536 segment_size = 4096 Throughput = 2.095594 (G/s)
num_segments = 32768 segment_size = 8192 Throughput = 2.100392 (G/s)
num_segments = 16384 segment_size = 16384 Throughput = 2.100950 (G/s)
Thrust
num_elements = 268435456
num_segments = 16777216 segment_size = 16 Throughput = 65.469612 (G/s)
num_segments = 8388608 segment_size = 32 Throughput = 71.512650 (G/s)
num_segments = 4194304 segment_size = 64 Throughput = 77.745247 (G/s)
num_segments = 2097152 segment_size = 128 Throughput = 80.560936 (G/s)
num_segments = 1048576 segment_size = 256 Throughput = 80.120583 (G/s)
num_segments = 524288 segment_size = 512 Throughput = 83.196640 (G/s)
num_segments = 262144 segment_size = 1024 Throughput = 81.201439 (G/s)
num_segments = 131072 segment_size = 2048 Throughput = 79.137405 (G/s)
num_segments = 65536 segment_size = 4096 Throughput = 84.549011 (G/s)
num_segments = 32768 segment_size = 8192 Throughput = 82.928703 (G/s)
num_segments = 16384 segment_size = 16384 Throughput = 95.465546 (G/s)
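
For completeness, a corresponding sketch of how the Thrust side might be set up, assuming cudaMalloc'd device pointers and cudaEvent-based timing (again illustrative rather than the exact benchmark):

    // Assumed setup sketch for the Thrust measurement: raw device pointers
    // from cudaMalloc and cudaEvent timing. Initialization is illustrative.
    #include <thrust/reduce.h>
    #include <thrust/execution_policy.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
      const size_t num_elements = 1 << 24;
      int *d_keys, *d_in, *d_keys_out, *d_out;
      cudaMalloc((void**)&d_keys,     num_elements * sizeof(int));
      cudaMalloc((void**)&d_in,       num_elements * sizeof(int));
      cudaMalloc((void**)&d_keys_out, num_elements * sizeof(int));
      cudaMalloc((void**)&d_out,      num_elements * sizeof(int));
      cudaMemset(d_keys, 0, num_elements * sizeof(int));  // placeholder key pattern
      cudaMemset(d_in,   0, num_elements * sizeof(int));

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord(start);
      thrust::reduce_by_key(thrust::device, d_keys, d_keys + num_elements, d_in,
                            d_keys_out, d_out);
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);
      float ms = 0.f;
      cudaEventElapsedTime(&ms, start, stop);
      std::printf("Thrust time: %f ms\n", ms);

      cudaFree(d_keys); cudaFree(d_in); cudaFree(d_keys_out); cudaFree(d_out);
      return 0;
    }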
timmiesmith commented 2 years ago

Hi,

Thank you for the data. We've recently been working to improve reduce_by_segment performance; the changes are available in https://github.com/oneapi-src/oneDPL/pull/608. Once it is reviewed and merged to main, would you please take a look to see if it addresses the difference you're seeing?

abagusetty commented 1 year ago

@zjin-lcf Do you remember whether the oneDPL runs were on a V100? I was seeing some build issues with the CUDA backend (for complex types, reported here) and just wanted to check with you.

zjin-lcf commented 1 year ago

Yes, on V100. However, I didn't evaluate the performance of the functions for complex types. Is there a reproducer for https://github.com/intel/llvm/issues/8281 ?

abagusetty commented 1 year ago

I was hitting intel/llvm#8281 when building oneDPL itself for the CUDA backend, not from a test case.

zjin-lcf commented 1 year ago

Did you clone the oneDPL repo and then specify "clang++ -I ./oneDPL/include -I./oneTBB/include ... " ?

abagusetty commented 1 year ago

> Did you clone the oneDPL repo and then specify "clang++ -I ./oneDPL/include -I./oneTBB/include ... " ?

I was actually referring to building the oneDPL repo itself and running the unit tests inside the repo, without the TBB backend (just serial):

    cmake .. -DCMAKE_CXX_COMPILER=clang++ -DONEDPL_BACKEND=dpcpp_only -DCMAKE_BUILD_TYPE=Release -DONEDPL_USE_UNNAMED_LAMBDA=ON -DCMAKE_INSTALL_PREFIX=$PWD/../install_oneapi_PrgEnvgnu -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -fmath-errno -ffast-math"
zjin-lcf commented 1 year ago

@timmiesmith

Just an update after rerunning the example. The throughput reaches ~10.2 G/s.

timmiesmith commented 1 year ago

@zjin-lcf would you please let me know which commit of oneDPL you're using? #862 was merged recently to improve reduce_by_segment performance. This is still an algorithm we're working to improve, and I want to confirm the improvement you're seeing is from the recent PR merge.

zjin-lcf commented 1 year ago

In the oneDPL directory, "git log" shows

    commit c697fac0b51ce2a36f3824bb9063dfaf6aee88ac (HEAD -> main, origin/release/2022.2, origin/main, origin/HEAD)
    Author: Dan Hoeflinger <109972525+danhoeflinger@users.noreply.github.com>
    Date:   Tue Jun 6 14:02:14 2023 -0400

Thanks.

timmiesmith commented 1 year ago

Thank you. This does include the recent reduce_by_segment performance improvements.