Closed — SergeyKopienko closed this 4 weeks ago
@julianmi, @danhoeflinger, @adamfidel the implementation has been updated:
__required_iters_per_work_item
in
__parallel_find_or_n_groups_tuner<oneapi::dpl::__internal::__device_backend_tag>::operator()
is now computed in formula form. We still see a good performance gain for many sizes.
@danhoeflinger, @julianmi, @adamfidel Could you please take a look again?
A few questions:
- It is stated that the performance is better for larger input sizes. Does this have any effect on smaller input sizes?
- For which devices do we see a performance benefit?
Any input that fits into one work-group with 32 source data items per work item is already packed into a single work-group; in this PR we tune only the larger data sizes.
In this PR we tune the number of groups in the
__parallel_find_or
pattern. This approach gives us a performance boost on larger data sizes.
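A minimal sketch of the group-count tuning idea described above: once the input no longer fits into a single work-group at 32 elements per work item, scale the number of groups so that per-item work stays near that target. The constants, names, and clamp value here are assumptions for illustration, not the tuner's actual logic:

```cpp
#include <cstddef>
#include <algorithm>

// Hypothetical sketch: pick a number of work-groups for __parallel_find_or.
// Small inputs stay in one work-group (unchanged by this PR); larger inputs
// get more groups so each work item handles roughly max_iters_per_item
// elements. All constants are illustrative assumptions.
std::size_t tune_n_groups(std::size_t n, std::size_t wg_size,
                          std::size_t max_iters_per_item = 32,
                          std::size_t max_groups = 1024)
{
    const std::size_t per_group = wg_size * max_iters_per_item;
    if (n <= per_group)
        return 1; // fits in one work-group: keep the single-group path

    // Ceiling division, clamped to an assumed device-dependent maximum.
    const std::size_t groups = (n + per_group - 1) / per_group;
    return std::min(groups, max_groups);
}
```

In a real tuner the clamp would come from device queries (e.g. compute-unit count) rather than a fixed constant, which is likely why the observed benefit varies by device.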