Closed — SergeyKopienko closed this 4 weeks ago
@julianmi, @danhoeflinger, @adamfidel the implementation has been updated:
__required_iters_per_work_item
in
__parallel_find_or_n_groups_tuner<oneapi::dpl::__internal::__device_backend_tag>::operator()
is now computed in formula form. We still see a good performance gain for many sizes.
@danhoeflinger, @julianmi, @adamfidel Could you please take a look again?
A few questions:
- It is stated that the performance is better for larger input sizes. Does this have any effect on smaller input sizes?
- For which devices do we see a performance benefit?
Any input that fits into one work-group with 32 source data items per work item is already packed into a single work-group; in this PR we tune only the larger data sizes.
In this PR we tune the number of groups in the
__parallel_find_or
pattern. This approach gives us a performance boost on larger data sizes.
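A minimal sketch of the group-count tuning idea described above: once the input no longer fits into a single work-group at 32 elements per work item, scale the number of groups so that per-item work stays near that target. The constants, names, and clamp value here are assumptions for illustration, not the tuner's actual logic:

```cpp
#include <cstddef>
#include <algorithm>

// Hypothetical sketch: pick a number of work-groups for __parallel_find_or.
// Small inputs stay in one work-group (unchanged by this PR); larger inputs
// get more groups so each work item handles roughly max_iters_per_item
// elements. All constants are illustrative assumptions.
std::size_t tune_n_groups(std::size_t n, std::size_t wg_size,
                          std::size_t max_iters_per_item = 32,
                          std::size_t max_groups = 1024)
{
    const std::size_t per_group = wg_size * max_iters_per_item;
    if (n <= per_group)
        return 1; // fits in one work-group: keep the single-group path

    // Ceiling division, clamped to an assumed device-dependent maximum.
    const std::size_t groups = (n + per_group - 1) / per_group;
    return std::min(groups, max_groups);
}
```

In a real tuner the clamp would come from device queries (e.g. compute-unit count) rather than a fixed constant, which is likely why the observed benefit varies by device.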