Single WG implementation of `__parallel_find_or`

oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html

Apache License 2.0


Closed SergeyKopienko closed 1 month ago

SergeyKopienko commented 1 month ago

This PR adds a single work-group implementation of `__parallel_find_or`:

it does not use atomic-based synchronization;
it does not use `sycl::buffer` to return the result when USM is available on the device (we use the `__result_and_scratch_storage` prepared by @julianmi earlier);
kernel compilation has been removed from `__parallel_find_or` and the related code.

This approach gives us a big performance boost for small data sizes.
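
To illustrate the idea for reviewers, here is a rough host-side sketch of an atomic-free "find first" inside one work-group. It is plain C++ that emulates the work-items sequentially, not the code from this PR and not oneDPL API: the name `single_wg_find_first`, the strided scan, and the tree reduction are all assumptions made for the example. On a device, the `local` vector would presumably live in work-group local memory, with a group barrier between the two phases.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only, not part of oneDPL).
// Emulates one work-group of `wg_size` work-items looking for the first
// element of `data` that satisfies `pred`. Returns data.size() if none does.
template <typename T, typename Pred>
std::size_t single_wg_find_first(const std::vector<T>& data, Pred pred,
                                 std::size_t wg_size)
{
    const std::size_t n = data.size();
    if (wg_size == 0)
        return n;

    // One slot per work-item; stands in for work-group local memory.
    // Each slot starts at n, meaning "nothing found".
    std::vector<std::size_t> local(wg_size, n);

    // Phase 1: work-item `id` scans indices id, id + wg_size, ... and stores
    // its first match in its own slot. No slot is shared, so no atomics.
    for (std::size_t id = 0; id < wg_size; ++id)
    {
        for (std::size_t i = id; i < n; i += wg_size)
        {
            if (pred(data[i]))
            {
                local[id] = i;
                break;
            }
        }
    }

    // (On a device, a work-group barrier would go here.)

    // Phase 2: tree reduction with min over the per-item slots.
    // Each step would also be separated by a barrier on a device.
    for (std::size_t stride = 1; stride < wg_size; stride *= 2)
    {
        for (std::size_t id = 0; id + stride < wg_size; id += 2 * stride)
            local[id] = std::min(local[id], local[id + stride]);
    }

    return local[0];
}
```

Because a single work-group can synchronize with barriers alone, the result needs only one final write by one work-item, which is why no atomic operations are required in this sketch.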

SergeyKopienko commented 1 month ago

@julianmi, @danhoeflinger, @adamfidel does anybody have additional comments on this PR? It looks like all comments have been addressed now.