This PR adds a single work-group implementation of __parallel_find_or:
- it does not use atomic-based synchronization;
- it does not use sycl::buffer to return the result when USM memory is available on the device (it uses __result_and_scratch_storage, prepared earlier by @julianmi);
- kernel compilation has been removed from __parallel_find_or and the related code.
This approach gives a large performance boost for small data sizes.
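
To illustrate why a single work-group can avoid atomics, here is a minimal host-side sketch in plain C++. It is not the oneDPL code from this PR: `single_group_find_first`, its parameters, and the strided work split are hypothetical names and choices made for illustration only. Each simulated work-item writes its earliest match into its own slot, and the final answer is a min reduction over those slots. On a device, the barrier and reduction would presumably be done with a group operation such as `sycl::reduce_over_group`, which is an assumption about the approach and not a statement about the actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a single work-group "find first" search.
// Returns the index of the first element satisfying pred, or data.size() if none.
template <typename T, typename Pred>
std::size_t single_group_find_first(const std::vector<T>& data, Pred pred,
                                    std::size_t work_group_size)
{
    assert(work_group_size > 0);
    const std::size_t n = data.size();

    // One private slot per simulated work-item; the value n means "not found".
    std::vector<std::size_t> local_result(work_group_size, n);

    // Phase 1: each work-item scans a strided subset of the input and records
    // only its own earliest match. No slot is shared, so no atomics are needed.
    for (std::size_t item = 0; item < work_group_size; ++item)
    {
        for (std::size_t i = item; i < n; i += work_group_size)
        {
            if (pred(data[i]))
            {
                local_result[item] = i;
                break;
            }
        }
    }

    // Phase 2: on a device a group barrier would go here, followed by a
    // group-wide min reduction. On the host a plain min_element models it.
    return *std::min_element(local_result.begin(), local_result.end());
}
```

Calling `single_group_find_first(v, pred, 4)` on a small vector returns the smallest matching index across all simulated work-items, or `v.size()` when nothing matches. Because only one work-group is involved, the result can be combined inside the group without any cross-group synchronization.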