oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0

[KT] Single Pass Copy_if Kernel Template #1616

Open danhoeflinger opened 2 months ago

danhoeflinger commented 2 months ago

This PR adds a pair of APIs (iterator and range variants) for oneapi::dpl::experimental::kt::gpu::copy_if. Each takes an input sequence, an output sequence, a single-element sequence that receives the number of elements copied (this count is left on the device), and a predicate. The algorithm copies every input element that satisfies the predicate to the output, preserving the relative order of the copied elements, and records how many elements were copied.
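For readers who want the intended semantics spelled out, here is a minimal host-side sketch. It is not the kernel template and does not use its signature: reference_copy_if is a hypothetical name, and the sketch assumes only what the description above states (stable order, plus a count of copied elements). It could serve as a correctness oracle when testing the device result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical host-side reference, NOT the oneDPL kernel template API.
// Copies every element of `in` that satisfies `pred` into `out`, keeping
// the relative order, and returns the number of elements copied.
template <typename T, typename Pred>
std::size_t reference_copy_if(const std::vector<T>& in, std::vector<T>& out, Pred pred)
{
    out.clear();
    std::copy_if(in.begin(), in.end(), std::back_inserter(out), pred);
    return out.size();
}
```

With the input {5, 2, 8, 1, 6, 3} and a predicate selecting even values, this yields {2, 8, 6} and a count of 3. In the kernel template the count is written to a device-resident single-element sequence rather than returned.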

Other additions within this PR

Additional notable details:

Adapted from previous work by AidanBeltonS, Alcpz, joeatodd, adamfidel

danhoeflinger commented 1 month ago

Interestingly, there was originally a regression (~10%) in scan performance when the "solo" actions in the lookback were performed by the last workitem of the last subgroup, with the broadcast originating from the last workitem of the workgroup, rather than by the zeroth of each.

I do not understand why this might be. copy_if needs to use the last subgroup and workitem here to take advantage of where the data to be communicated already resides. I've adjusted the shared lookback helper function so that each algorithm dictates the active subgroup, the active workitem, and the source of the broadcast. This repaired the performance regression for scan.
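One plausible reading of "the location of the data", offered as an assumption rather than something stated in this PR: if each workgroup computes an inclusive prefix sum over its predicate flags, the group's total number of selected elements ends up in the last position, so the last workitem already holds the value that the lookback has to publish. The host-side sketch below illustrates only that property. inclusive_flag_scan is a hypothetical name, and the vector stands in for one workgroup's workitems.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side illustration, not code from the PR.
// flags[i] is true when element i of the group satisfies the predicate.
// result[i] is the inclusive prefix count up to and including element i.
std::vector<std::size_t> inclusive_flag_scan(const std::vector<bool>& flags)
{
    std::vector<std::size_t> result(flags.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i)
    {
        running += flags[i] ? 1 : 0;
        result[i] = running;
    }
    return result;
}
```

For flags {1, 0, 1, 1, 0, 1} the scan is {1, 1, 2, 3, 3, 4}: the last entry (4) is the group total, while the zeroth entry (1) says nothing about the rest of the group. A selected element i would be written at offset result[i] - 1 within the group's portion of the output.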