Open danhoeflinger opened 2 months ago
Interestingly, scan performance originally regressed by about 10% when the "solo" actions in the lookback were performed by the last subgroup and the last workitem of that subgroup, with the broadcast originating from the last workitem of the workgroup, rather than by the zeroth of each.
I do not understand why this might be. copy_if needs to use the last of each here to take advantage of the location of the data which needs to be communicated. I've adjusted the shared helper function for lookback to allow each individual algorithm to dictate the active subgroup, the active workitem and the source of the broadcast, and this repaired the performance regression for scan.
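A minimal sketch of what "letting the algorithm dictate the active subgroup, workitem and broadcast source" could look like, written as plain host-side C++ so it can be read without a SYCL toolchain. Every name here (LookbackRoles, first_roles, last_roles, is_solo) is hypothetical and is not taken from the oneDPL sources; the sketch only illustrates the index arithmetic of choosing the zeroth or the last of each.

```cpp
#include <cstddef>

// Hypothetical description of who performs the "solo" lookback actions.
// These names are illustrative assumptions, not oneDPL internals.
struct LookbackRoles
{
    std::size_t active_subgroup;  // subgroup that runs the lookback
    std::size_t active_workitem;  // lane in that subgroup doing the solo actions
    std::size_t broadcast_source; // workgroup-local id that originates the broadcast
};

// Zeroth of each: the choice the text above describes for scan.
constexpr LookbackRoles first_roles(std::size_t /*num_subgroups*/, std::size_t /*subgroup_size*/)
{
    return LookbackRoles{0, 0, 0};
}

// Last of each: the choice the text above describes for copy_if.
constexpr LookbackRoles last_roles(std::size_t num_subgroups, std::size_t subgroup_size)
{
    return LookbackRoles{num_subgroups - 1, subgroup_size - 1, num_subgroups * subgroup_size - 1};
}

// True when the given (subgroup, lane) pair is the one performing solo actions.
constexpr bool is_solo(const LookbackRoles& roles, std::size_t subgroup_id, std::size_t lane)
{
    return subgroup_id == roles.active_subgroup && lane == roles.active_workitem;
}
```

With 4 subgroups of 32 workitems, last_roles selects subgroup 3, lane 31 and workgroup-local id 127, while first_roles selects index 0 for all three. One plausible reason to prefer the last workitem, stated here as an assumption, is that after an in-group inclusive scan it is the last workitem that holds the group total.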
This PR adds a pair of APIs (iterator and range variants) for oneapi::dpl::experimental::kt::gpu::copy_if, which take an input and output sequence, a sequence representing a single element to store the number of elements copied (which is left on the device), and a predicate. Each element of the input which satisfies the predicate is copied to the output, preserving the relative order of the elements, and the number of elements copied is recorded.
Other additions within this PR
Additional notable details:
Changes the copy_if single workgroup implementation to lift the "copy to host" of the num copied return value out one level, enabling its use by the new kernel template
Changes the scan kernel template to share its lookback phase and allocation manager with copy_if. I don't believe this change negatively impacts the scan KT.
Adapted from previous work by AidanBeltonS, Alcpz, joeatodd, adamfidel
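To make the copy_if semantics described in this PR concrete, here is a sequential host-side reference, assuming only standard C++. The helper name reference_copy_if is hypothetical and its signature is not the kernel template's signature; it is only an oracle that a test could compare device results against.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference for the semantics only (hypothetical helper, not the
// oneDPL kernel template API): copy every element satisfying pred into out,
// keep relative order, and return how many elements were copied.
// Assumes out has room for at least as many elements as will be copied.
template <typename T, typename Pred>
std::size_t reference_copy_if(const std::vector<T>& in, std::vector<T>& out, Pred pred)
{
    std::size_t num_copied = 0;
    for (const T& value : in)
    {
        if (pred(value))
        {
            out[num_copied] = value;
            ++num_copied;
        }
    }
    return num_copied;
}
```

For the input {1, 2, 3, 4, 5, 6} and a predicate selecting even values, this writes 2, 4, 6 to the front of the output and returns 3. The kernel template differs in that the count is stored in a device-resident single-element sequence rather than returned to the host.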