[KT] Single Pass Copy_if Kernel Template

This PR adds a pair of APIs (iterator and range variants) for oneapi::dpl::experimental::kt::gpu::copy_if. Each takes an input sequence, an output sequence, a single-element sequence that receives the number of elements copied (this count is left on the device), and a predicate. The API copies every input element that satisfies the predicate to the output, preserving the relative order of the elements, and records the number of elements copied.
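For reviewers who want a baseline to compare results against, the intended semantics can be sketched on the host with the standard library. This is only a reference model under my own assumptions, not the kernel template or its signature: the name reference_copy_if, its parameters, and the use of std::vector are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Host-side reference for the semantics only; this is NOT the oneDPL kernel
// template. The function name and parameter names are hypothetical.
template <typename T, typename Pred>
void reference_copy_if(const std::vector<T>& in, std::vector<T>& out,
                       std::vector<std::size_t>& num_copied, Pred pred)
{
    // Precondition (assumed): `out` can hold every matching element and
    // `num_copied` has exactly one element that receives the count.
    // std::copy_if is stable, so the relative order of kept elements is preserved.
    auto out_end = std::copy_if(in.begin(), in.end(), out.begin(), pred);
    num_copied[0] = static_cast<std::size_t>(std::distance(out.begin(), out_end));
}
```

A test can run the kernel template and this reference on the same input, then compare the copied prefix of the output and the single-element count.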

Other additions within this PR

Tests for these new APIs

Additional notable details:

Refactor of the oneDPL mainline copy_if single workgroup implementation to lift the "copy to host" of the number-of-elements-copied return value up one level, so that the new kernel template can reuse the implementation
Refactor of the scan kernel template to share its lookback phase and allocation manager with copy_if
Adjust the lookback phase to use the last subgroup / last work-item, rather than the first subgroup / first work-item, for operations that only a single subgroup or work-item should perform. This lets copy_if propagate "running" scan values without extra intra-workgroup communication. I don't believe this change negatively impacts the scan KT.
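To illustrate the last point, here is a sequential host model of a tiled, single-pass copy_if. It is a sketch of my reading of the design, not code from this PR: tiled_copy_if_model, tile_size, and carried are hypothetical names, and the real kernel runs tiles concurrently and resolves the carried value through the lookback phase. The model only shows that the running count after a tile's last element is already the inclusive total that later tiles need, which is why the last work-item is a natural place to publish it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential host model of a tiled, single-pass copy_if. NOT oneDPL code;
// all names here are hypothetical. Requires tile_size > 0 and an `out`
// large enough to hold every matching element.
template <typename T, typename Pred>
std::size_t tiled_copy_if_model(const std::vector<T>& in, std::vector<T>& out,
                                std::size_t tile_size, Pred pred)
{
    // Inclusive count of copied elements published by all previous tiles.
    std::size_t carried = 0;
    for (std::size_t tile_begin = 0; tile_begin < in.size(); tile_begin += tile_size)
    {
        const std::size_t tile_end = std::min(in.size(), tile_begin + tile_size);
        // Each tile starts writing at the offset carried in from earlier tiles.
        std::size_t running = carried;
        for (std::size_t i = tile_begin; i < tile_end; ++i)
        {
            if (pred(in[i]))
            {
                out[running] = in[i];
                ++running;
            }
        }
        // After the tile's last element, `running` is the value the last
        // work-item of the tile would hold; it is published for later tiles.
        carried = running;
    }
    // Total number of elements copied.
    return carried;
}
```

With a first-work-item scheme, the tile total would have to be sent back from the end of the tile before it could be published, which is the extra communication the change avoids in this model.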

Adapted from previous work by AidanBeltonS, Alcpz, joeatodd, adamfidel

oneapi-src / oneDPL #1616