[KT] Implement reduce then scan optimization in ESIMD scan prototype

This PR targets the ESIMD scan development branch.

It implements the reduce-then-scan approach in the ESIMD scan implementation, reducing global memory accesses from ~4N to ~3N at the cost of extra compute in the second kernel.

The following changes have been made:

- `MAX_INPUTS_PER_BLOCK` has been adjusted to an empirically determined value. The prior value was based on what a single-pass implementation would use, where data is first loaded into SLM and the block size is limited by the SLM available per Xe-core. The two-kernel implementation does not have this limit, and processing 2^24 elements per block has yielded the best results in my experiments.
- The N stores of partial prefix sums to global memory have been removed from the first kernel, and an additional prefix_sum computation has been added to the second kernel.
- The carry-in from the previous block is now propagated in the second kernel instead of the first.
- I have adjusted the header comment.
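
For reviewers less familiar with the technique, here is a minimal CPU-side sketch of reduce-then-scan. It is not the ESIMD code in this PR: the function name `reduce_then_scan`, the `block_size` parameter, the `int` element type, and the choice of an inclusive sum are all assumptions made for illustration. It only shows where the ~3N figure comes from: the first pass reads N elements and stores one sum per block, and the second pass reads N elements again, recomputes the local prefix sum seeded with the carry-in, and writes N results.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: sequential stand-in for the two kernels.
// Assumes block_size > 0 and an inclusive sum scan over int.
std::vector<int> reduce_then_scan(const std::vector<int>& in, std::size_t block_size)
{
    const std::size_t n = in.size();
    const std::size_t num_blocks = (n + block_size - 1) / block_size;

    // "Kernel 1": reduce each block to a single sum.
    // Reads N elements; writes only one value per block (no partial prefix sums).
    std::vector<int> block_sums(num_blocks, 0);
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * block_size);
        for (std::size_t i = b * block_size; i < end; ++i)
            block_sums[b] += in[i];
    }

    // "Kernel 2": re-read the input, recompute the prefix sum inside each block,
    // and seed it with the carry-in from all previous blocks.
    // Reads N elements and writes N results.
    std::vector<int> out(n);
    int carry_in = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * block_size);
        int running = carry_in;
        for (std::size_t i = b * block_size; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
        carry_in += block_sums[b];
    }
    return out;
}
```

By contrast, a scan-then-propagate scheme would write N partial prefix sums in the first pass and then read and rewrite them in the second, which is where the ~4N figure comes from.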

Testing has been performed on power-of-two input sizes from 2^17 to 2^28.

oneapi-src/oneDPL #1607