This PR is targeting the ESIMD scan development branch.
This PR implements the reduce-then-scan approach in the ESIMD scan implementation, reducing global memory accesses from ~4N to ~3N at the cost of extra compute in the second kernel.
The following changes have been made:
- `MAX_INPUTS_PER_BLOCK` has been adjusted to an empirically determined value. The prior value was based on what a single-pass implementation would use, where data is first loaded into SLM and the block size is limited by the SLM available per Xe-core. The two-kernel implementation does not have this limit, and processing 2^24 elements per block has yielded the best results in my experiments.
- The N stores of partial prefix sums to global memory have been removed from the first kernel, and an additional `prefix_sum` computation has been added to the second kernel.
- The carry-in from the previous block is now propagated in the second kernel instead of the first.
- The header comment has been updated to match.
Testing has been performed on input sizes that are powers of 2, from 2^17 to 2^28.
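
For reviewers less familiar with the pattern, here is a minimal host-side sketch of reduce-then-scan in plain C++. It is not the ESIMD kernel code from this PR: the function name `reduce_then_scan`, the `tile_size` parameter, the use of `int`, and the choice of an inclusive sum are all assumptions made for illustration, and `tile_size` does not necessarily correspond to `MAX_INPUTS_PER_BLOCK`. The first pass reads the input and stores only one sum per tile; the second pass re-reads the input, recomputes the local prefix sum seeded with the carry-in, and writes the output. That gives roughly N reads, then N reads plus N writes, which is where the ~3N figure comes from.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential model of reduce-then-scan (inclusive sum).
// Each loop over tiles stands in for work a GPU kernel would do in parallel.
std::vector<int> reduce_then_scan(const std::vector<int>& in, std::size_t tile_size) {
    assert(tile_size > 0);
    const std::size_t n = in.size();
    const std::size_t num_tiles = (n + tile_size - 1) / tile_size;

    // "Kernel 1" (reduce): read every element once, store one sum per tile.
    // No per-element partial prefix sums are written back.
    std::vector<int> tile_sums(num_tiles, 0);
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t end = std::min(n, (t + 1) * tile_size);
        for (std::size_t i = t * tile_size; i < end; ++i) {
            tile_sums[t] += in[i];
        }
    }

    // Exclusive scan over the tile sums yields the carry-in for each tile.
    std::vector<int> carry_in(num_tiles, 0);
    for (std::size_t t = 1; t < num_tiles; ++t) {
        carry_in[t] = carry_in[t - 1] + tile_sums[t - 1];
    }

    // "Kernel 2" (scan): re-read the input, recompute the local prefix sum
    // starting from the carry-in, and write each output element once.
    std::vector<int> out(n);
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t end = std::min(n, (t + 1) * tile_size);
        int running = carry_in[t];
        for (std::size_t i = t * tile_size; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
    }
    return out;
}
```

The extra compute mentioned above shows up in the second loop nest: the local prefix sum is computed there instead of being loaded from results stored by the first pass.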