intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

Revamp `-tritonintelgpu-optimize-reduction-locality` #2752

Open victor-eds opened 6 days ago

victor-eds commented 6 days ago

`-tritonintelgpu-optimize-reduction-locality` is incorrect: the register reordering it introduces may lead to incorrect results. It can also be greatly improved so that optimal layouts are propagated instead of suboptimal sliced ones.
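
For context, the "sliced" layout is the result encoding that a reduction over axis 1 of a DPAS-encoded tensor carries in TritonGPU, i.e. (with `#dpas` standing for the input's DPAS encoding):

```mlir
// Result encoding of a tt.reduce over axis 1 of a tensor encoded with #dpas:
#triton_gpu.slice<{dim = 1, parent = #dpas}>
```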

A DPAS layout that "covers" the tensor in dimension 0 can be represented as a 7D layout:

```mlir
#triton_gpu.blocked<{
    sizePerThread = [1, repeat_count, rep_cluster[1], rep_cluster[0], 1,
                     shape[1]/(execution_size*rep_cluster[1]*warps_per_cta[1]), 1],
    threadsPerWarp = [16, 1, 1, 1, 1, 1, 1],
    warpsPerCTA = [1, 1, 1, 1, warps_per_cta[1], 1, warps_per_cta[0]],
    order = [0, 1, 2, 3, 4, 5, 6]}>
```

Dimensions 0, 2, 4 and 5 of this layout correspond to dimension 1 of the original layout, while dimensions 1, 3 and 6 correspond to dimension 0.
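
For concreteness, a hypothetical instantiation (the DPAS parameters `repeat_count = 8`, `execution_size = 16`, `rep_cluster = [1, 2]`, `warps_per_cta = [4, 2]` and the 32x128 tensor shape are made up for illustration):

```mlir
// Dimension 5: shape[1] / (execution_size * rep_cluster[1] * warps_per_cta[1])
//            = 128 / (16 * 2 * 2) = 2.
#blocked7d = #triton_gpu.blocked<{
    sizePerThread = [1, 8, 2, 1, 1, 2, 1],
    threadsPerWarp = [16, 1, 1, 1, 1, 1, 1],
    warpsPerCTA = [1, 1, 1, 1, 2, 1, 4],
    order = [0, 1, 2, 3, 4, 5, 6]}>
// Per-dimension extents (sizePerThread * threadsPerWarp * warpsPerCTA) are
// [16, 8, 2, 1, 2, 2, 4]: dims 0, 2, 4 and 5 multiply to 128 = shape[1] and
// dims 1, 3 and 6 multiply to 32 = shape[0], so the layout covers the tensor exactly once.
```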

A reduction over axis 1 of the original DPAS layout (the fast-changing axis) can then be expressed in this new layout as a series of reductions over the dimensions corresponding to the original axis 1 (0, 2, 4 and 5), followed by a reshape back to the original result shape and a conversion to the final layout.
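
A rough sketch of what this could look like, reusing the hypothetical 32x128 example and `#blocked7d` layout from above; the op spellings and intermediate types are illustrative assumptions, not necessarily the exact IR the revamped pass should emit:

```mlir
// Assume %v7d already holds the data in the 7D view (getting there without
// reordering registers is part of what this issue is about):
//   %v7d : tensor<16x8x2x1x2x2x4xf32, #blocked7d>

// Reduce one of the dimensions corresponding to the original axis 1 (here dim 5).
// Dims 4, 2 and 0 would be reduced the same way, each step producing a slice of
// the previous encoding.
#slice5 = #triton_gpu.slice<{dim = 5, parent = #blocked7d}>
%r5 = "tt.reduce"(%v7d) <{axis = 5 : i32}> ({
^bb0(%a: f32, %b: f32):
  %sum = arith.addf %a, %b : f32
  tt.reduce.return %sum : f32
}) : (tensor<16x8x2x1x2x2x4xf32, #blocked7d>) -> tensor<16x8x2x1x2x4xf32, #slice5>

// After also reducing dims 4, 2 and 0, only dims 1, 3 and 6 remain (8x1x4, i.e.
// the 32 elements of the original dim 0). The last step reshapes this back to the
// 32-element result and converts it to the layout the original reduction would
// have produced (a slice of the DPAS layout).
```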

This last step is crucial to get right: it has to be split into the reshape and the layout conversion(s) in exactly the right order.

As the original layout is suboptimal and reshape operations propagate layouts, swapping the reshape and the layout conversions would lead to the suboptimal layout being propagated. This is what we are getting wrong in the current pass, in addition to the semantics change caused by register reordering.