intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

Revamp `-tritonintelgpu-optimize-reduction-locality` #2752

Open victor-eds opened 6 days ago

victor-eds commented 6 days ago

`-tritonintelgpu-optimize-reduction-locality` is incorrect: the register reordering it introduces may lead to incorrect results. It can also be greatly improved so that optimal layouts are propagated instead of suboptimal sliced ones.
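
For context, the "sliced" layout is the result encoding that a reduction over axis 1 of a DPAS-encoded tensor carries in TritonGPU, i.e. (with `#dpas` standing for the input's DPAS encoding):

```mlir
// Result encoding of a tt.reduce over axis 1 of a tensor encoded with #dpas:
#triton_gpu.slice<{dim = 1, parent = #dpas}>
```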

A DPAS layout that "covers" the tensor in dimension 0 can be represented as a 7D layout:

```mlir
#triton_gpu.blocked<{
    sizePerThread = [1, repeat_count, rep_cluster[1], rep_cluster[0], 1,
                     shape[1]/(execution_size*rep_cluster[1]*warps_per_cta[1]), 1],
    threadsPerWarp = [16, 1, 1, 1, 1, 1, 1],
    warpsPerCTA = [1, 1, 1, 1, warps_per_cta[1], 1, warps_per_cta[0]],
    order = [0, 1, 2, 3, 4, 5, 6]}>
```

Dimensions 0, 2, 4 and 5 of this layout correspond to dimension 1 of the original layout, while dimensions 1, 3 and 6 correspond to dimension 0.
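
For concreteness, a hypothetical instantiation (the DPAS parameters `repeat_count = 8`, `execution_size = 16`, `rep_cluster = [1, 2]`, `warps_per_cta = [4, 2]` and the 32x128 tensor shape are made up for illustration):

```mlir
// Dimension 5: shape[1] / (execution_size * rep_cluster[1] * warps_per_cta[1])
//            = 128 / (16 * 2 * 2) = 2.
#blocked7d = #triton_gpu.blocked<{
    sizePerThread = [1, 8, 2, 1, 1, 2, 1],
    threadsPerWarp = [16, 1, 1, 1, 1, 1, 1],
    warpsPerCTA = [1, 1, 1, 1, 2, 1, 4],
    order = [0, 1, 2, 3, 4, 5, 6]}>
// Per-dimension extents (sizePerThread * threadsPerWarp * warpsPerCTA) are
// [16, 8, 2, 1, 2, 2, 4]: dims 0, 2, 4 and 5 multiply to 128 = shape[1] and
// dims 1, 3 and 6 multiply to 32 = shape[0], so the layout covers the tensor exactly once.
```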

A reduction over axis 1 of the original DPAS layout (the fast-changing axis) can then be expressed in this new layout as a series of reductions over the dimensions corresponding to the original axis 1 (0, 2, 4 and 5), followed by a reshape back to the original result shape and a conversion to the final layout.
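
A rough sketch of what this could look like, reusing the hypothetical 32x128 example and `#blocked7d` layout from above; the op spellings and intermediate types are illustrative assumptions, not necessarily the exact IR the revamped pass should emit:

```mlir
// Assume %v7d already holds the data in the 7D view (getting there without
// reordering registers is part of what this issue is about):
//   %v7d : tensor<16x8x2x1x2x2x4xf32, #blocked7d>

// Reduce one of the dimensions corresponding to the original axis 1 (here dim 5).
// Dims 4, 2 and 0 would be reduced the same way, each step producing a slice of
// the previous encoding.
#slice5 = #triton_gpu.slice<{dim = 5, parent = #blocked7d}>
%r5 = "tt.reduce"(%v7d) <{axis = 5 : i32}> ({
^bb0(%a: f32, %b: f32):
  %sum = arith.addf %a, %b : f32
  tt.reduce.return %sum : f32
}) : (tensor<16x8x2x1x2x2x4xf32, #blocked7d>) -> tensor<16x8x2x1x2x4xf32, #slice5>

// After also reducing dims 4, 2 and 0, only dims 1, 3 and 6 remain (8x1x4, i.e.
// the 32 elements of the original dim 0). The last step reshapes this back to the
// 32-element result and converts it to the layout the original reduction would
// have produced (a slice of the DPAS layout).
```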

This last step is crucial to get right: it has to be split into the reshape and the layout conversion(s) in exactly the right order.

As the original layout is suboptimal and reshape operations propagate layouts, swapping the reshape and the layout conversions would lead to the suboptimal layout being propagated. This is what we are getting wrong in the current pass, in addition to the semantics change caused by register reordering.