[XPU][OptRed] Revamp `-tritonintelgpu-optimize-reduction-locality`

intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

MIT License

[XPU][OptRed] Revamp `-tritonintelgpu-optimize-reduction-locality` #2800

Open victor-eds opened 3 days ago

victor-eds commented 3 days ago

The original implementation had two critical issues:

- Functional: It did not preserve register order, so it computed a different reduction from the one in the input program.
- Performance: When converting back to the original tensor type, it emitted `reshape(convert_layout(res))`. The reshape operation then served as an anchor, so the suboptimal slice layout was propagated.
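To illustrate the functional issue, here is a minimal Python sketch, not the pass itself. It assumes a toy model in which a 2x8 tensor is reduced along its rows by first viewing it as 2x2x4 and summing the inner and then the outer dimension. The names `split_reduce` and `flat_reordered`, and the column-major scrambling, are hypothetical stand-ins for a reshape that does not keep element order.

```python
# Toy model (assumption): a 2x8 tensor reduced along axis 1, one sum per row.
ROWS, COLS = 2, 8
tensor = [[r * 10 + c for c in range(COLS)] for r in range(ROWS)]


def reduce_rows(t):
    """Reference reduction: sum every row of the 2-D tensor."""
    return [sum(row) for row in t]


def split_reduce(flat, rows, outer, inner):
    """View `flat` as rows x outer x inner, then sum `inner` and `outer`."""
    result = []
    for r in range(rows):
        partials = []
        for o in range(outer):
            base = (r * outer + o) * inner
            partials.append(sum(flat[base:base + inner]))
        result.append(sum(partials))
    return result


# Order-preserving flattening: elements stay grouped by row.
flat_in_order = [x for row in tensor for x in row]
# Hypothetical reordering: a column-major walk mixes elements across rows.
flat_reordered = [tensor[r][c] for c in range(COLS) for r in range(ROWS)]

expected = reduce_rows(tensor)                       # [28, 108]
kept = split_reduce(flat_in_order, ROWS, 2, 4)       # [28, 108]
scrambled = split_reduce(flat_reordered, ROWS, 2, 4) # [52, 84]
```

In this sketch the grand total is unchanged by the reordering, but the per-row results differ because each group now contains elements from both rows.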

This was fixed as follows:

- Functional: Preserve register order, so the same reduction is computed.
- Performance: Emit `convert_layout(reshape(res))` when converting back to the original type, so that the more efficient layout is propagated instead.
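The difference between the two orderings can be sketched with a toy model in Python. This is only an illustration under stated assumptions: the `Value` class, the labels `"optimized"` and `"slice"`, and the rule that a reshape keeps its operand's layout label are all hypothetical simplifications, not Triton's actual encodings or inference rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    shape: str   # "reduced" (form used inside the pass) or "original"
    layout: str  # "optimized" or "slice" (labels only, hypothetical)


def convert_layout(v, layout):
    """Toy convert_layout: same shape, new layout label."""
    return Value(v.shape, layout)


def reshape(v, shape):
    """Toy reshape: new shape, layout label carried over from the operand."""
    return Value(shape, v.layout)


def old_order(res):
    """reshape(convert_layout(res)): the reshape sees the slice layout."""
    converted = convert_layout(res, "slice")
    return converted.layout, reshape(converted, "original")


def new_order(res):
    """convert_layout(reshape(res)): the reshape sees the optimized layout."""
    reshaped = reshape(res, "original")
    return res.layout, convert_layout(reshaped, "slice")


res = Value("reduced", "optimized")
old_operand_layout, old_final = old_order(res)
new_operand_layout, new_final = new_order(res)
```

Both orderings end with a value of the original type, but only in the second one does the reshape operate on the optimized layout. If the reshape acts as an anchor, as described above, the layout it operates on is the one that spreads to neighboring operations.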

See the implementation for further details.

Closes #2752