Improve GEMM performance of shape 4096x8x128x16384

intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

MIT License

Closed: ESI-SYD closed this 2 weeks ago

ESI-SYD commented 2 weeks ago

This change (a grid order adjustment to improve the cache hit rate) originates from https://github.com/intel/intel-xpu-backend-for-triton/pull/2600. It applies to batched GEMM only. Performance reaches ~99% of XeTLA for shape 4096x8x128x16384.
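For readers unfamiliar with what "grid order" means here, the following is a minimal sketch in plain Python, not the code from this PR. It assumes a batched GEMM launched on a 1D grid where each program id is decoded into a (batch, tile row, tile column) triple. The function names and the small tile counts are hypothetical and chosen only for illustration. The two decoders cover the same set of tiles but place different tiles next to each other in launch order, which is the property a grid order adjustment changes.

```python
def decode_batch_slowest(pid, num_batches, tiles_m, tiles_n):
    """Hypothetical ordering: all tiles of one batch get consecutive ids."""
    tiles_per_batch = tiles_m * tiles_n
    batch, tile = divmod(pid, tiles_per_batch)
    pid_m, pid_n = divmod(tile, tiles_n)
    return batch, pid_m, pid_n


def decode_batch_fastest(pid, num_batches, tiles_m, tiles_n):
    """Hypothetical ordering: the same tile of every batch gets consecutive ids."""
    tile, batch = divmod(pid, num_batches)
    pid_m, pid_n = divmod(tile, tiles_n)
    return batch, pid_m, pid_n


def distinct_batches(decode, window, num_batches, tiles_m, tiles_n):
    """Count how many batches the first `window` program ids touch."""
    return len({decode(pid, num_batches, tiles_m, tiles_n)[0]
                for pid in range(window)})


if __name__ == "__main__":
    # Illustrative sizes only (assumption): 4 batches, 1 x 8 output tiles each.
    B, TM, TN = 4, 1, 8
    for decode in (decode_batch_slowest, decode_batch_fastest):
        touched = distinct_batches(decode, 8, B, TM, TN)
        print(decode.__name__, "touches", touched, "batch(es) in the first 8 ids")
```

Which ordering gives more cache hits depends on the hardware scheduler and on which operand neighboring programs share, so the sketch only shows how the ordering changes locality; it makes no claim about which variant the PR uses.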