Closed by etiotto 1 month ago
For GEMM + preOp (e.g. exp) applied to one input of tt.dot (https://github.com/intel/intel-xpu-backend-for-triton/blob/main/benchmarks/triton_kernels_benchmark/gemm_preop_exp_benchmark.py), PR #2346 improves performance by 16%.
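For reference, the fused pattern looks roughly like the following. This is a minimal sketch, not the benchmark's actual kernel: masking, autotuning, and block-pointer loads are omitted, inputs are assumed bf16 with float32 output, and M/N/K are assumed divisible by the block sizes.

```python
import triton
import triton.language as tl

@triton.jit
def gemm_preop_exp_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        # preOp: apply exp to one tt.dot input before the dot product,
        # fused into the GEMM instead of running as a separate kernel
        a = tl.math.exp(a.to(tl.float32)).to(tl.bfloat16)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```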
For GEMM + matrix add (postOp), PR #2400 improves performance from ~66 TFlops to ~215 TFlops for an 8Kx8Kx8K shape (other shapes also improve).
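The matrix-add postOp variant differs from the sketch above only in the epilogue: after the K loop, a tile of the second matrix is loaded and added to the accumulator before the store, so the add is fused into the GEMM rather than launched as a separate kernel. A sketch of that epilogue, reusing the names from the kernel above (d_ptr, stride_dm, and stride_dn are assumed extra kernel arguments, not names from the benchmark):

```python
    # Epilogue replacing the final store in the sketch above:
    # fused postOp computing C = A @ B + D in one kernel.
    d_ptrs = d_ptr + offs_m[:, None] * stride_dm + offs_n[None, :] * stride_dn
    d = tl.load(d_ptrs)                 # tile of the matrix to add
    acc += d.to(tl.float32)             # postOp: matrix add on the tt.dot output
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```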
This is done.
We have achieved good performance (relative to the XeTLA library) for a GEMM kernel (see http://benchmarks.glados.intel.com/d/1pXX4hUSz/microbenchmarks?orgId=1). Now it is time to focus on improving the performance of several variants of the GEMM workload:
- GEMM + preOp (e.g. exp) applied to one input of tt.dot (https://github.com/intel/intel-xpu-backend-for-triton/blob/main/benchmarks/triton_kernels_benchmark/gemm_preop_exp_benchmark.py)
- GEMM + postOp (e.g. gelu) applied to the tt.dot output (https://github.com/intel/intel-xpu-backend-for-triton/blob/main/benchmarks/triton_kernels_benchmark/gemm_postop_gelu_benchmark.py)
- GEMM + postOp (matrix add) applied to the tt.dot output (https://github.com/intel/intel-xpu-backend-for-triton/blob/main/benchmarks/triton_kernels_benchmark/gemm_postop_addmatrix_benchmark.py)

Work Items