Closed by etiotto 1 month ago
For GEMM + preOp (e.g. exp) applied to one input of tt.dot (https://github.com/intel/intel-xpu-backend-for-triton/blob/main/benchmarks/triton_kernels_benchmark/gemm_preop_exp_benchmark.py), PR #2346 improves performance by 16%.
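For reference, the fused pattern looks roughly like the following. This is a minimal sketch, not the benchmark's actual kernel: masking, autotuning, and block-pointer loads are omitted, inputs are assumed bf16 with float32 output, and M/N/K are assumed divisible by the block sizes.

```python
import triton
import triton.language as tl

@triton.jit
def gemm_preop_exp_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        # preOp: apply exp to one tt.dot input before the dot product,
        # fused into the GEMM instead of running as a separate kernel
        a = tl.math.exp(a.to(tl.float32)).to(tl.bfloat16)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```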
For GEMM + matrix add (postOp), PR #2400 improves performance from ~66 TFlops to ~215 TFlops for an 8Kx8Kx8K shape (other shapes also improve).
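The matrix-add postOp variant differs from the sketch above only in the epilogue: after the K loop, a tile of the second matrix is loaded and added to the accumulator before the store, so the add is fused into the GEMM rather than launched as a separate kernel. A sketch of that epilogue, reusing the names from the kernel above (d_ptr, stride_dm, and stride_dn are assumed extra kernel arguments, not names from the benchmark):

```python
    # Epilogue replacing the final store in the sketch above:
    # fused postOp computing C = A @ B + D in one kernel.
    d_ptrs = d_ptr + offs_m[:, None] * stride_dm + offs_n[None, :] * stride_dn
    d = tl.load(d_ptrs)                 # tile of the matrix to add
    acc += d.to(tl.float32)             # postOp: matrix add on the tt.dot output
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```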
This is done.
We have achieved good performance (relative to the XeTLA library) for a GEMM kernel (see http://benchmarks.glados.intel.com/d/1pXX4hUSz/microbenchmarks?orgId=1). Now it is time to focus on improving the performance of several variants of the GEMM workload:
- GEMM + preOp (e.g. exp) applied to one input of tt.dot (https://github.com/intel/intel-xpu-backend-for-triton/blob/main/benchmarks/triton_kernels_benchmark/gemm_preop_exp_benchmark.py)
- GEMM + postOp (e.g. gelu) applied to the tt.dot output (https://github.com/intel/intel-xpu-backend-for-triton/blob/main/benchmarks/triton_kernels_benchmark/gemm_postop_gelu_benchmark.py)
- GEMM + postOp (matrix add) applied to the tt.dot output (https://github.com/intel/intel-xpu-backend-for-triton/blob/main/benchmarks/triton_kernels_benchmark/gemm_postop_addmatrix_benchmark.py)

Work Items