[#6 GEMM Performance] enable stream K for gemm

Dewei-Wang-sh commented 2 months ago

Enable the Stream-K or Split-K feature for GEMM.

LiyangLingIntel commented 3 weeks ago

A draft version of the GEMM Stream-K kernel is available in 10-experimental-block-pointer-streamk.py. It produces correct results on the fallback path. To measure and reach the best performance, a series of features, including atomic operation support, still has to be backported to the block pointer path.
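For readers unfamiliar with the scheme, here is a rough pure-Python model of how Stream-K distributes work and why it needs atomic accumulation. It is only a sketch of the scheduling idea, not the kernel in the draft file. The function name `streamk_matmul` and its parameters are hypothetical, and plain lists stand in for device tensors.

```python
def streamk_matmul(a, b, block_m, block_n, block_k, num_workers):
    """Model Stream-K scheduling on the host (illustrative only).

    The work is flattened into (output tile, K-iteration) units and
    split evenly across workers, so one output tile may be produced
    by more than one worker.
    """
    m, k, n = len(a), len(b), len(b[0])
    tiles_m = -(-m // block_m)          # ceiling division
    tiles_n = -(-n // block_n)
    iters_per_tile = -(-k // block_k)
    total_iters = tiles_m * tiles_n * iters_per_tile

    c = [[0] * n for _ in range(m)]
    base, extra = divmod(total_iters, num_workers)
    start = 0
    for w in range(num_workers):
        # Worker w owns the contiguous iteration range [start, end).
        end = start + base + (1 if w < extra else 0)
        it = start
        while it < end:
            tile_id, k_iter = divmod(it, iters_per_tile)
            # Stop at the end of this tile or of this worker's range.
            tile_end = min(end, (tile_id + 1) * iters_per_tile)
            k_lo = k_iter * block_k
            k_hi = min(k, (k_iter + (tile_end - it)) * block_k)
            tm, tn = divmod(tile_id, tiles_n)
            for i in range(tm * block_m, min(m, (tm + 1) * block_m)):
                for j in range(tn * block_n, min(n, (tn + 1) * block_n)):
                    partial = sum(a[i][p] * b[p][j] for p in range(k_lo, k_hi))
                    # On a GPU, a tile shared by several workers would
                    # need an atomic add (or a fix-up pass) here.
                    c[i][j] += partial
            it = tile_end
        start = end
    return c
```

With 4 output tiles of 3 K-iterations each and 3 workers, every worker gets 4 iterations, so each worker finishes one tile and contributes a partial sum to a neighbouring one. That shared accumulation is the part that depends on atomic support.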

LiyangLingIntel commented 3 weeks ago

For the Split-K kernel, I think we can reuse https://github.com/triton-lang/kernels/blob/main/kernels/matmul.py from upstream with minor changes.
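For comparison, Split-K can be modelled in a few lines of pure Python. This is a sketch of the general technique under simplifying assumptions, not the upstream kernel; `splitk_matmul` and `split_k` are hypothetical names used only for illustration.

```python
def splitk_matmul(a, b, split_k):
    """Model Split-K on the host (illustrative only).

    The K dimension is cut into `split_k` fixed chunks. Each chunk
    computes a partial product for every output element, and the
    partial products are summed into C.
    """
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // split_k)            # ceiling division
    c = [[0] * n for _ in range(m)]
    for s in range(split_k):
        k_lo, k_hi = s * chunk, min(k, (s + 1) * chunk)
        for i in range(m):
            for j in range(n):
                # GPU implementations commonly combine these partial
                # sums with atomic adds or a separate reduction pass.
                c[i][j] += sum(a[i][p] * b[p][j] for p in range(k_lo, k_hi))
    return c
```

Unlike Stream-K, the split factor is fixed up front and applies to every output tile, which keeps the scheduling simple but can leave compute units idle when the tile count does not divide evenly across them.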

LiyangLingIntel commented 2 weeks ago

After a deeper investigation, supporting atomic operations in the BlockPtr path requires supporting operations on tensors of pointers and adding more blocked layouts, with potential layout conversions. This is complex and was not considered in the original BlockPtr path design. After an offline discussion with @Dewei-Wang-sh, we need more time to redesign the lowering strategy so that both BlockPtr and TensorOfPtr work efficiently. We therefore plan to split this task in two:

This task (#1104): implement the StreamK kernel with block pointers, submitted in PR https://github.com/intel/intel-xpu-backend-for-triton/pull/1564
Follow-up task: add atomic support to the BlockPtr path and reach 90% of XeTLA performance for StreamK: https://github.com/intel/intel-xpu-backend-for-triton/issues/1575
