Closed. Dewei-Wang-sh closed this 2 weeks ago.
A draft version of the GEMM Stream-K kernel is available in 10-experimental-block-pointer-streamk.py. It produces correct results on the fallback path. To measure and reach the best performance, we need to backport a series of features to the block pointer path, including support for atomic operations.
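For readers unfamiliar with Stream-K, here is a rough host-side sketch in plain Python of how the work might be partitioned. It is not the code in 10-experimental-block-pointer-streamk.py; the function name `streamk_partition` and all of its parameters are hypothetical and chosen only for illustration. The idea is that the total number of K-iterations over all output tiles is divided evenly among a fixed number of workers, so a worker can start or stop in the middle of a tile.

```python
def streamk_partition(M, N, K, BLOCK_M, BLOCK_N, BLOCK_K, num_workers):
    """Return, per worker, a list of (tile_id, k_iter_begin, k_iter_end) segments.

    Hypothetical illustration only: the names and signature are assumptions.
    """
    def cdiv(a, b):
        return (a + b - 1) // b

    num_tiles = cdiv(M, BLOCK_M) * cdiv(N, BLOCK_N)
    iters_per_tile = cdiv(K, BLOCK_K)
    total_iters = num_tiles * iters_per_tile

    work = []
    for w in range(num_workers):
        # Each worker gets a contiguous, nearly equal range of global iterations.
        start = w * total_iters // num_workers
        end = (w + 1) * total_iters // num_workers
        segments = []
        it = start
        while it < end:
            tile = it // iters_per_tile        # output tile this iteration belongs to
            k_begin = it % iters_per_tile      # first K-iteration inside that tile
            k_end = min(iters_per_tile, k_begin + (end - it))
            segments.append((tile, k_begin, k_end))
            it += k_end - k_begin
        work.append(segments)
    return work


if __name__ == "__main__":
    for w, segs in enumerate(streamk_partition(256, 256, 512, 128, 128, 64, 3)):
        print(w, segs)
```

In this example, some tiles end up split between two workers, so their partial sums have to be combined. That combination step is where atomic accumulation (or a separate fix-up pass) comes in.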
For the Split-K kernel, I think we can reuse https://github.com/triton-lang/kernels/blob/main/kernels/matmul.py from upstream with minor changes.
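To illustrate why Split-K also depends on atomic accumulation, here is a minimal CPU-only sketch in plain Python. It is not the upstream matmul.py; `splitk_matmul` and `split_k` are hypothetical names used only for this example. Each K-slice produces a partial product for the same output element, and the `+=` stands in for what would be an atomic add on a GPU (for example `tl.atomic_add` in Triton), since the slices would run concurrently there.

```python
def splitk_matmul(A, B, split_k):
    """Compute C = A @ B by splitting the K dimension into `split_k` slices.

    Hypothetical illustration only; A and B are lists of lists.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    chunk = (K + split_k - 1) // split_k  # K-elements handled per slice

    # On a GPU, each pid_k would be a separate program running in parallel.
    for pid_k in range(split_k):
        k_lo = pid_k * chunk
        k_hi = min(K, k_lo + chunk)
        for i in range(M):
            for j in range(N):
                partial = sum(A[i][k] * B[k][j] for k in range(k_lo, k_hi))
                # Several slices write the same C[i][j]; this models an atomic add.
                C[i][j] += partial
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(splitk_matmul(A, B, split_k=2))
```

The result is the same for any `split_k`; only the number of writers per output element changes, which is why the accumulation needs to be atomic in a parallel setting.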
After a deeper investigation into this task: to support atomic operations in the BlockPtr path, we have to support operations on tensors of pointers and include more blocked layouts, with potential layout conversions. This is complex and was not considered in the previous BlockPtr path design. After an offline discussion with @Dewei-Wang-sh, we need more time to redesign the lowering strategy so that both BlockPtr and TensorOfPtr work efficiently. So we plan to split this task into two:
Enable the feature: Stream-K or Split-K