Closed by victor-eds 1 month ago
Picked up the task recently. Work in progress.
The purpose of raising unstructured loads/stores to block pointers is to be able to use 2D block loads/stores (i.e. `TritonGEN::Matrix2DBlockLoadOp` and `TritonGEN::Matrix2DBlockStoreOp`).
We should therefore ensure that the raised code contains enough information to be enhanced with the `dot` layout and `dpas` encoding needed to lower loads and stores to 2D block loads/stores.
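For context, here is a minimal Triton sketch contrasting the two access forms. It is illustrative only (function names, parameters, and masking are not from the PR): the first helper uses the unstructured pointer-tensor pattern the pass consumes; the second uses the block-pointer pattern it raises to, which carries the explicit shape/stride information the backend needs to emit a single 2D block load per tile.

```python
import triton
import triton.language as tl

# Unstructured form: a 2D tensor of pointers computed by hand. The base,
# shape, and strides are implicit in the arithmetic, so the backend has to
# treat the access as a gather.
@triton.jit
def load_unstructured(a_ptr, stride_am, stride_ak, M, K,
                      BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    mask = (offs_m[:, None] < M) & (offs_k[None, :] < K)
    return tl.load(a_ptrs, mask=mask, other=0.0)

# Block-pointer form: base, shape, and strides are explicit in the
# descriptor, so loading the whole tile can become one 2D block load.
@triton.jit
def load_block_ptr(a_ptr, stride_am, stride_ak, M, K,
                   BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    a_block_ptr = tl.make_block_ptr(base=a_ptr, shape=(M, K),
                                    strides=(stride_am, stride_ak),
                                    offsets=(0, 0),
                                    block_shape=(BLOCK_M, BLOCK_K),
                                    order=(1, 0))
    return tl.load(a_block_ptr, boundary_check=(0, 1))
```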
The PR needs some rework, as a unit test is not appropriate for testing this.
We are discussing the best way to carry out this multi-layer test.
Still looking for the best way to carry out this multi-layer test.
Waiting for the pass to be able to handle real use cases, such as the 03 tutorial, so that we can set up e2e tests based on them.
The pass was used to raise the memory accesses of the 03 tutorial. The raised tutorial shows performance similar to that of the matrix multiplication written with user-provided block pointer ops (see below), which demonstrates that the generated code is better than the unstructured accesses. This issue can therefore be closed.
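For reference, a sketch of the block-pointer form of the tutorial's inner K-loop, i.e. the baseline the raised code was compared against (a device-side helper with an illustrative signature, not the tutorial's exact code):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_inner(a_block_ptr, b_block_ptr, K,
                 BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                 BLOCK_K: tl.constexpr):
    # Each tl.load on a block pointer can map to one 2D block load, and
    # tl.advance is plain arithmetic on the block descriptor.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_block_ptr, boundary_check=(0, 1))
        b = tl.load(b_block_ptr, boundary_check=(0, 1))
        acc += tl.dot(a, b)
        a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_K))
        b_block_ptr = tl.advance(b_block_ptr, (BLOCK_K, 0))
    return acc
```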
Raising leads to the expected codegen as per the comment above. This issue can be closed.
As @mfrancepillois said in https://github.com/intel/intel-xpu-backend-for-triton/pull/1395#pullrequestreview-2128572018, this pass in itself won't lead to better performance. Check that the pipeline is capable of modifying the `tt.load` encoding, thus leading to better codegen; if not, create follow-up issues to fix this. A good indicator of this would be `tt.*` operations being lowered to 2D block memory access operations.
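One possible shape for such a check in an e2e test, assuming the compiled-kernel handle exposes per-stage IR dumps via `.asm` and that the TritonGEN 2D block ops keep `2Dblock` in their printed form (both assumptions depend on the Triton version and backend):

```python
def uses_2d_block_io(compiled_kernel) -> bool:
    # ASSUMPTION: `compiled_kernel.asm` maps stage names (e.g. "ttgir") to IR
    # text, and the 2D block ops/intrinsics contain "2Dblockload" or
    # "2Dblockstore" in their printed form. Adjust to the actual output if not.
    dumps = getattr(compiled_kernel, "asm", {}) or {}
    return any(
        isinstance(ir, str) and ("2Dblockload" in ir or "2Dblockstore" in ir)
        for ir in dumps.values()
    )

# Usage sketch:
#   handle = matmul_kernel[grid](...)  # launching returns a compiled handle
#   assert uses_2d_block_io(handle), "tt.* ops not lowered to 2D block I/O"
```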