intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[RAISE-BP] Check `-triton-raise-block-pointer` output leads to better codegen #1431

Closed victor-eds closed 1 month ago

victor-eds commented 3 months ago

As @mfrancepillois said in https://github.com/intel/intel-xpu-backend-for-triton/pull/1395#pullrequestreview-2128572018, this pass in itself won't lead to better performance. Check that the pipeline is capable of modifying the `tt.load` encoding, thus leading to better codegen. If not, create follow-up issues to fix this.

A good indicator of this would be `tt.*` operations being lowered to 2D block memory access operations.

mfrancepillois commented 3 months ago

Picked the task recently. Work in progress.

mfrancepillois commented 3 months ago

The purpose of raising unstructured load/store to block pointer is to be able to use 2D block load/store (i.e. TritonGEN::Matrix2DBlockLoadOp and TritonGEN::Matrix2DBlockStoreOp). We should therefore ensure that the raised code contains enough information to be enhanced with the dot layout and dpas encoding needed to lower load and store to 2D block load/store.
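To make the idea of "raising" concrete, here is a minimal conceptual sketch in plain Python. It is not Triton, MLIR, or the pass's actual implementation: `BlockPointer`, `unstructured_addresses` and `raise_to_block_pointer` are hypothetical names introduced only for illustration. It shows that a 2D access written as per-element pointer arithmetic carries the same information as a compact block description (base, strides, offsets, block shape) when the offsets are contiguous, which is the kind of structure a 2D block load/store needs.

```python
# Conceptual sketch only: plain Python, not Triton or MLIR.
# All names below are hypothetical and chosen for illustration.
from dataclasses import dataclass


def unstructured_addresses(base, offs_m, offs_n, stride_m, stride_n):
    """Per-element addresses, mirroring
    base + offs_m[:, None] * stride_m + offs_n[None, :] * stride_n."""
    return [[base + m * stride_m + n * stride_n for n in offs_n] for m in offs_m]


@dataclass(frozen=True)
class BlockPointer:
    """Stand-in for the information a block pointer carries."""
    base: int
    strides: tuple
    offsets: tuple
    block_shape: tuple

    def addresses(self):
        rows, cols = self.block_shape
        off_m, off_n = self.offsets
        s_m, s_n = self.strides
        return [
            [self.base + (off_m + i) * s_m + (off_n + j) * s_n for j in range(cols)]
            for i in range(rows)
        ]


def raise_to_block_pointer(base, offs_m, offs_n, stride_m, stride_n):
    """Recover a block description when both offset ranges are contiguous.

    Returns None when the access is not a dense 2D block.
    """
    def contiguous(offs):
        return all(b - a == 1 for a, b in zip(offs, offs[1:]))

    if not offs_m or not offs_n:
        return None
    if not (contiguous(offs_m) and contiguous(offs_n)):
        return None
    return BlockPointer(
        base=base,
        strides=(stride_m, stride_n),
        offsets=(offs_m[0], offs_n[0]),
        block_shape=(len(offs_m), len(offs_n)),
    )


# A 16x32 tile starting at row 32, column 64 of a row-major matrix
# with 512 elements per row (all values are made up for the example).
offs_m = list(range(32, 48))
offs_n = list(range(64, 96))
flat = unstructured_addresses(1000, offs_m, offs_n, 512, 1)
bp = raise_to_block_pointer(1000, offs_m, offs_n, 512, 1)

# Both forms describe exactly the same memory locations.
assert bp is not None and bp.addresses() == flat
```

The structured form makes the tile shape and strides explicit instead of leaving them implicit in a tensor of addresses, which is why later stages can attach layout information to it.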

mfrancepillois commented 3 months ago

The PR needs some rework, as a unit test is not an appropriate way to test this.

mfrancepillois commented 2 months ago

We are discussing the best way to carry out this multi-layer test.

mfrancepillois commented 2 months ago

Still looking for the best way to carry out this multi-layer test.

mfrancepillois commented 2 months ago

Waiting for the pass to be able to handle real use cases, such as the 03 tutorial, so that we can set up end-to-end tests based on them.

mfrancepillois commented 1 month ago

The pass was used to raise the memory accesses of the 03 tutorial. The raised 03 tutorial shows performance similar to the matrix multiplication written with user block pointer ops (see below). This shows that the generated code is better than with unstructured accesses, so this issue can be closed.

 
| M | N | K | Vanilla 03 tutorial | Raised 03 tutorial | 03 tutorial using block pointers |
|---|---|---|---|---|---|
| 256.0 | 256.0 | 256.0 | 0.672164 | 2.41052 | 2.162012 |
| 384.0 | 384.0 | 384.0 | 1.579886 | 5.617371 | 4.984428 |
| 512.0 | 512.0 | 512.0 | 2.843596 | 10.292771 | 8.971773 |
| 640.0 | 640.0 | 640.0 | 4.3749 | 12.412121 | 12.412121 |
| 768.0 | 768.0 | 768.0 | 6.347882 | 17.750189 | 17.806008 |
| 896.0 | 896.0 | 896.0 | 8.563371 | 23.787141 | 23.538061 |
| 1024.0 | 1024.0 | 1024.0 | 6.939903 | 25.565282 | 23.403267 |
| 1152.0 | 1152.0 | 1152.0 | 8.67861 | 27.939032 | 27.068409 |
| 1280.0 | 1280.0 | 1280.0 | 10.68667 | 36.92169 | 37.130877 |
| 1408.0 | 1408.0 | 1408.0 | 7.919057 | 29.518923 | 27.991469 |
| 1536.0 | 1536.0 | 1536.0 | 9.409739 | 32.44877 | 30.058715 |
| 1664.0 | 1664.0 | 1664.0 | 9.6584 | 37.049237 | 33.898196 |
| 1792.0 | 1792.0 | 1792.0 | 11.34758 | 39.285809 | 38.099742 |
| 1920.0 | 1920.0 | 1920.0 | 9.523531 | 44.526221 | 43.105284 |
| 2048.0 | 2048.0 | 2048.0 | 10.333383 | 48.541672 | 47.395359 |
| 2176.0 | 2176.0 | 2176.0 | 9.813795 | 43.717362 | 42.639082 |
| 2304.0 | 2304.0 | 2304.0 | 10.866613 | 41.759733 | 41.252666 |
| 2432.0 | 2432.0 | 2432.0 | 10.993187 | 49.369735 | 47.807647 |
| 2560.0 | 2560.0 | 2560.0 | 12.245428 | 53.690527 | 52.389509 |
| 2688.0 | 2688.0 | 2688.0 | 13.339097 | 50.804973 | 51.131333 |
| 2816.0 | 2816.0 | 2816.0 | 11.991706 | 56.43569 | 54.221239 |
| 2944.0 | 2944.0 | 2944.0 | 11.092583 | 57.447874 | 56.22752 |
| 3072.0 | 3072.0 | 3072.0 | 11.976993 | 57.059969 | 56.680673 |
| 3200.0 | 3200.0 | 3200.0 | 13.018879 | 59.199304 | 58.322654 |
| 3328.0 | 3328.0 | 3328.0 | 12.437926 | 60.872544 | 58.727205 |
| 3456.0 | 3456.0 | 3456.0 | 11.729972 | 53.281498 | 53.513591 |
| 3584.0 | 3584.0 | 3584.0 | 12.463366 | 54.910163 | 54.21438 |
| 3712.0 | 3712.0 | 3712.0 | 13.126576 | 60.04912 | 58.784756 |
| 3840.0 | 3840.0 | 3840.0 | 12.939939 | 61.161273 | 59.918629 |
| 3968.0 | 3968.0 | 3968.0 | 12.4729 | 61.000056 | 60.396207 |
| 4096.0 | 4096.0 | 4096.0 | 12.986129 | 62.476793 | 61.678287 |
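As a quick way to read the numbers, the following Python sketch computes ratios for a few rows copied from the results above. The thread does not state the unit of the measurements; the sketch assumes they are throughput figures where higher is better, and the choice of rows is arbitrary.

```python
# A handful of rows copied from the results above:
# (M = N = K, vanilla, raised, user block pointers).
# Assumption: higher values are better; the unit is not given in the thread.
rows = [
    (256, 0.672164, 2.41052, 2.162012),
    (1024, 6.939903, 25.565282, 23.403267),
    (2048, 10.333383, 48.541672, 47.395359),
    (4096, 12.986129, 62.476793, 61.678287),
]


def ratios(row):
    """Return (size, raised / vanilla, raised / block-pointer version)."""
    size, vanilla, raised, block_ptr = row
    return size, raised / vanilla, raised / block_ptr


summary = [ratios(r) for r in rows]

for size, vs_vanilla, vs_block_ptr in summary:
    print(f"{size:>5}: {vs_vanilla:.2f}x vanilla, {vs_block_ptr:.2f}x block pointers")
```

For these rows the raised version is several times the vanilla figure and stays close to the hand-written block pointer version, which matches the conclusion drawn in the comment.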

victor-eds commented 1 month ago

Raising leads to expected codegen as per the comment above. This issue can be closed.