intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

MIT License

[RAISE-BP] Check `-triton-raise-block-pointer` output leads to better codegen #1431

Closed by victor-eds 1 month ago

victor-eds commented 3 months ago

As @mfrancepillois said in https://github.com/intel/intel-xpu-backend-for-triton/pull/1395#pullrequestreview-2128572018, this pass by itself won't lead to better performance. Check that the pipeline is capable of modifying the `tt.load` encoding, thus leading to better codegen. If not, create follow-up issues to fix this.

A good indicator of this would be `tt.*` operations being lowered to 2D block memory access operations.
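One rough way to look for that indicator is to count op names in an IR dump taken after lowering. The snippet below is only a sketch: the IR text is a hand-written stand-in, and the op spellings (`block_load_2d`, `block_store_2d`) are placeholders, not the names the backend actually prints. Substitute whatever appears in a real dump.

```python
def count_ops(ir_text, op_names):
    """Count how many lines of an IR dump mention each op name."""
    counts = {name: 0 for name in op_names}
    for line in ir_text.splitlines():
        for name in op_names:
            if name in line:
                counts[name] += 1
    return counts


# Hypothetical, hand-written stand-in for a lowered IR dump; real dumps look different.
SAMPLE_IR = """\
%0 = block_load_2d %a_base, %x, %y
%1 = block_load_2d %b_base, %x, %y
block_store_2d %c_base, %x, %y, %acc
"""

# Placeholder op spellings (assumptions); "tt.load" is the op named in this issue.
counts = count_ops(SAMPLE_IR, ["block_load_2d", "block_store_2d", "tt.load"])

# The indicator: 2D block accesses are present and no un-lowered tt.load is left.
uses_block_io = counts["block_load_2d"] > 0 and counts["tt.load"] == 0
print(counts, uses_block_io)
```

A text search like this only says whether the ops are present; it says nothing about whether every load and store in the kernel took that path.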

mfrancepillois commented 3 months ago

Picked up the task recently. Work in progress.

mfrancepillois commented 3 months ago

The purpose of raising unstructured load/store to block pointer is to be able to use 2D block load/store (i.e. TritonGEN::Matrix2DBlockLoadOp and TritonGEN::Matrix2DBlockStoreOp). We should therefore ensure that the raised code contains enough information to be enhanced with the dot layout and dpas encoding needed to lower load and store to 2D block load/store.
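To make the idea of raising concrete, here is a small plain-Python model (not Triton code, and not the pass's implementation) of the two addressing styles. It assumes a hypothetical row-major matrix with 128 columns and a 4x8 tile; all names and sizes are illustrative. The point is that per-element pointer arithmetic and a compact block description (origin, block shape, strides) can denote exactly the same memory, and the compact form is what a 2D block load/store needs.

```python
# Illustrative only: a plain-Python model of the two addressing styles.

def unstructured_offsets(row_ids, col_ids, stride_row, stride_col):
    """Per-element offsets, as a kernel computes them with pointer arithmetic."""
    return [[r * stride_row + c * stride_col for c in col_ids] for r in row_ids]


def block_offsets(origin, block_shape, strides):
    """The same offsets, derived from a compact block description."""
    row0, col0 = origin
    rows, cols = block_shape
    stride_row, stride_col = strides
    return [
        [(row0 + i) * stride_row + (col0 + j) * stride_col for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical 4x8 tile starting at row 16, column 32 of a 128-column row-major matrix.
scattered = unstructured_offsets(
    list(range(16, 20)), list(range(32, 40)), stride_row=128, stride_col=1
)
structured = block_offsets(origin=(16, 32), block_shape=(4, 8), strides=(128, 1))

print(scattered == structured)  # True: both forms address the same elements
```

The unstructured form carries one offset per element, while the block form carries only six integers, which is the kind of structured information the later lowering stages can work with.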

mfrancepillois commented 3 months ago

The PR needs some rework, as a unit test is not an appropriate way to test this.

mfrancepillois commented 2 months ago

We are discussing the best way to carry out this multi-layer test.

mfrancepillois commented 2 months ago

Still looking for the best way to carry out this multi-layer test.

mfrancepillois commented 2 months ago

Waiting for the pass to be able to handle real use cases, such as the 03 tutorial, so that we can set up e2e tests based on them.

mfrancepillois commented 1 month ago

The pass was used to raise the memory accesses of the 03 tutorial. The raised 03 tutorial shows performance similar to the matrix multiplication that uses user-written block pointer ops (see below). This shows that the generated code is better than the code generated for unstructured accesses, so this issue can be closed.

| M | N | K | Vanilla 03 tutorial | Raised 03 tutorial | 03 tutorial using block pointers |
|---|---|---|---|---|---|
| 256.0 | 256.0 | 256.0 | 0.672164 | 2.41052 | 2.162012 |
| 384.0 | 384.0 | 384.0 | 1.579886 | 5.617371 | 4.984428 |
| 512.0 | 512.0 | 512.0 | 2.843596 | 10.292771 | 8.971773 |
| 640.0 | 640.0 | 640.0 | 4.3749 | 12.412121 | 12.412121 |
| 768.0 | 768.0 | 768.0 | 6.347882 | 17.750189 | 17.806008 |
| 896.0 | 896.0 | 896.0 | 8.563371 | 23.787141 | 23.538061 |
| 1024.0 | 1024.0 | 1024.0 | 6.939903 | 25.565282 | 23.403267 |
| 1152.0 | 1152.0 | 1152.0 | 8.67861 | 27.939032 | 27.068409 |
| 1280.0 | 1280.0 | 1280.0 | 10.68667 | 36.92169 | 37.130877 |
| 1408.0 | 1408.0 | 1408.0 | 7.919057 | 29.518923 | 27.991469 |
| 1536.0 | 1536.0 | 1536.0 | 9.409739 | 32.44877 | 30.058715 |
| 1664.0 | 1664.0 | 1664.0 | 9.6584 | 37.049237 | 33.898196 |
| 1792.0 | 1792.0 | 1792.0 | 11.34758 | 39.285809 | 38.099742 |
| 1920.0 | 1920.0 | 1920.0 | 9.523531 | 44.526221 | 43.105284 |
| 2048.0 | 2048.0 | 2048.0 | 10.333383 | 48.541672 | 47.395359 |
| 2176.0 | 2176.0 | 2176.0 | 9.813795 | 43.717362 | 42.639082 |
| 2304.0 | 2304.0 | 2304.0 | 10.866613 | 41.759733 | 41.252666 |
| 2432.0 | 2432.0 | 2432.0 | 10.993187 | 49.369735 | 47.807647 |
| 2560.0 | 2560.0 | 2560.0 | 12.245428 | 53.690527 | 52.389509 |
| 2688.0 | 2688.0 | 2688.0 | 13.339097 | 50.804973 | 51.131333 |
| 2816.0 | 2816.0 | 2816.0 | 11.991706 | 56.43569 | 54.221239 |
| 2944.0 | 2944.0 | 2944.0 | 11.092583 | 57.447874 | 56.22752 |
| 3072.0 | 3072.0 | 3072.0 | 11.976993 | 57.059969 | 56.680673 |
| 3200.0 | 3200.0 | 3200.0 | 13.018879 | 59.199304 | 58.322654 |
| 3328.0 | 3328.0 | 3328.0 | 12.437926 | 60.872544 | 58.727205 |
| 3456.0 | 3456.0 | 3456.0 | 11.729972 | 53.281498 | 53.513591 |
| 3584.0 | 3584.0 | 3584.0 | 12.463366 | 54.910163 | 54.21438 |
| 3712.0 | 3712.0 | 3712.0 | 13.126576 | 60.04912 | 58.784756 |
| 3840.0 | 3840.0 | 3840.0 | 12.939939 | 61.161273 | 59.918629 |
| 3968.0 | 3968.0 | 3968.0 | 12.4729 | 61.000056 | 60.396207 |
| 4096.0 | 4096.0 | 4096.0 | 12.986129 | 62.476793 | 61.678287 |
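For a quick read of the table, the ratios between columns can be computed directly. The sketch below uses four rows copied from the table above. The thread does not state the metric's units, so the code assumes only that higher is better, as the conclusion drawn from the table implies.

```python
# Subset of rows copied from the table above: (M = N = K, vanilla, raised, block pointers).
ROWS = [
    (256, 0.672164, 2.41052, 2.162012),
    (1024, 6.939903, 25.565282, 23.403267),
    (2048, 10.333383, 48.541672, 47.395359),
    (4096, 12.986129, 62.476793, 61.678287),
]


def ratios(rows):
    """Return (size, raised / vanilla, raised / block-pointer) for each row."""
    return [
        (size, raised / vanilla, raised / block_ptr)
        for size, vanilla, raised, block_ptr in rows
    ]


summary = ratios(ROWS)
for size, vs_vanilla, vs_block_ptr in summary:
    print(f"{size:>5}: {vs_vanilla:.2f}x vs vanilla, {vs_block_ptr:.2f}x vs block pointers")
```

On these four rows the raised version is more than 3.5 times the vanilla figure, and within roughly 12% of the hand-written block pointer version.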

victor-eds commented 1 month ago

Raising leads to expected codegen as per the comment above. This issue can be closed.