intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[Performance] Enhance the Triton GEMM/Flash attention kernel performance for the default Triton passes pipeline #878

Closed chengjunlu closed 2 weeks ago

chengjunlu commented 5 months ago

This issue tracks the tasks to improve the performance of the Triton GEMM/flash attention kernels with the default Triton passes pipeline, so that they reach a reasonable level (~80% of the XeTLA kernels).

There are many variant implementations of the Triton kernels for the same underlying GEMM/flash attention algorithms. There are two basic reasons we need to enhance kernel performance with the default Triton passes pipeline:

  1. Users may implement the kernel with their preferred Triton syntax and style, which may not fit our first-class solution. (Notably, the Triton community is deprecating the block pointer.)
  2. Some Triton ops required by the kernel, such as atomic ops, do not support block pointers (e.g. K-dim parallel-reduction GEMM, FlashAttention V3 with K-dim parallel online softmax).

It is important to support those long-tail Triton kernel variants with reasonable performance; a minimal sketch of such a kernel is included below.
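For reference, here is a minimal sketch, not taken from the issue, of the kind of long-tail kernel meant in point 2: a split-K GEMM that addresses its tensors with plain tensors of pointers and combines per-slice partial results with `tl.atomic_add`, which rules out a block-pointer store for C. All names and tile sizes are illustrative, and it assumes M, N, and K divide evenly by the tile sizes (K by `BLOCK_K * SPLIT_K`), C is float32, and C is zero-initialized before launch.

```python
import triton
import triton.language as tl


@triton.jit
def splitk_gemm_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    SPLIT_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    pid_k = tl.program_id(2)  # each program reduces one slice of the K dimension

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)

    # Tensor-of-pointers addressing: explicit 2D grids of addresses, no tl.make_block_ptr.
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K * SPLIT_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        # Advance along K past the chunks owned by the other SPLIT_K programs.
        a_ptrs += BLOCK_K * SPLIT_K * stride_ak
        b_ptrs += BLOCK_K * SPLIT_K * stride_bk

    # Partial results from different K slices are combined with an atomic op,
    # so the C tile cannot be written through a block pointer.
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.atomic_add(c_ptrs, acc)
```

Such a kernel would be launched on a 3D grid of roughly `(cdiv(M, BLOCK_M), cdiv(N, BLOCK_N), SPLIT_K)` programs, so the K-dim reduction happens across concurrent programs rather than inside one program.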

chengjunlu commented 5 months ago

The tasks to support this, based on the current insights for PVC:

I will create sub-issues to track the progress of each task individually.

tdeng5 commented 5 months ago

@chengjunlu, there is a similar issue #773 for improving GEMM and flash attention performance. Could you please provide some examples that #773 cannot handle and that would go down this path?

chengjunlu commented 5 months ago

> @chengjunlu, there is a similar issue #773 for improving GEMM and flash attention performance. Could you please provide some examples that #773 cannot handle and that would go down this path?

Based on the discussion and plan, #773 focuses on Triton kernels that use the block pointer. This issue tracks the long-tail variant Triton kernels that do not use block pointers.

chengjunlu commented 5 months ago

As the trend in the Triton community is to deprecate the block pointer, we may need to raise the priority of the ToP (tensor of pointers) task: https://github.com/intel/intel-xpu-backend-for-triton/issues/880
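To make the distinction concrete, here is a minimal illustration, not taken from the issue, of the two addressing styles for loading one 2D tile; the helper names and parameters are hypothetical device functions that would be called from a kernel. The block-pointer form is what the existing block-pointer path optimizes; the tensor-of-pointers form is what the ToP task needs to bring to comparable performance.

```python
import triton
import triton.language as tl


@triton.jit
def load_tile_block_ptr(x_ptr, M, N, stride_m, stride_n,
                        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Block-pointer style: shape, strides, and offsets travel with the pointer itself.
    blk = tl.make_block_ptr(base=x_ptr, shape=(M, N), strides=(stride_m, stride_n),
                            offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    return tl.load(blk, boundary_check=(0, 1))


@triton.jit
def load_tile_tensor_of_ptrs(x_ptr, M, N, stride_m, stride_n,
                             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Tensor-of-pointers style: an explicit 2D grid of addresses plus a mask.
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    ptrs = x_ptr + offs_m[:, None] * stride_m + offs_n[None, :] * stride_n
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    return tl.load(ptrs, mask=mask, other=0.0)
```

In the block-pointer form the layout information is explicit in the IR; in the tensor-of-pointers form the backend has to recover the same information by analyzing the address and mask computation before it can use block loads.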

chengjunlu commented 5 months ago

Flash attention optimization notes: most of the passes are aligned with the GEMM optimization. Here are some notes specific to the flash attention optimization:

%28 = tt.load %27 : tensor<128x16x!tt.ptr<f16>, #blocked> loc(#loc18)
%29 = tt.splat %16 : f32 -> tensor<128x16xf32, #blocked> loc(#loc19)
%30 = arith.extf %28 : tensor<128x16xf16, #blocked> to tensor<128x16xf32, #blocked> loc(#loc19)
%31 = arith.mulf %30, %29 : tensor<128x16xf32, #blocked> loc(#loc19)
%32 = arith.truncf %31 : tensor<128x16xf32, #blocked> to tensor<128x16xf16, #blocked> loc(#loc20)
%33:5 = scf.for %arg22 = %c0_i32 to %arg20 step %c64_i32 iter_args(%arg23 = %cst_3, %arg24 = %cst_2, %arg25 = %cst_1, %arg26 = %6, %arg27 = %8) -> (tensor<128x16xf32, #mma>, tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #mma}>>, tensor<128xf32, #triton_gpu.slice<{dim = 1, parent = #mma}>>, !tt.ptr<tensor<16x64xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #mma}>>>, !tt.ptr<tensor<64x16xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #mma}>>>)  : i32 {
      %49 = tt.load %arg26 : !tt.ptr<tensor<16x64xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #mma}>>> loc(#loc22)
      %50 = tt.load %arg27 : !tt.ptr<tensor<64x16xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #mma}>>> loc(#loc23)
      %51 = triton_gpu.convert_layout %32 : tensor<128x16xf16, #blocked> -> tensor<128x16xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #mma}>> loc(#loc20)
      %52 = tt.dot %51, %49, %cst, inputPrecision = tf32 : tensor<128x16xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #mma}>> * tensor<16x64xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #mma}>> -> tensor<128x64xf32, #mma> loc(#loc24)
#mma = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 8, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [4, 1], A = [8, 16], B = [16, 8], C = [8, 8]}>
...
%65 = triton_gpu.convert_layout %64 : tensor<128x64xf16, #mma> -> tensor<128x64xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #mma}>> loc(#loc36)
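For context, the IR above corresponds roughly to an inner loop like the hypothetical Triton-level sketch below. The names, tile sizes, and online-softmax bookkeeping are illustrative rather than copied from the benchmarked kernel; the sketch only shows where the two tt.dot ops and the convert_layout ops come from.

```python
import triton
import triton.language as tl


@triton.jit
def attn_inner_loop(q, k_block_ptr, v_block_ptr, sm_scale, N_CTX,
                    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_D: tl.constexpr):
    # Q is scaled once before the loop (the extf/mulf/truncf sequence before scf.for).
    q = (q.to(tl.float32) * sm_scale).to(tl.float16)

    m_i = tl.full((BLOCK_M,), float("-inf"), dtype=tl.float32)  # running row max
    l_i = tl.zeros((BLOCK_M,), dtype=tl.float32)                # running row sum
    acc = tl.zeros((BLOCK_M, BLOCK_D), dtype=tl.float32)        # output accumulator

    for _ in range(0, N_CTX, BLOCK_N):
        k = tl.load(k_block_ptr)          # first tt.load in the loop body
        v = tl.load(v_block_ptr)          # second tt.load in the loop body
        qk = tl.dot(q, k)                 # first tt.dot, result in #mma layout

        m_new = tl.maximum(m_i, tl.max(qk, 1))
        p = tl.exp(qk - m_new[:, None])   # online-softmax numerator
        alpha = tl.exp(m_i - m_new)
        l_i = l_i * alpha + tl.sum(p, 1)
        acc = acc * alpha[:, None]

        # Feeding p into the second dot is where the backend converts the #mma
        # result layout into the dot_op (opIdx = 0) operand layout.
        acc += tl.dot(p.to(tl.float16), v)  # second tt.dot

        m_i = m_new
        k_block_ptr = tl.advance(k_block_ptr, (0, BLOCK_N))
        v_block_ptr = tl.advance(v_block_ptr, (BLOCK_N, 0))

    return acc / l_i[:, None]
```

In the IR this shows up as the triton_gpu.convert_layout of the first dot's #mma result into the opIdx = 0 dot_op layout (e.g. %65 above), which is one of the layout conversions the flash-attention-specific work needs to handle efficiently.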
etiotto commented 5 months ago

Helping with refactoring the PR.

chengjunlu commented 3 months ago

Updating the status of the flash attention performance on the fallback path tracked in https://github.com/intel/intel-xpu-backend-for-triton/issues/878.

Flash attention optimization notes: most of the passes are aligned with the GEMM optimization. Here are some notes specific to the flash attention optimization:

Already finished:

Things that need to be finished:

chengjunlu commented 3 months ago

Updating the status of the GEMM performance on the fallback path tracked in issue #878.

Already finished:

Things that need to be finished:

chengjunlu commented 3 weeks ago

@etiotto I'd like to close this issue as all the tasks have been finished.

The new changes and tasks are tracked in the new issue: https://github.com/intel/intel-xpu-backend-for-triton/issues/2177

Any concerns about closing this issue as finished?

etiotto commented 2 weeks ago

#950 is still open, but I think we can track that one separately in #2177. So yes.