
[GPUCodegen] Characterize performance for dynamic fused self attention #18931

Open manupak opened 1 week ago

manupak commented 1 week ago

We would like to characterize the performance of fused self-attention with dynamic parameters:

- M (dynamic lengths): 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384
- M tile sizes: 16, 32, 64, 128
- K1 values: 64, 128
- K2 values: equal to M, since this is self-attention
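
For concreteness, a minimal Python sketch of the full grid this implies (the names are mine, not from any IREE tooling):

```python
# Sketch of the sweep grid described above; K2 is tied to M for self-attention.
from itertools import product

M_VALUES     = [1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384]
M_TILE_SIZES = [16, 32, 64, 128]
K1_VALUES    = [64, 128]

configs = [
    {"M": m, "M_tile": tile, "K1": k1, "K2": m}  # K2 == M
    for m, tile, k1 in product(M_VALUES, M_TILE_SIZES, K1_VALUES)
]
print(len(configs), "configurations")  # 9 * 4 * 2 = 72
```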

manupak commented 1 week ago

@MaheshRavishankar @Groverkss I've summarized what needs to be analyzed here. Let me know if this should be something different.

I think for LLMs we only care about self-attention -- thus K2 = M.

Groverkss commented 1 week ago

For K1/N you can use 64/128. You can probably ignore 256.

manupak commented 6 days ago

Do we want this done for MI300X in CPX?

Groverkss commented 6 days ago

> Do we want this done for MI300X in CPX?

SPX/CPX either on MI300X should be fine.

manupak commented 6 days ago

I'll start with CPX then...

MaheshRavishankar commented 6 days ago

Either should be fine, but I think we have SPX available more easily. The trends should be the same.

manupak commented 6 days ago

```mlir
!dtype = f16
!Q     = tensor<1x?x64xf16>
!K     = tensor<1x?x64xf16>
!V     = tensor<1x?x64xf16>
!O     = tensor<1x?x64xf16>

#tuning = #iree_codegen.compilation_info<
  lowering_config = #iree_gpu.lowering_config<{
    workgroup = [1, 64, 0, 0, 0],
    reduction = [0, 0, 0, 0, 32]
  }>,
  translation_info = #iree_codegen.translation_info<
    LLVMGPUVectorDistribute
    workgroup_size = [64, 4] subgroup_size = 64,
    {mma_schedule = #iree_gpu.mma_schedule<
       intrinsic = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>,
       subgroup_m_count = 4, subgroup_n_count = 1>,
     llvm_func_attrs = {"amdgpu-waves-per-eu" = "2", "denormal-fp-math-f32" = "preserve-sign"}}>>

#Q = affine_map<(b, m, n, k1, k2) -> (b, m, k1)>
#K = affine_map<(b, m, n, k1, k2) -> (b, k2, k1)>
#V = affine_map<(b, m, n, k1, k2) -> (b, k2, n)>
#S = affine_map<(b, m, n, k1, k2) -> ()>
#O = affine_map<(b, m, n, k1, k2) -> (b, m, n)>

func.func @main(%Q : !Q, %K : !K, %V : !V) -> !O {
  %scale = arith.constant 1.0 : !dtype
  %c1 = arith.constant 1 : index
  // Dynamic sequence length (dim 1 of Q) feeds the output allocation.
  %size1 = tensor.dim %Q, %c1 : !Q
  %empty = tensor.empty(%size1) : !O
  %O = iree_linalg_ext.attention 
       { indexing_maps = [#Q, #K, #V, #S, #O]
         ,compilation_info = #tuning
       }
       ins(%Q, %K, %V, %scale : !Q, !K, !V, !dtype) outs(%empty : !O) {
          ^bb0(%score: f32):
            iree_linalg_ext.yield %score : f32
        } -> !O
  return %O : !O
}
```

I've managed to generate IR as above, but it's not compiling as of now. I suspect I'm missing something much simpler, given that dynamic attention kernels are supported, yes?

Is there any test/example of a dynamic attention kernel in the codebase?

MaheshRavishankar commented 6 days ago

Maybe we should try static sizes for those shapes. Making the shape dynamic will not give us as clear a signal yet.
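
One low-effort way to do that, sketched below (a hypothetical helper, not an official IREE tool), is to stamp out a static-shape variant of the IR above for each (M, K1) point. K2 == M for self-attention and N == K1 here, so a single type alias covers Q, K, V, and O; the #tuning compilation_info is omitted for brevity and could be re-attached per configuration.

```python
# Minimal sketch (hypothetical helper): emit static-shape variants of the
# dynamic attention IR posted above, one .mlir file per (M, K1) point.

M_VALUES  = [1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384]
K1_VALUES = [64, 128]

TEMPLATE = """\
!dtype = f16
!QKV   = tensor<1x{m}x{k1}xf16>

#Q = affine_map<(b, m, n, k1, k2) -> (b, m, k1)>
#K = affine_map<(b, m, n, k1, k2) -> (b, k2, k1)>
#V = affine_map<(b, m, n, k1, k2) -> (b, k2, n)>
#S = affine_map<(b, m, n, k1, k2) -> ()>
#O = affine_map<(b, m, n, k1, k2) -> (b, m, n)>

func.func @main(%Q : !QKV, %K : !QKV, %V : !QKV) -> !QKV {{
  %scale = arith.constant 1.0 : !dtype
  %empty = tensor.empty() : !QKV
  %O = iree_linalg_ext.attention
       {{ indexing_maps = [#Q, #K, #V, #S, #O] }}
       ins(%Q, %K, %V, %scale : !QKV, !QKV, !QKV, !dtype)
       outs(%empty : !QKV) {{
          ^bb0(%score: f32):
            iree_linalg_ext.yield %score : f32
        }} -> !QKV
  return %O : !QKV
}}
"""

for m in M_VALUES:
    for k1 in K1_VALUES:
        path = f"attention_m{m}_k1{k1}.mlir"
        with open(path, "w") as f:
            f.write(TEMPLATE.format(m=m, k1=k1))
        print("wrote", path)
```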

manupak commented 4 days ago

@Groverkss @MaheshRavishankar,

following https://github.com/iree-org/iree/pull/18937, I think there is some misalignment.

IIUC, that PR looks at the shape of the attention op to decide whether to use fused attention or not.

Is this for decode attention? If so, I think the sweep should be changed to:

- M = 16, 32, 64, 128 (a lower range)
- K2 = 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384
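
i.e., mirroring the sketch earlier in the thread, the decode grid would be:

```python
# Decode-attention sweep (sketch): short query lengths M, long KV lengths K2.
M_VALUES  = [16, 32, 64, 128]
K2_VALUES = [1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384]

configs = [{"M": m, "K2": k2} for m in M_VALUES for k2 in K2_VALUES]
print(len(configs), "configurations")  # 4 * 9 = 36
```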

manupak commented 3 days ago

Here is the data:

https://amdcloud-my.sharepoint.com/:x:/g/personal/mkarunar_amd_com/EUn8SRgVutlAtODEUMPKYW0BEM56zrKPlPYyrxKCce8DvQ?e=1EFf0a&nav=MTVfezBCNjA4MzJCLTM1MDEtNEUzOS1BRDM3LTYxMDdCRDI1NTVCRX0

ScottTodd commented 3 days ago

^ Please use public links when working in open source. Google Drive files and GitHub gists are both good options. Uploading zip files to issues is also an option, but binary files can be security risks.

manupak commented 3 days ago

iree_attention_decompose.xlsx