manupak opened 1 week ago
@MaheshRavishankar @Groverkss I've summarized the info about what needs to be analyzed here. Let me know if this should be something different.
I think for LLMs we only care about self-attention -- thus K2 = M.
For K1/N you can use 64/128. You can probably ignore 256.
Do we want this done for MI300X in CPX?
Either SPX or CPX on MI300X should be fine.
I'll start with CPX then...
Either should be fine, but I think we have SPX more easily available. The trends should be the same.
!dtype = f16
!Q = tensor<1x?x64xf16>
!K = tensor<1x?x64xf16>
!V = tensor<1x?x64xf16>
!O = tensor<1x?x64xf16>
#tuning = #iree_codegen.compilation_info<lowering_config = #iree_gpu.lowering_config<{ workgroup = [1, 64, 0, 0, 0], reduction = [0, 0, 0, 0, 32] }>, translation_info = #iree_codegen.translation_info<LLVMGPUVectorDistribute workgroup_size = [64, 4] subgroup_size = 64, {mma_schedule = #iree_gpu.mma_schedule<intrinsic = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>, subgroup_m_count = 4, subgroup_n_count = 1>, llvm_func_attrs = { "amdgpu-waves-per-eu" = "2", "denormal-fp-math-f32" = "preserve-sign" }}>>
#Q = affine_map<(b, m, n, k1, k2) -> (b, m, k1)>
#K = affine_map<(b, m, n, k1, k2) -> (b, k2, k1)>
#V = affine_map<(b, m, n, k1, k2) -> (b, k2, n)>
#S = affine_map<(b, m, n, k1, k2) -> ()>
#O = affine_map<(b, m, n, k1, k2) -> (b, m, n)>
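// Iteration space is (b, m, n, k1, k2): Q indexes (b, m, k1), K indexes (b, k2, k1),
// V indexes (b, k2, n), the scalar scale uses the empty map #S, and the output
// indexes (b, m, n). For self-attention, k2 equals m (the dynamic sequence length).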
func.func @main(%Q : !Q, %K : !K, %V : !V) -> !O {
  %scale = arith.constant 1.0 : !dtype
  %c1 = arith.constant 1 : index
  %size1 = tensor.dim %Q, %c1 : !Q
  %empty = tensor.empty(%size1) : !O
  %O = iree_linalg_ext.attention
         { indexing_maps = [#Q, #K, #V, #S, #O],
           compilation_info = #tuning }
         ins(%Q, %K, %V, %scale : !Q, !K, !V, !dtype)
         outs(%empty : !O) {
    ^bb0(%score: f32):
      iree_linalg_ext.yield %score : f32
  } -> !O
  return %O : !O
}
I've managed to generate IR as above, but it's not compiling as of now. I'm suspecting I'm missing something much simpler, given that dynamic attention kernels are supported, yes?
Is there any test/example of a dynamic attention kernel in the codebase?
Maybe we should try static sizes for those shapes. Making the shape dynamic will not give us as clear a signal yet.
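For example, a static-shape variant of the IR above might look like the following. This is only a sketch: I picked M = K2 = 4096 with the same K1 = N = 64, reused the #Q/#K/#V/#S/#O maps, !dtype, and the #tuning attribute from the snippet above, and the !Q_static/@main_static names are just placeholders; the tuning parameters may well need revisiting per shape.
!Q_static = tensor<1x4096x64xf16>
!K_static = tensor<1x4096x64xf16>
!V_static = tensor<1x4096x64xf16>
!O_static = tensor<1x4096x64xf16>
func.func @main_static(%Q : !Q_static, %K : !K_static, %V : !V_static) -> !O_static {
  %scale = arith.constant 1.0 : !dtype
  // With fully static shapes there is no tensor.dim / dynamic tensor.empty needed.
  %empty = tensor.empty() : !O_static
  %O = iree_linalg_ext.attention
         { indexing_maps = [#Q, #K, #V, #S, #O],
           compilation_info = #tuning }
         ins(%Q, %K, %V, %scale : !Q_static, !K_static, !V_static, !dtype)
         outs(%empty : !O_static) {
    ^bb0(%score: f32):
      iree_linalg_ext.yield %score : f32
  } -> !O_static
  return %O : !O_static
}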
@Groverkss @MaheshRavishankar,
Following https://github.com/iree-org/iree/pull/18937, I think there is some misalignment.
IIUC the PR, it is looking at the shape of the attention to decide whether to use fused attention or not.
Is this for decode attention? If so, I think this should be changed to a lower range:
M = 16, 32, 64, 128
K2 = 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384
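As a concrete illustration of the decode regime (a sketch only, picking M = 16 and K2 = 8192 from the ranges above, with K1 = N = 64; the !*_decode names are just placeholders), the operand types would look like:
!Q_decode = tensor<1x16x64xf16>    // (B, M,  K1): short query length
!K_decode = tensor<1x8192x64xf16>  // (B, K2, K1): long key/value context
!V_decode = tensor<1x8192x64xf16>  // (B, K2, N)
!O_decode = tensor<1x16x64xf16>    // (B, M,  N)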
^ Please use public links when working in open source. Google Drive files and GitHub gists are both good options. Uploading zip files to issues is also an option, but binary files can be security risks.
We would like to know the performance characteristics of fused self-attention with dynamic parameters.
M dynamic lengths: 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384
M tile sizes: 16, 32, 64 and 128
K1 values to be used: 64, 128
K2 values: equal to M, so that it is self-attention
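If I'm reading the tuning attribute from the IR above correctly, the tile sizes follow the (b, m, n, k1, k2) iteration order, so the M tile size is the second entry of workgroup (64 in the example) and the K2 reduction tile is the last entry of reduction. Sweeping the M tile sizes listed above would then mean varying that entry, e.g. (a sketch only, with a hypothetical #lowering_m128 name; subgroup counts, workgroup size, and waves-per-eu would likely need retuning per tile size):
// Hypothetical M-tile = 128 variant of the lowering_config; the rest of #tuning stays as above.
#lowering_m128 = #iree_gpu.lowering_config<{ workgroup = [1, 128, 0, 0, 0], reduction = [0, 0, 0, 0, 32] }>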