[AMDAIEVectorization] Do not coalesce the 3 loops

Coalescing the loops results in the single index being decomposed as

 %43:3 = affine.delinearize_index %arg4 into (%c8, %c8, %c4) : index, index, index

with %43#0, %43#1, and %43#2 used as offsets in the vector.transfer_read ops.

This affine op is eventually expanded by lower-affine into about 20 arith ops, which later lower to LLVM scalar ops. The full compute block is included below.
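To make the cost concrete, here is a rough Python sketch of what the expanded delinearization computes for the basis (8, 8, 4). It mirrors the compare/sub/select/sdiv/srem pattern visible in the dump. The helper names are hypothetical, and the constant values (%18 = 32, %9 = 4, %17 = -1, %7 = 0) are my reading of the dump, not something it shows directly.

```python
import math

def trunc_div(a, b):
    """Signed division rounding toward zero (the behaviour of llvm.sdiv)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def trunc_rem(a, b):
    """Signed remainder whose sign follows the dividend (llvm.srem)."""
    return a - b * trunc_div(a, b)

def floor_div_pos(x, d):
    """Floor division for d > 0 built from icmp + sub + select + sdiv."""
    neg = x < 0                      # llvm.icmp "slt" x, 0
    t = -1 - x if neg else x         # llvm.sub -1, x ; llvm.select
    q = trunc_div(t, d)              # llvm.sdiv
    return -1 - q if neg else q      # llvm.sub -1, q ; llvm.select

def mod_pos(x, d):
    """Non-negative remainder for d > 0 built from srem + icmp + add + select."""
    r = trunc_rem(x, d)              # llvm.srem
    return r + d if r < 0 else r     # llvm.icmp ; llvm.add ; llvm.select

def delinearize(linear, basis):
    """Split a linear index into one index per basis dimension (row-major)."""
    indices = []
    rem = linear
    for i in range(len(basis)):
        stride = math.prod(basis[i + 1:])   # 32, 4, 1 for basis (8, 8, 4)
        indices.append(floor_div_pos(rem, stride))
        rem = mod_pos(rem, stride)
    return tuple(indices)

if __name__ == "__main__":
    # 37 = 1*32 + 1*4 + 1
    print(delinearize(37, (8, 8, 4)))
```

With three separate loops, the three induction variables are already the offsets, so none of this scalar arithmetic is needed inside the compute block.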

This is the main difference I can see that the vectorization change in https://github.com/nod-ai/iree-amd-aie/pull/856 introduces.

With this PR, kernel execution time for m=n=512, k=4096 drops from 52 [ms] to 38 [ms] on ToM. This is good, but there is still another regression hidden somewhere (I think we should be at or below 20 [ms]).

      %332 = llvm.icmp "slt" %330, %7 : i64
      %333 = llvm.sub %17, %330 : i64
      %334 = llvm.select %332, %333, %330 : i1, i64
      %335 = llvm.sdiv %334, %18  : i64
      %336 = llvm.sub %17, %335 : i64
      %337 = llvm.select %332, %336, %335 : i1, i64
      %338 = llvm.srem %330, %18  : i64
      %339 = llvm.icmp "slt" %338, %7 : i64
      %340 = llvm.add %338, %18 : i64
      %341 = llvm.select %339, %340, %338 : i1, i64
      %342 = llvm.icmp "slt" %341, %7 : i64
      %343 = llvm.sub %17, %341 : i64
      %344 = llvm.select %342, %343, %341 : i1, i64
      %345 = llvm.sdiv %344, %9  : i64
      %346 = llvm.sub %17, %345 : i64
      %347 = llvm.select %342, %346, %345 : i1, i64
      %348 = llvm.srem %330, %9  : i64
      %349 = llvm.icmp "slt" %348, %7 : i64
      %350 = llvm.add %348, %9 : i64
      %351 = llvm.select %349, %350, %348 : i1, i64
      %352 = llvm.mul %351, %3 : i64
      %353 = llvm.mul %337, %18 : i64
      %354 = llvm.add %352, %353 : i64
      %355 = llvm.extractvalue %30[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> 
      %356 = llvm.getelementptr %355[%354] : (!llvm.ptr, i64) -> !llvm.ptr, bf16
      %357 = llvm.load %356 : !llvm.ptr -> vector<32xbf16>
      %358 = llvm.mul %347, %16 : i64
      %359 = llvm.mul %351, %18 : i64
      %360 = llvm.add %358, %359 : i64
      %361 = llvm.extractvalue %31[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> 
      %362 = llvm.getelementptr %361[%360] : (!llvm.ptr, i64) -> !llvm.ptr, bf16
      %363 = llvm.load %362 : !llvm.ptr -> vector<32xbf16>
      %364 = llvm.mul %337, %5 : i64
      %365 = llvm.add %358, %364 : i64
      %366 = llvm.extractvalue %32[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> 
      %367 = llvm.getelementptr %366[%365] : (!llvm.ptr, i64) -> !llvm.ptr, f32
      %368 = llvm.load %367 {alignment = 4 : i64} : !llvm.ptr -> vector<16xf32>
      %369 = llvm.bitcast %368 : vector<16xf32> to vector<8xi64>
      %370 = "xllvm.intr.aie2.bf.mac16.conf"(%357, %363, %369, %15) : (vector<32xbf16>, vector<32xbf16>, vector<8xi64>, i32) -> vector<8xi64>
      %371 = llvm.bitcast %370 : vector<8xi64> to vector<16xf32>

[AMDAIEVectorization] Do not coalesce the 3 loops #894