Coalescing the loops results in the single index being decomposed as:

```mlir
%43:3 = affine.delinearize_index %arg4 into (%c8, %c8, %c4) : index, index, index
```
Here %43#0, %43#1, and %43#2 are used as offsets in the vector.transfer_read ops. This affine op is eventually expanded by the lower-affine pass into about 20 arith ops, which later lower to LLVM scalar ops; the full compute block is included below. This is the main difference I can see introduced by the change to vectorization in https://github.com/nod-ai/iree-amd-aie/pull/856.
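For reference, here is a minimal sketch (my reconstruction, not the actual pass output) of the index arithmetic that the delinearize op with basis (8, 8, 4) encodes. The real lower-affine expansion is larger because floordiv/mod on signed index values need extra arith.cmpi/arith.select ops to round toward negative infinity, which is where the roughly 20 ops come from:

```mlir
// Sketch only: equivalent index math for
//   %43:3 = affine.delinearize_index %arg4 into (%c8, %c8, %c4)
// assuming %arg4 is known non-negative, so plain divsi/remsi suffice.
%c4 = arith.constant 4 : index
%c8 = arith.constant 8 : index
%c32 = arith.constant 32 : index      // 8 * 4, stride of the outermost index
%i = arith.divsi %arg4, %c32 : index  // %43#0 = %arg4 floordiv 32
%t = arith.divsi %arg4, %c4 : index
%j = arith.remsi %t, %c8 : index      // %43#1 = (%arg4 floordiv 4) mod 8
%k = arith.remsi %arg4, %c4 : index   // %43#2 = %arg4 mod 4
```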
With this PR, the m=n=512, k=4096 kernel execution time on ToM drops from 52 ms to 38 ms. This is good, but there is still another regression hidden somewhere (I think we should be at or below 20 ms).