nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator

Vectorized matmul performance regression - function inlining #883

Open jtuyls opened 2 weeks ago

jtuyls commented 2 weeks ago

We're seeing a performance regression on vectorized matmul, likely caused by the following PR: https://github.com/nod-ai/iree-amd-aie/pull/856; see the table below:

Matmul problem size: 512x512x4096 (MxKxN)
Array configuration: 2x2
Vectorization or ukernel or scalar: Vectorization

| Commit  | Latency (us) |
|---------|--------------|
| 12f0502 | 48521        |
| 2086718 | 42513        |

@Abhishek-Varma

Note that another PR is causing a performance regression as well, tracked in https://github.com/nod-ai/iree-amd-aie/issues/882, which is likely orthogonal.

newling commented 2 weeks ago

Can someone please remind me -- at what granularity is the matmul outlined? Is it at the m=n=k=64 granularity or the m=n=4, k=8 granularity (assuming Phoenix bf16)?

Abhishek-Varma commented 2 weeks ago

> Can someone please remind me -- at what granularity is the matmul outlined? Is it at the m=n=k=64 granularity or the m=n=4, k=8 granularity (assuming Phoenix bf16)?

It takes place at the latter granularity (m=n=4, k=8).

Here's the outlined matmul:

func.func private @generic_matmul_outlined(%arg0: memref<1x1x4x4x4x8xbf16, 2 : i32>, %arg1: memref<1x1x4x4x8x4xbf16, 2 : i32>, %arg2: memref<1x1x4x4x4x4xf32, 2 : i32>) {
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : memref<1x1x4x4x4x8xbf16, 2 : i32>, memref<1x1x4x4x8x4xbf16, 2 : i32>) outs(%arg2 : memref<1x1x4x4x4x4xf32, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[32, 32], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [16, 16, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: bf16, %in_0: bf16, %out: f32):
    %0 = arith.extf %in : bf16 to f32
    %1 = arith.extf %in_0 : bf16 to f32
    %2 = arith.mulf %0, %1 : f32
    %3 = arith.addf %out, %2 : f32
    linalg.yield %3 : f32
  }
  return
}

Here's an e2e log (created earlier) for reference.

Abhishek-Varma commented 1 week ago

Not sure, but perhaps this might be the reason behind (and hopefully a fix for) this regression.

newling commented 1 week ago

Maybe, but it isn't surprising to me that outlining a single AIE instruction (a matmul on 4x8x4) can result in a slowdown.

Abhishek-Varma commented 1 week ago

> Maybe, but it isn't surprising to me that outlining a single AIE instruction (a matmul on 4x8x4) can result in a slowdown.

Yeah, outlining functions would definitely add some regression because of function-invocation overhead.
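
To make that concrete, here's a minimal sketch in plain C++ (not the actual AIE codegen, intrinsics, or shapes; the function names are made up) of the cost difference between an outlined and an inlined tiny kernel body:

```cpp
// Illustration of why outlining a tiny body can regress performance: the
// call/return and argument passing are paid on every iteration, which is
// significant when the body is only a handful of multiply-accumulates.
#include <cstddef>

__attribute__((noinline))
void mac_outlined(const float *a, const float *b, float *acc) {
  *acc += (*a) * (*b);  // tiny body: one multiply-accumulate
}

void reduce_outlined(const float *a, const float *b, float *acc, size_t n) {
  for (size_t i = 0; i < n; ++i)
    mac_outlined(&a[i], &b[i], acc);  // call overhead paid n times
}

void reduce_inlined(const float *a, const float *b, float *acc, size_t n) {
  for (size_t i = 0; i < n; ++i)
    *acc += a[i] * b[i];  // same work, no call overhead
}
```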

Since outlining was originally introduced to reduce the program-memory requirement, it can certainly introduce performance overhead. Perhaps the way forward for now should be "conditional" enabling of function outlining while the Peano loop-unrolling control is enabled? A rough sketch of that idea follows.
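
As a purely hypothetical sketch of what that "conditional" enabling could look like (the helper name, parameters, and threshold below are illustrative only, not existing iree-amd-aie options):

```cpp
#include <cstdint>

// Hypothetical gating helper for the proposal above; not an existing API.
// The idea: skip outlining when the Peano loop-unrolling control already
// keeps program memory in check, or when the body is too small for the
// memory savings to outweigh the per-call overhead.
bool shouldOutlineKernelBody(int64_t numOpsInBody, bool peanoUnrollControlEnabled) {
  constexpr int64_t kMinOpsWorthOutlining = 64;  // illustrative threshold
  if (peanoUnrollControlEnabled)
    return false;
  return numOpsInBody >= kMinOpsWorthOutlining;
}
```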