jtuyls opened 2 weeks ago
Can someone please remind me -- at what granularity is the matmul outlined? Is it at the m=n=k=64 granularity or the m=n=4 k=8 granularity (assuming phoenix bf16) ?
It takes place at the latter granularity.
Here's the outlined matmul:
func.func private @generic_matmul_outlined(%arg0: memref<1x1x4x4x4x8xbf16, 2 : i32>, %arg1: memref<1x1x4x4x8x4xbf16, 2 : i32>, %arg2: memref<1x1x4x4x4x4xf32, 2 : i32>) {
  linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : memref<1x1x4x4x4x8xbf16, 2 : i32>, memref<1x1x4x4x8x4xbf16, 2 : i32>) outs(%arg2 : memref<1x1x4x4x4x4xf32, 2 : i32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[32, 32], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [16, 16, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
  ^bb0(%in: bf16, %in_0: bf16, %out: f32):
    %0 = arith.extf %in : bf16 to f32
    %1 = arith.extf %in_0 : bf16 to f32
    %2 = arith.mulf %0, %1 : f32
    %3 = arith.addf %out, %2 : f32
    linalg.yield %3 : f32
  }
  return
}
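For context, here is a minimal, purely illustrative sketch (not taken from the actual IR) of the single-intrinsic granularity mentioned in the question: once the generic above is vectorized, the innermost unit is roughly a 4x8x4 bf16 contraction with f32 accumulation, which is what maps onto one AIE matmul instruction on Phoenix. The function and value names below are made up for illustration.

func.func @single_intrinsic_sketch(%lhs: vector<4x8xbf16>, %rhs: vector<8x4xbf16>, %acc: vector<4x4xf32>) -> vector<4x4xf32> {
  // One 4x8x4 multiply-accumulate: the granularity referred to as
  // "m=n=4 k=8" in the question above.
  %0 = vector.contract {
         indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                          affine_map<(d0, d1, d2) -> (d2, d1)>,
                          affine_map<(d0, d1, d2) -> (d0, d1)>],
         iterator_types = ["parallel", "parallel", "reduction"],
         kind = #vector.kind<add>}
       %lhs, %rhs, %acc : vector<4x8xbf16>, vector<8x4xbf16> into vector<4x4xf32>
  return %0 : vector<4x4xf32>
}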
Here's an e2e log (created earlier) for reference.
Not sure, but perhaps this is the reason behind this regression (and, if so, hopefully points toward a fix).
Maybe, but it isn't surprising to me that outlining a single AIE instruction (matmul on 4x8x4) can result in a slow down
Yeah, I guess outlining functions can add some regression because of the function invocation overhead.
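To make the overhead concern concrete, here is a rough, hypothetical sketch of the shape of the call site after outlining (the caller name, loop bounds, and operands are illustrative, not from the actual lowered IR): the outlined body is reached through a func.call from inside the tiling loops, so each iteration pays call overhead on the core instead of executing an inlined body.

func.func @caller_sketch(%lhs: memref<1x1x4x4x4x8xbf16, 2 : i32>, %rhs: memref<1x1x4x4x8x4xbf16, 2 : i32>, %acc: memref<1x1x4x4x4x4xf32, 2 : i32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c8 = arith.constant 8 : index
  // Illustrative reduction loop; in the real IR the operands would be
  // per-iteration subviews of the L1 buffers rather than the same memrefs.
  scf.for %k = %c0 to %c8 step %c1 {
    // Every trip through the loop now goes through a call into
    // @generic_matmul_outlined (the function shown earlier in this thread).
    func.call @generic_matmul_outlined(%lhs, %rhs, %acc) : (memref<1x1x4x4x4x8xbf16, 2 : i32>, memref<1x1x4x4x8x4xbf16, 2 : i32>, memref<1x1x4x4x4x4xf32, 2 : i32>) -> ()
  }
  return
}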
Since function outlining was initially introduced to reduce the program memory requirement, it can certainly introduce some performance overhead. Perhaps the way forward for now is to enable function outlining conditionally while the Peano loop-unrolling control is enabled?
We're seeing a performance regression on vectorized matmul, likely caused by the following PR: https://github.com/nod-ai/iree-amd-aie/pull/856, see table below:

Matmul problem size: 512x512x4096 (MxKxN)
Array configuration: 2x2
Vectorization or ukernel or scalar: Vectorization
@Abhishek-Varma
Note that there is another PR causing a performance regression (tracked in https://github.com/nod-ai/iree-amd-aie/issues/882), which is likely orthogonal.