nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Compile all matmul shapes in SD3 model for AIE #285

Closed newling closed 6 months ago

newling commented 7 months ago

Tracker task for the shapes we should support with direct codegen

2 compiler errors need resolving. The errors can be reproduced with square matmuls.

Error 1: m=n=k=2304 (=2048+256), bf16 operands, f32 output.

The failure is in airrt-to-ipu.

The error message is

// iree-compile: iree-amd-aie/third_party/mlir-air/mlir/lib/Conversion/AIRRtToIpuPass.cpp:551: void 
(anonymous namespace)::tileIllegalWrapDim(airrt::DmaMemcpyNdOp): Assertion 
`!(const_wrap % (AIE2_WRAP_UPPER_BOUND / 2)) && "Currently do not support remainder tiles"' failed.

@erwei-xilinx is aware of this: https://teams.microsoft.com/l/message/19:meeting_Zjc5ZmZhM2EtZDcxZS00NzYxLTliYmQtNGFlNzY1MDJhNjMy@thread.v2/1713812572221?context=%7B%22contextType%22%3A%22chat%22%7D
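
For intuition, here is a minimal sketch of the condition the assertion encodes. The bound value (1024) and the assumption that the checked wrap equals the matmul dimension are guesses for illustration, not read from the pass; the point is only that 2304 leaves a remainder when tiled by half the bound, whereas 2048 does not.

# Sketch of the check behind the "remainder tiles" assertion (assumptions:
# AIE2_WRAP_UPPER_BOUND == 1024 and const_wrap equal to the matmul dimension).
AIE2_WRAP_UPPER_BOUND = 1024

def has_remainder_tile(const_wrap: int) -> bool:
    # Mirrors `const_wrap % (AIE2_WRAP_UPPER_BOUND / 2)` from the assertion.
    return const_wrap % (AIE2_WRAP_UPPER_BOUND // 2) != 0

for wrap in (2048, 2304):
    print(wrap, "-> remainder tile" if has_remainder_tile(wrap) else "-> ok")
# 2048 -> ok, 2304 -> remainder tile (2304 % 512 == 256)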

Error 2: m=n=k=2432 (=2048+256+128), bf16 operands, f32 output.

Failure in air-label-scf-for-to-ping-pong

The error message is quite generic and needs further investigation:

error: block with no terminator, has %40 = "air.wait_all"(%arg8) : (!air.async.token) -> !air.async.token
note: see current operation: %40 = "air.wait_all"(%arg8) : (!air.async.token) -> !air.async.token

Example input IR:

!lhs = tensor<2432x2432xbf16>
!rhs = tensor<2432x2432xbf16>
!out = tensor<2432x2432xf32>
func.func @matmul_32x32_32xf32_(%lhs : !lhs, %rhs : !rhs) -> !out {
  %init_acc = tensor.empty() : !out
  %c0_acc_type = arith.constant 0.0 : f32
  %acc = linalg.fill ins(%c0_acc_type : f32) outs(%init_acc : !out) -> !out
  %result = linalg.matmul ins(%lhs, %rhs: !lhs, !rhs) outs(%acc: !out) -> !out
  return %result: !out
}
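
Not part of the issue, just a convenience: since the same IR is swept over several shapes further down, here is a small Python helper that emits this reproducer for an arbitrary square size. The function name and the templating are mine.

def square_matmul_ir(size: int, op_type: str = "bf16", acc_type: str = "f32") -> str:
    # Emits the square-matmul reproducer IR above for the given size.
    return f"""!lhs = tensor<{size}x{size}x{op_type}>
!rhs = tensor<{size}x{size}x{op_type}>
!out = tensor<{size}x{size}x{acc_type}>
func.func @matmul_{size}(%lhs : !lhs, %rhs : !rhs) -> !out {{
  %init_acc = tensor.empty() : !out
  %c0_acc_type = arith.constant 0.0 : {acc_type}
  %acc = linalg.fill ins(%c0_acc_type : {acc_type}) outs(%init_acc : !out) -> !out
  %result = linalg.matmul ins(%lhs, %rhs: !lhs, !rhs) outs(%acc: !out) -> !out
  return %result: !out
}}
"""

print(square_matmul_ir(2432))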
newling commented 7 months ago

Investigation of Error 2:

m=n=k=2304 does not fail compilation as early as m=n=k=2432, so I compared the IR generated by the passes for these two cases, to see where the 2432 case goes off the rails.

I see the first big difference in the pass AIRSpecializeDmaBroadcast, where 2304 (the "good" case) generates a bunch of

 %11 = affine.if #set()[%arg12, %arg13] -> !air.async.token {
  ... 
 }

blocks. It seems suspicious to me that there are no equivalent blocks for the 2432 case, given that 2432 divides by 2 fewer times than 2304 does.

The next pass, DmaToChannel, sees the 2304 IR explode to 707 lines, while the "bad" 2432 case sees the IR at only 315 lines. This again is suspicious: why is the IR for the less divisible tensor size, with more elements, smaller?
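
For reference, the divisibility difference between the two sizes is plain arithmetic (no assumption about which tile sizes the passes actually pick): 2304 = 2^8 x 9 while 2432 = 2^7 x 19.

def factors_of_two(n: int) -> int:
    # Counts how many times 2 divides n.
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

for size in (2304, 2432):
    tile_divisors = [t for t in (32, 64, 128, 256, 512) if size % t == 0]
    print(size, "factors of 2:", factors_of_two(size), "divisible by:", tile_divisors)
# 2304: 8 factors of 2, divisible by 32/64/128/256
# 2432: 7 factors of 2, divisible by 32/64/128 only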

Update: 'AMDAIEPackToDma' is not converting some packs to dmas. In the (bad) 2432 case these aren't converted:

 iree_linalg_ext.pack %subview_5 padding_value(%cst : bf16) outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_6 : (memref<38x8xbf16, strided<[2432, 1], offset: ?>, 1 : i32> memref<1x10x4x8xbf16, 2 : i32>)

 iree_linalg_ext.pack %subview_7 padding_value(%cst : bf16) outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [8, 4] into %alloc_8 : (memref<8x38xbf16, strided<[152, 1], offset: ?>, 1 : i32> memref<10x1x8x4xbf16, 2 : i32>)

while in the 2304 (good) case these packs are turned into dma copies:

 iree_linalg_ext.pack %subview_5 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_6 : (memref<64x64xbf16, strided<[2304, 1], offset: ?>, 1 : i32> memref<8x16x4x8xbf16, 2 : i32>)

iree_linalg_ext.pack %subview_7 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [8, 4] into %alloc_8 : (memref<64x64xbf16, strided<[256, 1], offset: ?>, 1 : i32> memref<16x8x8x4xbf16, 2 : i32>)
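
One visible difference between the two sets of packs: in the 2432 case the source dims are not multiples of the inner tiles, so the packs carry a padding_value and produce a partially filled tile, whereas in the 2304 case everything divides evenly. A sketch of that arithmetic, with the shapes read off the snippets above (whether the padding is what blocks AMDAIEPackToDma is my reading, not confirmed):

import math

# (source shape, inner_tiles) pairs taken from the pack ops above; outer_dims_perm
# is ignored here since it only reorders the outer dimensions.
cases = [
    ("2432, %subview_5", (38, 8), (4, 8)),
    ("2432, %subview_7", (8, 38), (8, 4)),
    ("2304, %subview_5", (64, 64), (4, 8)),
    ("2304, %subview_7", (64, 64), (8, 4)),
]
for name, src, tiles in cases:
    outer = [math.ceil(d / t) for d, t in zip(src, tiles)]
    needs_padding = any(d % t for d, t in zip(src, tiles))
    print(f"{name}: outer dims {outer}, needs padding: {needs_padding}")
# 38 is not a multiple of 4 (38 = 9*4 + 2), so both 2432 packs need padding;
# the 64x64 packs in the 2304 case tile evenly and need none.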
erwei-xilinx commented 7 months ago

I created a PR to MLIR-AIR which fixes error 1. I was able to get error 1's GEMM shape to pass locally (although using an older XRT built in Jan. 2024). However, when I try to land this in IREE-AMD-AIE, it fails in CI with an XRT issue:

[XRT] ERROR: Failed to allocate host memory buffer (mmap(len=10616832, prot=3, flags=8193, offset=4294967296) failed (err=11): Resource temporarily unavailable), make sure host bank is enabled (see xbutil configure --host-mem)

Does anyone know what that means? Pls see this PR for details: https://github.com/nod-ai/iree-amd-aie/pull/289

newling commented 7 months ago

The current status is summarised in https://github.com/nod-ai/iree-amd-aie/pull/286. There are 3 matmuls which hang during execution. There is another reproducer for this hanging failure mode at https://github.com/newling/iree-amd-aie/tree/hang_reproducer

The table there, summarizing a set of experiments to pin down the hanging behaviour, is:

# Table summarizing some results using bfloat16 operands and float32 result.
#
# M     K     N     Result
# ========================
# 2048  2048  2048  Pass
# 2048  2048  2432  Hang -------- X
# 2048  2432  2048  Pass
# 2048  2432  2432  Hang -------- X
# 2432  2048  2048  Pass
# 2432  2048  2432  Pass
# 2432  2432  2048  Pass
# 2432  2432  2432  Pass

# Similar table with 512 instead of 2048 and 640 instead of 2432. Same behaviour, more or less:
# M     K     N     Result
# ========================
# 512   512   512   Pass
# 512   512   640   Hang -------- X
# 512   640   512   Pass
# 512   640   640   Hang -------- X
# 640   512   512   Pass
# 640   512   640   Pass
# 640   640   512   Pass
# 640   640   640   Pass
# ========================

# So the rule is: hang iff M != 640 and N == 640.

Note that 640 = 512 + 128.

# Note that the above tables are with bfloat16 operands, float32 result.
# With float32 operands and float32 result, the test passes in the above cases,
# but if N is halved in size then the test hangs again, suggesting it has to do
# with the number of bytes in the N dimension. For example,
# with M = 256, K = 256, N = 320
# we get a hang in float32.
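
If the relevant quantity really is the byte count along N, the hangs line up: 640 bf16 elements and 320 f32 elements are both 1280 bytes. A small check of that coincidence (my reading of the numbers above, not a confirmed root cause):

# Bytes along N for the cases discussed above.
BYTES_PER_ELEMENT = {"bf16": 2, "f32": 4}
cases = [
    ("bf16", 640, "hang (when M != 640)"),
    ("f32", 640, "pass"),
    ("f32", 320, "hang"),
]
for dtype, n, result in cases:
    print(f"{dtype:4} N={n:3} -> {n * BYTES_PER_ELEMENT[dtype]} bytes along N: {result}")
# bf16 N=640 and f32 N=320 both give 1280 bytes along N.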

I have not found anything obvious comparing the IRs for the different sizes.

FYI @erwei-xilinx

erwei-xilinx commented 7 months ago

The current status is summarised in https://github.com/nod-ai/iree-amd-aie/pull/286.

Thanks for doing the sweep. This is really helpful: it helped us identify the issue, which is specific to how the air-split-l2-memref pass handles 4x2 herds; all of the hanging cases turn out to use 4x2 herds.

This PR in AIR should fix this issue. I was able to get those shapes to pass the test locally: https://github.com/Xilinx/mlir-air/pull/562