m=n=k=2304 does not fail compilation as early as m=n=k=2432, so I compared the IR generated by each pass for these two sizes, to see where the 2432 case goes off the rails.
I see the first big difference in the pass 'AIRSpecializeDmaBroadcast',
where 2304 (the "good" case) generates a bunch of
%11 = affine.if #set()[%arg12, %arg13] -> !air.async.token {
...
}
blocks. It seems suspicious to me that there are no equivalent blocks in the 2432 case, since 2432 is divisible by fewer powers of 2 than 2304.
The next pass, DmaToChannel, sees the 2304 IR explode to 707 lines, while the "bad" 2432 case sits at only 315 lines. This again is suspicious: why is the IR smaller for the less divisible tensor size with more elements?
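A rough sketch of how this per-pass comparison can be automated, assuming both runs were compiled with the standard MLIR option --mlir-print-ir-after-all (the dump file names below are placeholders):

def per_pass_line_counts(dump_path):
    """Return (pass banner, IR line count) pairs from an --mlir-print-ir-after-all
    dump, where each pass's output follows an 'IR Dump After ...' banner line."""
    chunks = []
    with open(dump_path) as f:
        for line in f:
            if "IR Dump After" in line:
                chunks.append([line.strip("/ -\n"), 0])
            elif chunks:
                chunks[-1][1] += 1
    return chunks

# Report the passes after which the two runs diverge in IR size.
for (name, n_good), (_, n_bad) in zip(per_pass_line_counts("dump_2304.mlir"),
                                      per_pass_line_counts("dump_2432.mlir")):
    if n_good != n_bad:
        print(f"{name}: {n_good} lines (2304) vs {n_bad} lines (2432)")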
Update: 'AMDAIEPackToDma' is not converting some packs to dmas. In the (bad) 2432 case these aren't converted:
iree_linalg_ext.pack %subview_5 padding_value(%cst : bf16) outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_6 : (memref<38x8xbf16, strided<[2432, 1], offset: ?>, 1 : i32> memref<1x10x4x8xbf16, 2 : i32>)
iree_linalg_ext.pack %subview_7 padding_value(%cst : bf16) outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [8, 4] into %alloc_8 : (memref<8x38xbf16, strided<[152, 1], offset: ?>, 1 : i32> memref<10x1x8x4xbf16, 2 : i32>)
while in the 2304 (good) case these packs are turned into dma copies:
iree_linalg_ext.pack %subview_5 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_6 : (memref<64x64xbf16, strided<[2304, 1], offset: ?>, 1 : i32> memref<8x16x4x8xbf16, 2 : i32>)
iree_linalg_ext.pack %subview_7 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [8, 4] into %alloc_8 : (memref<64x64xbf16, strided<[256, 1], offset: ?>, 1 : i32> memref<16x8x8x4xbf16, 2 : i32>)
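One difference visible in the snippets above: the unconverted 2432 packs carry a padding_value because their source dims are not multiples of the inner tile sizes, whereas the 2304 packs tile evenly. Whether that is why AMDAIEPackToDma skips them is only a guess on my part, but the divisibility itself is easy to check:

def needs_padding(src_shape, inner_tiles):
    """True if any packed dim is not a multiple of its inner tile size."""
    return any(dim % tile != 0 for dim, tile in zip(src_shape, inner_tiles))

print(needs_padding((38, 8), (4, 8)))   # True:  38 rows pad up to 10 tiles of 4 (2432 case)
print(needs_padding((8, 38), (8, 4)))   # True:  38 cols pad up to 10 tiles of 4 (2432 case)
print(needs_padding((64, 64), (4, 8)))  # False: tiles evenly (2304 case)
print(needs_padding((64, 64), (8, 4)))  # False: tiles evenly (2304 case)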
I created a PR to MLIR-AIR which fixes error 1. I was able to get the GEMM shape from error 1 to pass locally (although using an older XRT built in Jan. 2024). However, when I try to land this in IREE-AMD-AIE, it fails in CI with an XRT issue:
[XRT] ERROR: Failed to allocate host memory buffer (mmap(len=10616832, prot=3, flags=8193, offset=4294967296) failed (err=11): Resource temporarily unavailable), make sure host bank is enabled (see xbutil configure --host-mem)
Does anyone know what that means? Please see this PR for details: https://github.com/nod-ai/iree-amd-aie/pull/289
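One observation (the interpretation that this is one of the matmul buffers is an assumption on my part): the failed allocation size is exactly the footprint of a 2304x2304 bf16 buffer.

# Back-of-envelope check: the mmap length from the XRT error above equals
# one 2304x2304 buffer of 2-byte (bf16) elements.
assert 2304 * 2304 * 2 == 10616832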
The current status is summarised in https://github.com/nod-ai/iree-amd-aie/pull/286. There are 3 matmuls which hang during execution. There is another reproducer for this hanging failure mode in: https://github.com/newling/iree-amd-aie/tree/hang_reproducer
The table there, summarizing a set of experiments to try to pin down the hanging behaviour, is:
# Table summarizing some results using bfloat16 operands and float32 result.
#
# M K N Result
# ========================
# 2048 2048 2048 Pass
# 2048 2048 2432 Hang -------- X
# 2048 2432 2048 Pass
# 2048 2432 2432 Hang -------- X
# 2432 2048 2048 Pass
# 2432 2048 2432 Pass
# 2432 2432 2048 Pass
# 2432 2432 2432 Pass
# Similar table with 512 in place of 2048 and 640 in place of 2432. The behaviour is more or less the same:
# M K N Result
# ========================
# 512 512 512 Pass
# 512 512 640 Hang -------- X
# 512 640 512 Pass
# 512 640 640 Hang -------- X
# 640 512 512 Pass
# 640 512 640 Pass
# 640 640 512 Pass
# 640 640 640 Pass
# ========================
# So the rule is: hang iff M != 640 and N == 640.
# Note that 640 = 512 + 128.
# Note that the above tables are with bfloat16 operands, float32 result.
# With float32 operands and float32 result, the test passes in the above cases,
# but if N is halved in size then the test hangs again, suggesting it has to do
# with the number of bytes in the N dimension (640 bf16 elements and 320 f32
# elements are both 1280 bytes). For example, with M = 256, K = 256, N = 320
# we get a hang in float32.
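A small sketch checking the stated rule against the 512/640 table above, plus the byte-count observation:

# Results copied from the 512/640 table above.
results = {
    (512, 512, 512): "Pass", (512, 512, 640): "Hang",
    (512, 640, 512): "Pass", (512, 640, 640): "Hang",
    (640, 512, 512): "Pass", (640, 512, 640): "Pass",
    (640, 640, 512): "Pass", (640, 640, 640): "Pass",
}
# Rule: hang iff M != 640 and N == 640.
for (m, k, n), observed in results.items():
    predicted = "Hang" if (m != 640 and n == 640) else "Pass"
    assert predicted == observed, (m, k, n)

# Byte-count observation: 640 bf16 elements and 320 f32 elements are both
# 1280 bytes, consistent with the hang depending on the byte size of N.
assert 640 * 2 == 320 * 4 == 1280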
I have not found anything obvious comparing the IRs for the different sizes.
FYI @erwei-xilinx
Thanks for doing the sweep. This is really helpful, and it helped us identify the issue, which is specific to how the air-split-l2-memref pass handles 4x2 herds; we found that all hanging cases were with 4x2 herds.
This PR in AIR should fix this issue. I was able to get those shapes to pass the test locally: https://github.com/Xilinx/mlir-air/pull/562
Tracker task for the shapes we should support with direct codegen
Two compiler errors need resolving. The errors can be reproduced with square matmuls.
Error 1: m=n=k=2304 (=2048+256), bf16 operands, f32 output.
The failure is in the airrt-to-ipu pass. The error message is:
@erwei-xilinx is aware of this: https://teams.microsoft.com/l/message/19:meeting_Zjc5ZmZhM2EtZDcxZS00NzYxLTliYmQtNGFlNzY1MDJhNjMy@thread.v2/1713812572221?context=%7B%22contextType%22%3A%22chat%22%7D
Error 2: m=n=k=2432 (=2048+256+128), bf16 operands, f32 output.
The failure is in the air-label-scf-for-to-ping-pong pass.
The error message is quite generic and needs further understanding:
Example input IR: