yzhang93 opened 8 months ago
@erwei-xilinx Can we first try to solve these two failed cases, 8 x 50272 x 2048 and 8 x 8192 x 2048? These are the OPT GEMM sizes. We can try to solve the 8192 x 8192 x k cases later.
@yzhang93 Sure, sounds good to me.
@yzhang93 @erwei-xilinx For me, 8192x2048x8192 fails at xclbin generation (see error below), but I don't think this shape needs to be supported for OPT right now (https://github.com/nod-ai/iree-amd-aie/blob/main/tests/OPT/failing_tests/gemm_i32.mlir)? Maybe it's useful to add the list of all OPT shapes to your description above to track passing shapes and issues.
****** Bootgen v2023.2
**** Build date : Feb 20 2024-08:50:06
** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
** Copyright 2022-2023 Advanced Micro Devices, Inc. All Rights Reserved.
[INFO] : Bootimage generated successfully
XRT Build Version: 2.17.0 (HEAD)
Build Date: 2024-02-20 06:19:26
Hash ID: 22150929f8c3c19b514a515168fce1797db7f28c
Creating a default 'in-memory' xclbin image.
Section: 'MEM_TOPOLOGY'(6) was successfully added.
Size : 88 bytes
Format : JSON
File : '/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb/mem_topology.json'
Section: 'AIE_PARTITION'(32) was successfully added.
Size : 14848 bytes
Format : JSON
File : '/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb/aie_partition.json'
Info: Embedded Metadata section is missing project.platform.device.core element, adding it.
ERROR: The m_name entry length (73), exceeds the allocated space (64). Name: 'matmul_8192x2048_8192xi32__dispatch_0_matmul_8192x2048x8192_i32:MLIRAIEV1'
loc("/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb.aiecc.mlir":1:1): error: 'builtin.module' op failed to execute xclbinutil
loc("/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb.aiecc.mlir":1:1): error: 'builtin.module' op Failed to generate XCLBin
Update: I see a similar error for 8x2048x2048:
...
ERROR: The m_name entry length (67), exceeds the allocated space (64). Name: 'matmul_8x2048_2048xi32__dispatch_0_matmul_8x2048x2048_i32:MLIRAIEV1'
loc("/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8x2048_2048xi32__dispatch_0_amdaie_xclbin_fb/module_matmul_8x2048_2048xi32__dispatch_0_amdaie_xclbin_fb.aiecc.mlir":1:1): error: 'builtin.module' op failed to execute xclbinutil
loc("/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8x2048_2048xi32__dispatch_0_amdaie_xclbin_fb/module_matmul_8x2048_2048xi32__dispatch_0_amdaie_xclbin_fb.aiecc.mlir":1:1): error: 'builtin.module' op Failed to generate XCLBin
Hmm, I was unable to reproduce the error. I could get iree-amd-aie + peano to generate the xclbin using the following workflow:
${IREE_COMPILE} --iree-hal-target-backends=amd-aie \
./input.mlir \
--iree-amdaie-use-pipeline=pad-pack \
--iree-amd-aie-peano-install-dir=/scratch/erweiw/peano/build/install \
--iree-amd-aie-mlir-aie-install-dir=/scratch/erweiw/mlir-aie-standalone/mlir-aie/install \
--iree-amd-aie-vitis-install-dir=/proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2 \
--iree-hal-dump-executable-files-to=./iree-proj --iree-amd-aie-show-invoked-commands
Could you please provide more information?
Ok, thanks for confirming that you don't see this. I will first try to debug this on my end as it's probably a local issue then.
This was caused by the kernel names in the tests I was running with larger shapes overflowing the maximum size enforced by the xclbinutil tool. I created a PR to work around this issue: https://github.com/nod-ai/iree-amd-aie/pull/188
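For reference, a quick length check of the offending name from the log above against that limit (plain Python, purely illustrative; the 64-character bound is taken from the xclbinutil error itself):

```python
# The m_name string rejected by xclbinutil, copied from the error log above.
name = "matmul_8192x2048_8192xi32__dispatch_0_matmul_8192x2048x8192_i32:MLIRAIEV1"

MAX_M_NAME_LEN = 64  # "allocated space" reported in the xclbinutil error

print(len(name))                   # 73, matching the "entry length (73)" in the error
print(len(name) > MAX_M_NAME_LEN)  # True: the name overflows the fixed-size field
```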
Thanks for looking into that. I think the only reason my build didn't run into this is that I was using an older version of mlir-aie.
Now both tests fail with the following error:
<unknown>:0: error: 'aiex.ipu.dma_memcpy_nd' op Stride 2 exceeds the [1:1M] range.
<unknown>:0: note: see current operation: "aiex.ipu.dma_memcpy_nd"(%arg1) <{id = 1 : i64, metadata = @airMemcpyId10, operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 4, 4, 512, 16>, static_strides = array<i64: 16, 12869632, 25136>, x = 0 : i64, y = 0 : i64}> : (memref<51478528xi32>) -> ()
This failure requires some optimization on the BD's wrap-and-stride list.
Want to loop in @nirvedhmeshram to see if this is something he can help with. Otherwise, we'll wait for Erwei or someone else to take care of this issue.
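To make the diagnostic concrete, here is a quick arithmetic check (plain Python, purely illustrative) of the static_strides from the op above against the [1:1M] range; 1M is assumed to mean 2^20 here, though either reading of 1M gives the same answer:

```python
# Sizes and strides copied from the failing aiex.ipu.dma_memcpy_nd op above.
static_sizes = [4, 4, 512, 16]
static_strides = [16, 12869632, 25136]

MAX_STRIDE = 1 << 20  # upper bound of the "[1:1M]" range (assumed to be 2^20)

for i, stride in enumerate(static_strides):
    in_range = 1 <= stride <= MAX_STRIDE
    print(f"stride[{i}] = {stride}: {'ok' if in_range else 'exceeds the [1:1M] range'}")

# Only 12869632 falls outside the range (the diagnostic's own "Stride 2" numbering
# may count dimensions differently), so that is presumably the wrap-and-stride entry
# the optimization mentioned above needs to break up or fold away.
```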
I think this has to be handled in air-to-aie lowering, so let's first see if @erwei-xilinx has some ideas on what to do for this case. Happy to help if there is a task there that I can help with.
@erwei-xilinx For the 2048x2048x8 case, I am seeing a failure in the air-loop-fusion pass. Can you please take a look at the IR below? https://gist.github.com/nirvedhmeshram/1e0eea3c1c34f7eb9a5a8c16c5f41eda
OK. I'm in the process of rewriting part of the air-loop-fusion pass. Will keep you updated on that.
Comment from meeting: planned way forward.
@erwei-xilinx Here is the other failure that I see, a hang in airrt-to-ipu: https://gist.github.com/nirvedhmeshram/613c199575935ca3e4c7195b898d3094
Ok, I have created two PRs in AIR which will get this IR to successfully lower through AIR: https://github.com/Xilinx/mlir-air/pull/517 and https://github.com/Xilinx/mlir-air/pull/518. After they land, plus the new memtile buffer allocator pass AIRSplitL2MemrefForBufferConstraint, this size should work e2e.
Here's my AIR-only dev env to reproduce the flow: https://gist.github.com/erwei-xilinx/44c109a2e6106943fb89b64a219ff6cc
With this PR, this shape (8192x2048x2048) can lower through the pad-pack pipeline using 4x4 AIE tiles, and it works for both i32 and bf16 (i32 takes longer to compile, as i32 leads to a longer IPU sequence).
I have tested a superset of the OPT GEMM sizes, with M/N/K as combinations of [8, 16, 32, 64, 2048, 8192], using the pad-pack pipeline. Most test cases pass; the failing cases are those where two of the M/N/K dimensions are large, e.g., 8192 x 8192 x k, m x 8192 x 2048, and m x 8192 x 8192. The good news is that 8192 x 2048 x 8192 compiles without problem. For the next goal, we should look into those large failing cases, e.g., the OPT size 8 x 50272 x 2048.
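As a rough sketch of that sweep (plain Python, not an existing script; the "large" threshold and tags are just shorthand for the pattern described above):

```python
from itertools import product

# M/N/K values used in the sweep described above.
dims = [8, 16, 32, 64, 2048, 8192]
LARGE = 2048  # rough threshold: the reported failures all have two "large" dims

for m, n, k in product(dims, repeat=3):
    two_large = sum(d >= LARGE for d in (m, n, k)) >= 2
    # This is only the rough pattern from the comment above; note that
    # 8192 x 2048 x 8192 is reported to compile fine despite matching it.
    tag = "check" if two_large else "expected-pass"
    print(f"{m} x {n} x {k}: {tag}")
```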