nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator

OPT failure tracker with pad-pack pipeline #185

Open yzhang93 opened 8 months ago

yzhang93 commented 8 months ago

I have tested a superset of the OPT GEMM sizes, with M/N/K drawn from all combinations of [8, 16, 32, 64, 2048, 8192], using the pad-pack pipeline. Most test cases pass; the failing cases are those where two of the M/N/K dimensions are large, e.g., 8192 x 8192 x k, m x 8192 x 2048, and m x 8192 x 8192. The good news is that 8192 x 2048 x 8192 compiles without problems.

For the next goal, we should look into those large failing cases, e.g., the OPT size 8 x 50272 x 2048.
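
As a side note, a sweep like this can be scripted; below is a minimal sketch (hypothetical paths and file names, with the backend and pipeline flags matching the iree-compile invocation shared later in this thread):

# Hypothetical sweep over the GEMM test matrix described above; the
# per-shape .mlir inputs and output paths are placeholders.
import itertools
import subprocess

SIZES = [8, 16, 32, 64, 2048, 8192]

for m, n, k in itertools.product(SIZES, repeat=3):
    mlir = f"matmul_{m}x{n}x{k}_i32.mlir"  # assumed pre-generated test input
    result = subprocess.run(
        ["iree-compile", mlir,
         "--iree-hal-target-backends=amd-aie",
         "--iree-amdaie-use-pipeline=pad-pack",
         "-o", f"matmul_{m}x{n}x{k}.vmfb"],
        capture_output=True,
    )
    print(f"{m}x{n}x{k}: {'PASS' if result.returncode == 0 else 'FAIL'}")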

yzhang93 commented 8 months ago

@erwei-xilinx Can we first try to solve these two failed cases 8 x 50272 x 2048 and 8 x 8192 x 2048? These are the OPT GEMM sizes. We can try to solve the 8192 x 8192 x k later.

erwei-xilinx commented 8 months ago

@yzhang93 Sure, sounds good to me.

jtuyls commented 8 months ago

@yzhang93 @erwei-xilinx For me, 8192x2048x8192 fails at xclbin generation (see the error below), but I don't think this shape needs to be supported for OPT right now (https://github.com/nod-ai/iree-amd-aie/blob/main/tests/OPT/failing_tests/gemm_i32.mlir). Maybe it would be useful to add the list of all OPT shapes to your description above, to track passing shapes and issues?

****** Bootgen v2023.2
  **** Build date : Feb 20 2024-08:50:06
    ** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
    ** Copyright 2022-2023 Advanced Micro Devices, Inc. All Rights Reserved.

[INFO]   : Bootimage generated successfully

XRT Build Version: 2.17.0 (HEAD)
       Build Date: 2024-02-20 06:19:26
          Hash ID: 22150929f8c3c19b514a515168fce1797db7f28c
Creating a default 'in-memory' xclbin image.

Section: 'MEM_TOPOLOGY'(6) was successfully added.
Size   : 88 bytes
Format : JSON
File   : '/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb/mem_topology.json'

Section: 'AIE_PARTITION'(32) was successfully added.
Size   : 14848 bytes
Format : JSON
File   : '/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb/aie_partition.json'
Info: Embedded Metadata section is missing project.platform.device.core element, adding it.
ERROR: The m_name entry length (73), exceeds the allocated space (64).  Name: 'matmul_8192x2048_8192xi32__dispatch_0_matmul_8192x2048x8192_i32:MLIRAIEV1'
loc("/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb.aiecc.mlir":1:1): error: 'builtin.module' op failed to execute xclbinutil
loc("/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb/module_matmul_8192x2048_8192xi32__dispatch_0_amdaie_xclbin_fb.aiecc.mlir":1:1): error: 'builtin.module' op Failed to generate XCLBin

Update: I see a similar error for 8x2048x2048:

...
ERROR: The m_name entry length (67), exceeds the allocated space (64).  Name: 'matmul_8x2048_2048xi32__dispatch_0_matmul_8x2048x2048_i32:MLIRAIEV1'
loc("/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8x2048_2048xi32__dispatch_0_amdaie_xclbin_fb/module_matmul_8x2048_2048xi32__dispatch_0_amdaie_xclbin_fb.aiecc.mlir":1:1): error: 'builtin.module' op failed to execute xclbinutil
loc("/proj/rdi/staff/jornt/versal/iree-amd-aie/tests/matmul/module_matmul_8x2048_2048xi32__dispatch_0_amdaie_xclbin_fb/module_matmul_8x2048_2048xi32__dispatch_0_amdaie_xclbin_fb.aiecc.mlir":1:1): error: 'builtin.module' op Failed to generate XCLBin
erwei-xilinx commented 8 months ago

> Update: I see a similar error for 8x2048x2048:

Hmm, I was unable to reproduce the error. I could get iree-amd-aie + peano to generate the xclbin using the following workflow:

${IREE_COMPILE} ./input.mlir \
  --iree-hal-target-backends=amd-aie \
  --iree-amdaie-use-pipeline=pad-pack \
  --iree-amd-aie-peano-install-dir=/scratch/erweiw/peano/build/install \
  --iree-amd-aie-mlir-aie-install-dir=/scratch/erweiw/mlir-aie-standalone/mlir-aie/install \
  --iree-amd-aie-vitis-install-dir=/proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2 \
  --iree-hal-dump-executable-files-to=./iree-proj \
  --iree-amd-aie-show-invoked-commands

Could you please provide more information?

jtuyls commented 8 months ago

> Hmm, I was unable to reproduce the error. [...] Could you please provide more information?

OK, thanks for confirming that you don't see this. I'll first try to debug this on my end, as it's probably a local issue then.

jtuyls commented 8 months ago

> OK, thanks for confirming that you don't see this. I'll first try to debug this on my end, as it's probably a local issue then.

This was caused by the kernel names for the larger shapes I was testing overflowing the maximum name length enforced by the xclbinutil tool. I created a PR to work around this issue: https://github.com/nod-ai/iree-amd-aie/pull/188
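
For illustration, one possible form of such a workaround is sketched below, shortening any name that would overflow the 64-byte field while keeping distinct names distinct; this is only a sketch, and the actual fix in the PR may take a different approach:

import hashlib

MAX_M_NAME = 64  # limit enforced by xclbinutil on the m_name entry

def shorten_kernel_name(name: str) -> str:
    # Keep a recognizable prefix and append a short hash so that two
    # different long names do not collapse to the same shortened name.
    # Illustrative only; not necessarily what the PR does.
    if len(name) <= MAX_M_NAME:
        return name
    digest = hashlib.sha1(name.encode()).hexdigest()[:8]
    return name[: MAX_M_NAME - 9] + "_" + digest

# 55-char prefix + "_" + 8-char hash = exactly 64 characters.
print(shorten_kernel_name(
    "matmul_8192x2048_8192xi32__dispatch_0_matmul_8192x2048x8192_i32:MLIRAIEV1"))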

erwei-xilinx commented 8 months ago

> This was caused by the kernel names for the larger shapes I was testing overflowing the maximum name length enforced by the xclbinutil tool. [...]

Thanks for looking into that. I think my build didn't run into this only because I was using an older version of mlir-aie.

yzhang93 commented 7 months ago

Now both tests fail with the following error:

<unknown>:0: error: 'aiex.ipu.dma_memcpy_nd' op Stride 2 exceeds the [1:1M] range.
<unknown>:0: note: see current operation: "aiex.ipu.dma_memcpy_nd"(%arg1) <{id = 1 : i64, metadata = @airMemcpyId10, operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 4, 4, 512, 16>, static_strides = array<i64: 16, 12869632, 25136>, x = 0 : i64, y = 0 : i64}> : (memref<51478528xi32>) -> ()

Fixing this failure requires some optimization of the BD's wrap-and-stride list.

Want to loop in @nirvedhmeshram to see if this is something he can help with. Otherwise, we'll wait for Erwei or someone else to take care of this issue.
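
For context, the "[1:1M]" range presumably means strides up to 2^20 = 1048576; here is a quick check of the static_strides from the failing op (a sketch, under that assumption):

# Check the static_strides from the failing dma_memcpy_nd op above.
# Assumes the "[1:1M]" range in the error means 1 .. 2**20.
LIMIT = 1 << 20  # 1048576

for stride in (16, 12869632, 25136):
    status = "exceeds" if stride > LIMIT else "fits in"
    print(f"stride {stride}: {status} the [1:1M] range")
# Only 12869632 exceeds the limit (by roughly 12x), which is why the
# wrap-and-stride list needs reworking for these shapes.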

nirvedhmeshram commented 7 months ago

> Want to loop in @nirvedhmeshram to see if this is something he can help with. [...]

I think this has to be handled in the air-to-aie lowering, so let's first see if @erwei-xilinx has some ideas on what to do for this case? Happy to help if there is a task there that I can take on.

nirvedhmeshram commented 7 months ago

@erwei-xilinx For the 2048x2048x8 case I am seeing a failure in the air-loop-fusion pass; can you please take a look at the IR below? https://gist.github.com/nirvedhmeshram/1e0eea3c1c34f7eb9a5a8c16c5f41eda

erwei-xilinx commented 7 months ago

> For the 2048x2048x8 case I am seeing a failure in the air-loop-fusion pass; can you please take a look at the IR below? [...]

OK. I'm in the process of rewriting part of the air-loop-fusion pass. Will keep you updated on that.

SamuelBayliss commented 7 months ago

Comment from meeting: the planned way forward.

1. Add an intuitive error message and run the broken layer (where strides are too big) on the CPU.
2. Prototype an alternative tiling choice in IRON: show we can handle skinny matrices with huge strides, possibly by submitting multiple BDs with different base addresses (see the sketch after this list).
3. Work on core mechanisms that select an alternative tensor layout (using the physical layout in the memref) and logical BD generation (lowered to a sequence of multiple physical BDs) to support any tiling pattern.
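
To make item 2 above concrete, here is a hypothetical sketch of the multiple-BD idea: when one dimension's stride exceeds the hardware range, fold that stride into per-BD base addresses instead of encoding it in the BD itself (illustrative only, not the actual IRON prototype):

LIMIT = 1 << 20  # assumed [1:1M] stride range

def split_oversized_dim(base, size, stride):
    # Yield (base_address, size, stride) descriptors for one dimension.
    # If the stride fits the hardware range, a single BD suffices;
    # otherwise emit one BD per step, with the stride folded into the
    # base address.
    if stride <= LIMIT:
        yield (base, size, stride)
    else:
        for i in range(size):
            yield (base + i * stride, 1, 1)

# Example with the oversized stride from the log above: four BDs whose
# base addresses step by 12869632 elements.
for bd in split_oversized_dim(0, 4, 12869632):
    print(bd)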
nirvedhmeshram commented 7 months ago

@erwei-xilinx Here is the other failure I see, a hang in airrt-to-ipu: https://gist.github.com/nirvedhmeshram/613c199575935ca3e4c7195b898d3094

erwei-xilinx commented 7 months ago

> For the 2048x2048x8 case I am seeing a failure in the air-loop-fusion pass; can you please take a look at the IR below? [...]

OK, I have created two PRs in AIR that get this IR to lower successfully through AIR (https://github.com/Xilinx/mlir-air/pull/517 and https://github.com/Xilinx/mlir-air/pull/518). Once they land, together with the new memtile buffer allocator pass AIRSplitL2MemrefForBufferConstraint, this size should work e2e.

Here's my AIR-only dev env to reproduce the flow: https://gist.github.com/erwei-xilinx/44c109a2e6106943fb89b64a219ff6cc

erwei-xilinx commented 7 months ago

> Here is the other failure I see, a hang in airrt-to-ipu: https://gist.github.com/nirvedhmeshram/613c199575935ca3e4c7195b898d3094

With this PR, this shape (8192x2048x2048) can lower through the pad-pack pipeline using 4x4 AIE tiles, and it works for both i32 and bf16 (i32 takes longer to compile, as it leads to a longer IPU sequence).