nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Problem with python bindings and AIE xclbin #467

Open josemonsalve2 opened 6 days ago

josemonsalve2 commented 6 days ago

Hi all,

I am having some issues with the compiler Python bindings and AIE support.

In my setup, I have the XRT and XDNA driver built in debug mode.

For the following code:

!A_TYPE = tensor<64x64xbf16>
!B_TYPE = tensor<64x64xbf16>
!C_TYPE = tensor<64x64xf32>
func.func @matmul_small_1(%lhs : !A_TYPE,
    %rhs : !B_TYPE) -> !C_TYPE {
  %empty = tensor.empty() : !C_TYPE
  %cst = arith.constant 0.0 : f32
  %fill = linalg.fill ins(%cst : f32) outs(%empty : !C_TYPE) -> !C_TYPE
  %2 = linalg.matmul ins(%lhs, %rhs : !A_TYPE, !B_TYPE)
      outs(%fill : !C_TYPE) -> !C_TYPE
  return %2 : !C_TYPE
}

I can build it with the following command:

iree-compile --mlir-disable-threading --iree-hal-executable-debug-level=3 --iree-hal-target-backends=amd-aie test.mlir --iree-amd-aie-mlir-aie-install-dir=/opt/iree/deps/mlir-aie/my_install/mlir_aie --iree-amd-aie-peano-install-dir=/opt/iree/deps/ -o test.vmfb

And I get this output:

> iree-compile --mlir-disable-threading --iree-hal-executable-debug-level=3 --iree-hal-target-backends=amd-aie test.mlir --iree-amd-aie-mlir-aie-install-dir=/opt/iree/deps/mlir-aie/my_install/mlir_aie --iree-amd-aie-peano-install-dir=/opt/iree/deps/ -o test.vmfb
Generating: /tmp/amdaie_xclbin_fb-17e446/aie_cdo_elfs.bin
Generating: /tmp/amdaie_xclbin_fb-17e446/aie_cdo_init.bin
Generating: /tmp/amdaie_xclbin_fb-17e446/aie_cdo_enable.bin

****** Bootgen v2024.1
  **** Build date : Jun 18 2024-22:04:45
    ** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
    ** Copyright 2022-2024 Advanced Micro Devices, Inc. All Rights Reserved.

[INFO]   : Bootimage generated successfully

XRT Build Version: 2.18.0 (master)
       Build Date: 2024-06-24 16:35:34
          Hash ID: e1a296a6b8769204088d16867c1553821835c272
Creating a default 'in-memory' xclbin image.

Section: 'MEM_TOPOLOGY'(6) was successfully added.
Size   : 88 bytes
Format : JSON
File   : '/tmp/amdaie_xclbin_fb-17e446/mem_topology.json'

Section: 'AIE_PARTITION'(32) was successfully added.
Size   : 4624 bytes
Format : JSON
File   : '/tmp/amdaie_xclbin_fb-17e446/aie_partition.json'
Info: Embedded Metadata section is missing project.platform.device.core element, adding it.
Successfully wrote (11080 bytes) to the output file: /tmp/amdaie_xclbin_fb-17e446/matmul_small_1_dispatch_0_matmul_64x64x64__0.xclbin
Leaving xclbinutil.

The problem seems to be the output from the AIE binary utilities. When using the Python bindings, this text is prepended to the compiler's output.

Take the following Python code as an example:

import iree.runtime as rt
import iree.compiler as cp

import numpy as np

## This is the code that will be compiled

AIE_MATMUL_ASM = """
!A_TYPE = tensor<64x64xbf16>
!B_TYPE = tensor<64x64xbf16>
!C_TYPE = tensor<64x64xf32>
func.func @matmul_small_1(%lhs : !A_TYPE,
    %rhs : !B_TYPE) -> !C_TYPE {
  %empty = tensor.empty() : !C_TYPE
  %cst = arith.constant 0.0 : f32
  %fill = linalg.fill ins(%cst : f32) outs(%empty : !C_TYPE) -> !C_TYPE
  %2 = linalg.matmul ins(%lhs, %rhs : !A_TYPE, !B_TYPE)
      outs(%fill : !C_TYPE) -> !C_TYPE
  return %2 : !C_TYPE
}
"""

def compile_for_AIE():
    """
    Compile code for the AMD AIE via the `amd-aie` backend.
    We must also specify the install locations for
    llvm-aie (Peano) and mlir-aie.
    """

    binary = cp.tools.compile_str(
        AIE_MATMUL_ASM,
        target_backends=["amd-aie"],
        extra_args=[
            "--iree-amd-aie-mlir-aie-install-dir=/opt/iree/deps/mlir-aie/my_install/mlir_aie",
            "--iree-amd-aie-peano-install-dir=/opt/iree/deps/",
        ],
    )
    return binary

def execution():
    """
    Execute code in both devices
    """

    aie_code = compile_for_AIE()
    config = rt.Config("xrt")
    ctx = rt.SystemContext(config=config)
    vm_module = rt.VmModule.copy_buffer(ctx.instance, aie_code)
    # ctx.add_vm_module(vm_module)
    # arg0 = np.ones((64,64), dtype=np.float16)
    # arg1 = np.ones((64,64), dtype=np.float16)
    # f = ctx.modules.module['matmul_small_1']
    # result_aie = f(arg0, arg1).to_host()

execution()

I get the following error:

> python3 test.py
50: PID(350954): Created KMQ pcidev
119277458: PID(350954): Device opened, fd=3
119363449: PID(350954): Allocated KMQ BO (userptr=0x7f0dec000000, size=50331648, flags=0x0, type=2, drm_bo=1)
119375232: PID(350954): Created KMQ device (0000:c5:00.1) ...
Traceback (most recent call last):
  File "/home/jmonsalv/tmp/iree/test.py", line 60, in <module>
    execution()
  File "/home/jmonsalv/tmp/iree/test.py", line 52, in execution
    vm_module = rt.VmModule.copy_buffer(ctx.instance, aie_code)
ValueError: Error creating vm module from aligned memory: iree/runtime/src/iree/vm/bytecode/archive.c:122: INVALID_ARGUMENT; FlatBuffer length prefix out of bounds (prefix is 707398154 but only 17609 available)
196583671: PID(350954): Destroying KMQ pcidev
196613928: PID(350954): Device node fd leaked!! fd=3

If I print the output of the compilation, I get the following. Python code:

def compile_for_AIE():
    """
    Compile code for the AMD AIE via the `amd-aie` backend.
    We must also specify the install locations for
    llvm-aie (Peano) and mlir-aie.
    """
    text = cp.tools.compile_str(
        AIE_MATMUL_ASM,
        target_backends=["amd-aie"],
        output_format=cp.tools.OutputFormat.MLIR_TEXT,
        extra_args=[
            "--iree-amd-aie-mlir-aie-install-dir=/opt/iree/deps/mlir-aie/my_install/mlir_aie",
            "--iree-amd-aie-peano-install-dir=/opt/iree/deps/",
        ],
    )

    print(text.decode("utf-8"))

    binary = cp.tools.compile_str(
        AIE_MATMUL_ASM,
        target_backends=["amd-aie"],
        extra_args=[
            "--iree-amd-aie-mlir-aie-install-dir=/opt/iree/deps/mlir-aie/my_install/mlir_aie",
            "--iree-amd-aie-peano-install-dir=/opt/iree/deps/",
        ],
    )
    print(binary)
    return binary

The text output [truncated]:

> python3 test.py

****** Bootgen v2024.1
  **** Build date : Jun 18 2024-22:04:45
    ** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
    ** Copyright 2022-2024 Advanced Micro Devices, Inc. All Rights Reserved.

[INFO]   : Bootimage generated successfully

XRT Build Version: 2.18.0 (master)
       Build Date: 2024-06-24 16:35:34
          Hash ID: e1a296a6b8769204088d16867c1553821835c272
Creating a default 'in-memory' xclbin image.

Section: 'MEM_TOPOLOGY'(6) was successfully added.
Size   : 88 bytes
Format : JSON
File   : '/tmp/amdaie_xclbin_fb-48e170/mem_topology.json'

Section: 'AIE_PARTITION'(32) was successfully added.
Size   : 4624 bytes
Format : JSON
File   : '/tmp/amdaie_xclbin_fb-48e170/aie_partition.json'
Info: Embedded Metadata section is missing project.platform.device.core element, adding it.
Successfully wrote (11080 bytes) to the output file: /tmp/amdaie_xclbin_fb-48e170/matmul_small_1_dispatch_0_matmul_64x64x64__0.xclbin
Leaving xclbinutil.
Generating: /tmp/amdaie_xclbin_fb-48e170/aie_cdo_elfs.bin
Generating: /tmp/amdaie_xclbin_fb-48e170/aie_cdo_init.bin
Generating: /tmp/amdaie_xclbin_fb-48e170/aie_cdo_enable.bin
vm.module public @module attributes {ordinal_counts = #vm.ordinal_counts<import_funcs = 19, export_funcs = 2, internal_funcs = 2, global_bytes = 4, global_refs = 2, rodatas = 6, rwdatas = 0>} {
  vm.global.i32 private mutable @_device_query_0 {ordinal = 0 : i32} : i32 loc(callsite("<stdin>":10:8 at "<stdin>":5:1))
...

The binary output [truncated]:

b'\n\n****** Bootgen v2024.1\n  **** Build date : Jun 18 2024-22:04:45\n    ** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.\n    ** Copyright 2022-2024 Advanced Micro Devices, Inc. All Rights Reserved.\n\n\n[INFO]   : Bootimage generated successfully\n\nXRT Build Version: 2.18.0 (master)\n       Build Date: 2024-06-24 16:35:34\n          Hash ID: e1a296a6b8769204088d16867c1553821835c272\nCreating a default \'in-memory\' xclbin image.\n\nSection: \'MEM_TOPOLOGY\'(6) was successfully added.\nSize   : 88 bytes\nFormat : JSON\nFile   : \'/tmp/amdaie_xclbin_fb-d75b5d/mem_topology.json\'\n\nSection: \'AIE_PARTITION\'(32) was successfully added.\nSize   : 4624 bytes\nFormat : JSON\nFile   : \'/tmp/amdaie_xclbin_fb-d75b5d/aie_partition.json\'\nInfo: Embedded Metadata section is missing project.platform.device.core element, adding it.\nSuccessfully wrote (11080 bytes) to the output file: /tmp/amdaie_xclbin_fb-d75b5d/matmul_small_1_dispatch_0_matmul_64x64x64__0.xclbin\nLeaving xclbinutil.\nGenerating: /tmp/amdaie_xclbin_fb-d75b5d/aie_cdo_elfs.bin\nGenerating: /tmp/amdaie_xclbin_fb-d75b5d/aie_cdo_init.bin\nGenerating: /tmp/amdaie_xclbin_fb-d75b5d/aie_cdo_enable.bin\nPK\x03\x04-\x00\x00\
...

Note the text at the beginning:

\n\n****** Bootgen v2024.1\n  **** Build date : Jun 18 2024-22:04:45...

This text corrupts the module passed to the copy_buffer function.
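
The number in the error message appears consistent with this. Assuming the loader reads the first four bytes of the buffer as a little-endian 32-bit length prefix (an assumption on my side, not checked against archive.c), the bytes `\n\n**` that start the Bootgen banner decode to exactly the reported value:

```python
import struct

# First four bytes of the captured output shown above (start of the Bootgen banner).
leading_bytes = b"\n\n**"

# Assumption: the length prefix is an unsigned 32-bit little-endian integer.
(prefix,) = struct.unpack("<I", leading_bytes)
print(prefix)  # 707398154, the value reported in the ValueError
```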

My current workaround is to search for PK\x03\x04 and remove everything before it. That seems to work.

def execution():
    """
    Execute code in both devices
    """

    aie_code = compile_for_AIE()
    clear = aie_code.find(b"PK\x03\x04")
    if clear != -1:
        aie_code = aie_code[clear:]

    config = rt.Config("xrt")
    ctx = rt.SystemContext(config=config)
    vm_module = rt.VmModule.copy_buffer(ctx.instance, aie_code)
    # ctx.add_vm_module(vm_module)
    # arg0 = np.ones((64,64), dtype=np.float16)
    # arg1 = np.ones((64,64), dtype=np.float16)
    # f = ctx.modules.module['matmul_small_1']
    # result_aie = f(arg0, arg1).to_host()

But this is not a proper solution.
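
For anyone who needs the same stopgap, here is a slightly more defensive sketch of the stripping step. The helper name `strip_tool_output` is hypothetical, and it assumes the tool log never contains the four magic bytes itself. Unlike the snippet above, it fails loudly when the marker is missing instead of passing a bad buffer along:

```python
ZIP_LOCAL_HEADER = b"PK\x03\x04"  # magic bytes seen at the start of the real module above


def strip_tool_output(blob: bytes) -> bytes:
    """Return blob starting at the first ZIP local-file header.

    Hypothetical helper: assumes the tool log never contains these four bytes.
    """
    start = blob.find(ZIP_LOCAL_HEADER)
    if start == -1:
        raise ValueError("no ZIP header found; buffer is not the expected module")
    return blob[start:]


# Small stand-in for the compiler output: log text followed by module bytes.
noisy = b"\n\n****** Bootgen v2024.1\nLeaving xclbinutil.\n" + ZIP_LOCAL_HEADER + b"-\x00\x00"
clean = strip_tool_output(noisy)
print(clean[:4])
```

It is still a workaround: the log text and the module should not share a stream in the first place.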

newling commented 6 days ago

Makes sense, thanks for the clear description. I can confirm that when I run

iree-compile --mlir-disable-threading --iree-hal-target-backends=amd-aie test.mlir --iree-amd-aie-mlir-aie-install-dir=dir0 --iree-amd-aie-peano-install-dir=dir1 --iree-amd-aie-vitis-install-dir=dir2 -o test.vmfb

I see

Generating: /tmp/amdaie_xclbin_fb-085db8/aie_cdo_elfs.bin
Generating: /tmp/amdaie_xclbin_fb-085db8/aie_cdo_init.bin
Generating: /tmp/amdaie_xclbin_fb-085db8/aie_cdo_enable.bin

****** Bootgen v2024.1
  **** Build date : Jun 14 2024-13:59:59
    ** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
    ** Copyright 2022-2024 Advanced Micro Devices, Inc. All Rights Reserved.

[INFO]   : Bootimage generated successfully

XRT Build Version: 2.18.22 (master)
       Build Date: 2024-05-21 20:14:13
          Hash ID: c678a9469f9b20fcb9a04bbedb5c51f8473faec0
Creating a default 'in-memory' xclbin image.

Section: 'MEM_TOPOLOGY'(6) was successfully added.
Size   : 88 bytes
Format : JSON
File   : '/tmp/amdaie_xclbin_fb-085db8/mem_topology.json'

Section: 'AIE_PARTITION'(32) was successfully added.
Size   : 76592 bytes
Format : JSON
File   : '/tmp/amdaie_xclbin_fb-085db8/aie_partition.json'
Info: Embedded Metadata section is missing project.platform.device.core element, adding it.
Successfully wrote (82706 bytes) to the output file: /tmp/amdaie_xclbin_fb-085db8/matmul_int32_dispatch_0_matmul_128x128x256_0.xclbin
Leaving xclbinutil.

And that logging should really be off by default (I'm not even sure if there's an option to remove it). I'll investigate this.

I suppose that's a partial solution (better than your current one!), but I also wonder if there's a better fix at the Python level to separate any logging that might appear from the vmfb, i.e. an equivalent of the "-o test.vmfb" we use on the command line. I'm not sure if this already exists (I haven't used the Python API); maybe a good question for the public IREE Discord channel.
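
A rough sketch of what that separation could look like from Python, without relying on any particular binding feature: run the compiler as a subprocess with an explicit -o path and read the module back from that file, so that anything printed to stdout or stderr ends up in a separate log. The function name `compile_to_file` is hypothetical, and a tiny stand-in script replaces iree-compile here so the example runs anywhere:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def compile_to_file(cmd, out_path):
    """Run `cmd -o out_path`; return (module_bytes, captured_log).

    Hypothetical helper: the tool is expected to write its result to the
    path given after -o, as iree-compile does on the command line.
    """
    proc = subprocess.run(
        [*cmd, "-o", str(out_path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep all tool chatter in one log
        check=True,
    )
    return Path(out_path).read_bytes(), proc.stdout


# Stand-in for the real compiler: prints a banner, then writes bytes to the -o path.
FAKE_TOOL = (
    "import sys; print('****** Bootgen banner'); "
    "open(sys.argv[sys.argv.index('-o') + 1], 'wb').write(b'PK\\x03\\x04payload')"
)

with tempfile.TemporaryDirectory() as tmp:
    module, log = compile_to_file(
        [sys.executable, "-c", FAKE_TOOL], Path(tmp) / "test.vmfb"
    )

print(module[:4], b"Bootgen" in log)
```

With the real compiler, `cmd` would be the iree-compile invocation above (input file and flags, minus the -o part).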

josemonsalve2 commented 6 days ago

Yes, I felt this was more of an IREE bindings bug than an AIE bug. But since it is hard to reproduce outside of this setup right now, I might as well start here.

newling commented 6 days ago

Removed some of the printing here: https://github.com/Xilinx/mlir-aie/pull/1583/files and here: https://github.com/Xilinx/bootgen/pull/34

So that should make it nice and quiet on the iree-compile side.

No idea how long it will take for the above PRs to filter through to us, though.