nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Implement multi-kernel XCLBIN #380

Open nirvedhmeshram opened 1 month ago

nirvedhmeshram commented 1 month ago

This issue is an amd-aie backend counterpart of upstream issue https://github.com/iree-org/iree/issues/7824

We will initially follow the SPIR-V/Vulkan commit sequence below:

  1. https://github.com/iree-org/iree/pull/15785
  2. https://github.com/iree-org/iree/pull/15788 (looks like a bug fix; will merge with 1)
  3. https://github.com/iree-org/iree/pull/15789
  4. https://github.com/iree-org/iree/pull/15802

Once this is done, we will have an XCLBIN archive that contains multiple XCLBINs per executable.

Then, in the last stage, we will use the utility provided in https://github.com/Xilinx/mlir-aie/pull/1508 to merge the XCLBINs and achieve the desired multi-kernel XCLBIN.

nirvedhmeshram commented 1 week ago

I did some profiling on what we gain by having an XCLBIN with multiple PDIs. My test example had three kernels, and I measured the loading time in the following setups:

  1. Make a new xclbin for each kernel and load the xclbin and then the corresponding kernel from it.
  2. Make one xclbin for all three kernels but load it three times, once for each kernel.
  3. Make one xclbin, load it only once and get the three kernels from it.
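The three setups can be sketched roughly as below. This is only an illustration of the load patterns being compared, not the code used for the measurements: `FakeRuntime`, `load_xclbin`, `get_kernel`, the kernel names, and the file names are all hypothetical stand-ins that count calls instead of touching the device.

```python
import time


class FakeRuntime:
    """Hypothetical stand-in for the real runtime; it only counts calls."""

    def __init__(self):
        self.xclbin_loads = 0
        self.kernel_lookups = 0

    def load_xclbin(self, path):
        self.xclbin_loads += 1
        return {"path": path}

    def get_kernel(self, xclbin, name):
        self.kernel_lookups += 1
        return (xclbin["path"], name)


KERNELS = ["k0", "k1", "k2"]  # assumed names for the three kernels


def setup_1(rt):
    # One xclbin per kernel: load each xclbin, then fetch its kernel.
    return [rt.get_kernel(rt.load_xclbin(f"{k}.xclbin"), k) for k in KERNELS]


def setup_2(rt):
    # One merged xclbin, but loaded again for every kernel.
    return [rt.get_kernel(rt.load_xclbin("merged.xclbin"), k) for k in KERNELS]


def setup_3(rt):
    # One merged xclbin, loaded once; all kernels are fetched from it.
    xclbin = rt.load_xclbin("merged.xclbin")
    return [rt.get_kernel(xclbin, k) for k in KERNELS]


def measure(setup):
    rt = FakeRuntime()
    start = time.perf_counter()
    setup(rt)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    return rt.xclbin_loads, rt.kernel_lookups, elapsed_ms


results = {fn.__name__: measure(fn) for fn in (setup_1, setup_2, setup_3)}
for name, (loads, lookups, ms) in results.items():
    print(f"{name}: {loads} xclbin load(s), {lookups} kernel lookup(s), {ms:.3f} ms")
```

Setups 1 and 2 each perform three xclbin loads and three kernel lookups, while setup 3 performs one load and three lookups, so the number of kernel lookups is the same in all three.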

Here are the times I saw for small GEMM shapes (M, N, K = 32, 32, 32) and large GEMM shapes (M, N, K = 1024, 1024, 32):

| Setup | Small GEMM | Large GEMM |
| --- | --- | --- |
| 1 | 19.82 ms | 21.66 ms |
| 2 | 24.11 ms | 22.37 ms |
| 3 | 19.88 ms | 21.53 ms |

I think we can conclude that the main bottleneck is getting the kernel from the xclbin itself, and that merging into one xclbin does not change the time taken to load the kernels much.
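To put numbers on that, here is a small sketch that computes the change of setups 2 and 3 relative to setup 1, using the times from the table above. The dictionary layout and the `relative_change` helper are just for illustration.

```python
# Load times in ms, copied from the table above.
times_ms = {
    1: {"small": 19.82, "large": 21.66},
    2: {"small": 24.11, "large": 22.37},
    3: {"small": 19.88, "large": 21.53},
}


def relative_change(setup, shape, baseline=1):
    """Percent change of `setup` versus `baseline` for the given GEMM shape."""
    base = times_ms[baseline][shape]
    return (times_ms[setup][shape] - base) / base * 100.0


for setup in (2, 3):
    for shape in ("small", "large"):
        print(f"setup {setup} vs 1, {shape}: {relative_change(setup, shape):+.1f}%")
```

Setup 3 (single merged xclbin, loaded once) stays within about 1% of setup 1 for both shapes, which is consistent with the merge not affecting load time much.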

nirvedhmeshram commented 1 week ago

@kumardeepakamd please see the findings above. It looks like we need to do the PDI loads differently to actually benefit from xclbin merging?