Open nirvedhmeshram opened 5 months ago
I did some profiling on what we gain from having an XCLBIN with multiple PDIs. My test example had three kernels, and I measured the loading time in the following setups:
Here are the loading times I saw for a small GEMM shape (M,N,K = 32,32,32) and a large GEMM shape (M,N,K = 1024,1024,32):
| Setup | Small | Large |
|---|---|---|
| 1 | 19.82 ms | 21.66 ms |
| 2 | 24.11 ms | 22.37 ms |
| 3 | 19.88 ms | 21.53 ms |
I think we can conclude that the main bottleneck is getting the kernel out of the XCLBIN itself, and that merging into one XCLBIN isn't changing the kernel loading time much.
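To put numbers on that comparison, here is a minimal Python sketch that computes the percent change of each setup relative to setup 1. The timings are copied from the table; treating setup 1 as the baseline is my assumption for illustration, and the names `load_times_ms` and `percent_change_vs_baseline` are hypothetical helpers, not part of any tool mentioned here.

```python
# Load times in milliseconds, copied from the table in this comment.
load_times_ms = {
    1: {"small": 19.82, "large": 21.66},
    2: {"small": 24.11, "large": 22.37},
    3: {"small": 19.88, "large": 21.53},
}


def percent_change_vs_baseline(times, baseline=1):
    """Return the percent change of every setup relative to `baseline`.

    Using setup 1 as the default baseline is an assumption made for this sketch.
    """
    base = times[baseline]
    return {
        setup: {
            shape: 100.0 * (t - base[shape]) / base[shape]
            for shape, t in row.items()
        }
        for setup, row in times.items()
    }


changes = percent_change_vs_baseline(load_times_ms)
for setup, row in changes.items():
    print(f"setup {setup}: small {row['small']:+.1f}%, large {row['large']:+.1f}%")
```

With these numbers, setups 1 and 3 differ by less than 1% for both shapes, while setup 2 is noticeably slower only for the small shape. These are single measurements per cell, so small differences may well be within run-to-run noise.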
@kumardeepakamd please see the findings above. It looks like we need to load the PDIs differently to actually benefit from XCLBIN merging?
This issue is an amd-aie backend counterpart of upstream issue https://github.com/iree-org/iree/issues/7824
We will initially follow the SPIR-V/vulkan commit sequence below
Once this is done, we will have an XCLBIN archive that contains multiple XCLBINs per executable.
Then, in the last stage, we will use the utility provided in https://github.com/Xilinx/mlir-aie/pull/1508 to merge the XCLBINs and achieve the desired multi-kernel XCLBIN.