yzhang93 opened 2 months ago
For the first point, I didn't see a significant performance change after switching from a single buffer to a double buffer. However, performance increases significantly when the L1/L2 tile sizes are increased (this requires the single buffer to avoid exceeding the memory bound).
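To make that tradeoff concrete, here's a rough back-of-the-envelope sketch (my own, not tied to the actual allocation logic): the per-level buffer footprint scales with depth x tile_m x tile_n x element size, so dropping from double to single buffering halves the footprint while doubling both tile dimensions quadruples it. The larger-tile configuration therefore still needs roughly 2x the memory of the baseline, which is why it only fits with a single buffer. K-dim tiling and the number of live operands are deliberately left out, since they scale all configurations uniformly.

```python
# Rough footprint scaling only; actual operand shapes/counts are left out.
BYTES_BF16 = 2

def tile_footprint(tile, depth):
    """Bytes for one square (tile x tile) bf16 buffer at the given buffering depth."""
    return depth * tile * tile * BYTES_BF16

baseline = tile_footprint(64, 2)  # current setting: depth 2, tile size 64
for tile, depth in [(64, 2), (64, 1), (128, 1), (128, 2)]:
    fp = tile_footprint(tile, depth)
    print(f"tile={tile:3d} depth={depth}: {fp:6d} B ({fp / baseline:.1f}x baseline)")
```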
Here are some comparison results on the matmul shapes from VAE. The execution time is the average of 10 runs.
Current parameter settings: L2 depth = 2, L1 depth = 2; L2 tile size = 64, L1 tile size = 32

| Dispatch Type | Shape | dtype | Compilation Time [ms] | Execution Time [ms] (Phoenix) |
|---|---|---|---|---|
| matmul | 256x65536x512 | bf16 | 18799 | 1268.7 |
| matmul | 128x262144x256 | bf16 | 20865 | 1668.5 |
| matmul_transpose_b | 4096x512x512 | bf16 | 1917 | 153.7 |
Now use a single buffer: L2 depth = 1, L1 depth = 1; L2 tile size = 64, L1 tile size = 32

| Dispatch Type | Shape | dtype | Compilation Time [ms] | Execution Time [ms] (Phoenix) |
|---|---|---|---|---|
| matmul | 256x65536x512 | bf16 | 17863 | 1269.5 |
| matmul | 128x262144x256 | bf16 | 19951 | 1669.8 |
| matmul_transpose_b | 4096x512x512 | bf16 | 1260 | 169 |
Now increase the tile sizes: L2 depth = 1, L1 depth = 1; L2 tile size = 128, L1 tile size = 64

| Dispatch Type | Shape | dtype | Compilation Time [ms] | Execution Time [ms] (Phoenix) |
|---|---|---|---|---|
| matmul | 256x65536x512 | bf16 | 4624 | 765.5 |
| matmul | 128x262144x256 | bf16 | 3080 | 1198.8 |
| matmul_transpose_b | 4096x512x512 | bf16 | 975 | 148 |
To add more details on 2), see for example this piece of control code for a 128x128x128 matmul after the `DmaComposition` pass:
```mlir
scf.forall (%arg0, %arg1) in (2, 2) {
  %41 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
  %42 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
  ...
  %45 = amdaie.npu.circular_dma_cpy_nd %8([0] [2048] [1], [] [] [])
  %46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, 0, %41] [4, 2, 32, 32] [4096, 32, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
  amdaie.npu.dma_wait(%46, MM2S)
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```
Here `%46` has 4 dimensions on the source side, and as this is the limit, the loop iteration (see the dependency through `%41`) can't be subsumed into the DMA's source dimensions anymore. However, some of the dimensions on the source side could potentially be moved to the target side (which currently has a linear write access pattern, as can be seen in `%45`). This would typically result in a larger read by the source DMA port, and the target DMA port would then take care of writing the result in the expected blocked format, with the resulting IR:
```mlir
scf.forall (%arg0, %arg1) in (2, 2) {
  %41 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
  %42 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
  ...
  %45 = amdaie.npu.circular_dma_cpy_nd %8([0, 0, 0, 0] [4, 32, 2, 32] [2048, 32, 1024, 1], [] [] [])
  %46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, %41] [4, 32, 64] [4096, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
  amdaie.npu.dma_wait(%46, MM2S)
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```
Or after canonicalization:
```mlir
scf.forall (%arg0, %arg1) in (2, 2) {
  %41 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
  %42 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
  ...
  %45 = amdaie.npu.circular_dma_cpy_nd %8([0, 0, 0, 0] [4, 32, 2, 32] [2048, 32, 1024, 1], [] [] [])
  %46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %41] [128, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
  amdaie.npu.dma_wait(%46, MM2S)
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```
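As a sanity check on this idea, here's a small standalone Python sketch (not the pass implementation) that expands the strided access patterns above and confirms that moving dimensions from the source side to the target side preserves which source element lands in which L2 slot. The modulo-2048 wrap used to model the circular object FIFO and taking `%41 = 0` (the first loop iteration) are assumptions on my side; a nonzero `%41` only shifts all source addresses equally.

```python
from itertools import product

def expand(offsets, sizes, strides):
    """Flat addresses of an (offsets, sizes, strides) pattern in transfer order."""
    return [sum(o + i * s for o, i, s in zip(offsets, idx, strides))
            for idx in product(*(range(n) for n in sizes))]

L2_SLOTS = 2048  # circular buffer length, taken from the [0] [2048] [1] pattern

# Before: 4-D read on the source side, plain linear (wrapping) write on the target side.
before = list(zip(expand([0, 0, 0, 0], [4, 2, 32, 32], [4096, 32, 128, 1]),
                  [a % L2_SLOTS for a in expand([0], [8192], [1])]))

# After: 2-D (canonicalized) read on the source side, 4-D blocked circular write.
after = list(zip(expand([0, 0], [128, 64], [128, 1]),
                 [a % L2_SLOTS for a in expand([0, 0, 0, 0], [4, 32, 2, 32],
                                               [2048, 32, 1024, 1])]))

# Same (source address -> L2 slot) pairs, just produced in a different order.
assert sorted(before) == sorted(after)
print(f"mapping preserved for all {len(before)} elements")
```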
After this transformation, the source access pattern is left with only 2 dimensions, so the `DmaLoopSubsumption` transformation can be applied again to reduce the number of NPU instructions:
```mlir
%45 = amdaie.npu.circular_dma_cpy_nd %8([0, 0, 0, 0] [4, 32, 2, 32] [2048, 32, 1024, 1], [] [] [])
%46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, 0] [2, 128, 64] [64, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
scf.forall (%arg0, %arg1) in (2, 2) {
  ...
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
amdaie.npu.dma_wait(%46, MM2S)
```
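For reference, a minimal Python model (my own sketch, not the actual pass) of what the subsumption does to this pattern: a normalized loop whose induction variable only contributes `iv * scale` to one offset can be folded in by prepending an outermost dimension with size = trip count and stride = scale. The real `DmaLoopSubsumption` pass of course also has to respect the hardware limit on the number of DMA dimensions, which is exactly what blocked it before the source dimensions were moved to the target side.

```python
def subsume_loop(offsets, sizes, strides, trip_count, iv_scale, iv_dim):
    """Fold a normalized loop whose induction variable adds `iv * iv_scale` to the
    offset of dimension `iv_dim` into the pattern as a new outermost dimension."""
    offsets = list(offsets)
    offsets[iv_dim] = 0  # the induction-variable contribution moves into the new dim
    return [0] + offsets, [trip_count] + list(sizes), [iv_scale] + list(strides)

# Canonicalized source pattern %31[0, %41] [128, 64] [128, 1] from above, where
# %41 = %arg1 * 64 and %arg1 iterates over (0, 2):
print(subsume_loop([0, 0], [128, 64], [128, 1], trip_count=2, iv_scale=64, iv_dim=1))
# -> ([0, 0, 0], [2, 128, 64], [64, 128, 1]), i.e. the subsumed dma_cpy_nd above.
```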
Optimizing 2) would also reduce compile time. For the larger matmul shapes above, the `AMDAIEControlCodeLoopUnroll` pass creates O(1e5) operations, and everything thereafter is slow (canonicalization alone takes O(1) seconds).
I have point 2) optimized and working correctly for most shapes. However, the tests with a large K size (>=1024) have a numerics issue. Here's a simplified version of the code (with just the L3-to-L2 DMA addressing change) that I made for testing purposes: https://github.com/nod-ai/iree-amd-aie/pull/809.
Note that if I disable the second `LoopSubsumptionPass`/`DmaComposition`, then all the tests pass, which means the changes within `convert-to-dma` work without problems. The problem seems to happen in `LoopSubsumptionPass` (maybe due to the changes I made to relax the `npu.circular_dma` constraint)?
Here are the IR dumps for 128x128x256 (working) and 128x128x1024 (failing) for comparison.
@jtuyls do you have any idea about this?
UPDATE: This is currently solved by not subsuming loop iterations for large K sizes (>=1024), since they would exceed the size limit after inserting the new dimensions.
This issue is used as a tracker for ideas and discussion to improve the performance of matmul ops. The data type for all these matmuls is bf16.
Some existing ideas include:
@jtuyls Feel free to add more points and details.