nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Add DMA loop iteration subsumption #495

Closed jtuyls closed 2 months ago

jtuyls commented 3 months ago

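DMA loop iteration subsumption tries to fold scf.for loop iterations into the DMA operations themselves: the loop's induction variable is absorbed into the DMA access pattern (offsets, sizes, strides), after which the DMA operation no longer depends on the loop and can be hoisted out of it. There are two reasons for needing this: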

  1. As the backend compiler (MLIR-AIE) currently doesn't support reprogramming DMAs on memtiles and cores, we need to avoid generating control code instructions that target those DMAs to move data back and forth between L2 and L1.
  2. Control code instructions executed by the uController to program DMAs are quite expensive, so we should issue as few of them as possible. Subsuming the loops into the DMA access patterns significantly reduces the number of control code instructions.

Example input IR for this transformation:

func.func @subsume_iteration_into_dma_and_hoist(%arg0: memref<1x1x8x16xi32, 2>, %arg1: memref<8x16xi32, 1>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c8 = arith.constant 8 : index
  %0 = amdaie.logicalobjectfifo.from_memref %arg0, {} : memref<1x1x8x16xi32, 2> -> !amdaie.logicalobjectfifo<memref<1x1x8x16xi32, 2>>
  %1 = amdaie.logicalobjectfifo.from_memref %arg1, {} : memref<8x16xi32, 1> -> !amdaie.logicalobjectfifo<memref<8x16xi32, 1>>
  scf.for %arg2 = %c0 to %c8 step %c1 {
    %apply = affine.apply affine_map<(d0) -> (d0 * 32)>(%arg2)
    %2 = amdaie.circular_dma_cpy_nd(%0[] [] [], %1[0, %apply, 0, 0] [1, 1, 8, 8] [128, 128, 16, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<8x16xi32, 1>>)
  }
  return
}

Expected output IR:

func.func @subsume_iteration_into_dma_and_hoist(%arg0: memref<1x1x8x16xi32, 2>, %arg1: memref<8x16xi32, 1>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c8 = arith.constant 8 : index
  %0 = amdaie.logicalobjectfifo.from_memref %arg0, {} : memref<1x1x8x16xi32, 2> -> !amdaie.logicalobjectfifo<memref<1x1x8x16xi32, 2>>
  %1 = amdaie.logicalobjectfifo.from_memref %arg1, {} : memref<8x16xi32, 1> -> !amdaie.logicalobjectfifo<memref<8x16xi32, 1>>
  %2 = amdaie.circular_dma_cpy_nd(%0[] [] [], %1[0, 0, 0, 0] [1, 8, 8, 8] [256, 32, 16, 1]) : (!amdaie.logicalobjectfifo<memref<1x1x8x16xi32, 2>>, !amdaie.logicalobjectfifo<memref<8x16xi32, 1>>)
  return
}

With the loop subsumed, the DMAs that implement this amdaie.circular_dma_cpy_nd operation are programmed only once, and no control code is needed on the uController side to update them.
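
To make the idea concrete, here is a minimal Python sketch of folding a loop into an access pattern. It is not the code of the actual pass. It assumes a simplified addressing model in which the linear address is sum((offset_i + index_i) * stride_i), and it always prepends the loop as a new outermost dimension. The names AccessPattern, looped_addresses and subsume_loop are hypothetical, and the real pass may place or canonicalize dimensions differently, so the numbers below are not meant to reproduce the IR above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AccessPattern:
    """Hypothetical stand-in for the (offsets, sizes, strides) of one DMA side."""
    offsets: tuple
    sizes: tuple
    strides: tuple

    def addresses(self):
        # Assumed model: address = sum((offset_i + index_i) * stride_i),
        # iterating the outermost (first) dimension slowest.
        return [
            sum((o + i) * st for o, i, st in zip(self.offsets, idx, self.strides))
            for idx in product(*(range(s) for s in self.sizes))
        ]


def looped_addresses(pattern, dim, scale, lb, ub, step):
    """Reference behaviour: one DMA per iteration, offsets[dim] += scale * iv."""
    out = []
    for iv in range(lb, ub, step):
        offsets = list(pattern.offsets)
        offsets[dim] += scale * iv
        out += AccessPattern(tuple(offsets), pattern.sizes, pattern.strides).addresses()
    return out


def subsume_loop(pattern, dim, scale, lb, ub, step):
    """Fold the loop into the pattern as a new outermost dimension."""
    trip_count = len(range(lb, ub, step))
    offsets = list(pattern.offsets)
    offsets[dim] += scale * lb  # the lower bound becomes a constant offset
    return AccessPattern(
        offsets=(0, *offsets),
        sizes=(trip_count, *pattern.sizes),
        # Each iteration advances offsets[dim] by scale * step elements of that dim.
        strides=(scale * step * pattern.strides[dim], *pattern.strides),
    )


# Hypothetical example: an 8x8 tile with strides (16, 1), whose row offset is 8 * iv.
tile = AccessPattern(offsets=(0, 0), sizes=(8, 8), strides=(16, 1))
hoisted = subsume_loop(tile, dim=0, scale=8, lb=0, ub=4, step=1)

# The single hoisted pattern touches the same addresses, in the same order,
# as the four per-iteration patterns.
assert hoisted.addresses() == looped_addresses(tile, dim=0, scale=8, lb=0, ub=4, step=1)
```

In this sketch, the hoisted pattern has sizes (4, 8, 8) and strides (128, 16, 1): one DMA programming replaces four, which is the saving described in the two reasons above.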

jtuyls commented 2 months ago

Added initial support with https://github.com/nod-ai/iree-amd-aie/pull/512. That PR doesn't yet add the new DMA loop subsumption pass to the objectFifo lowering pipeline, because making it work end-to-end (E2E) requires changes to how BD (buffer descriptor) IDs are assigned to amdaie.npu.dma_memcpy_nd operations. Earlier, all operations used id == 0, which is only valid if the control code is executed synchronously, operation by operation (every DMA operation directly followed by a wait). As this is no longer the case, new logic is needed to assign ids to the amdaie.npu.dma_memcpy_nd operations. This will be addressed in a follow-up PR.

jtuyls commented 2 months ago

Done with:

  1. https://github.com/nod-ai/iree-amd-aie/pull/512 to add initial support; on its own it behaved incorrectly in E2E because of incorrect BD IDs.
  2. https://github.com/nod-ai/iree-amd-aie/pull/551 to generate correct BD IDs.