nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Use direct lowering for pack/unpack operations to AIE DMA operation representation #148

Open powderluv opened 9 months ago

powderluv commented 9 months ago

Currently the iree_linalg_ext.pack operations, which map well to the AIE DMA operations, are decomposed to air.dma_memcpy_nd, which doesn't seem to model them as well since it doesn't have a concept of wrap or inner tile. We should add a new op in the AIR dialect that maps one-to-one: iree_linalg_ext.pack -> air.dma_op -> aie.dma_op
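For concreteness, here is the general shape of the op in question (a hypothetical instance; the operand names, shapes, and tile sizes are invented for illustration):

    // Pack a plain 8x16 buffer into a 1x1x8x16 tiled layout. inner_tiles gives the
    // inner tile, and the outer dims iterate over the tiles (the wraps) -- the two
    // concepts that the DMA-level representation needs to preserve.
    iree_linalg_ext.pack %src inner_dims_pos = [0, 1] inner_tiles = [8, 16] into %dst : (memref<8x16xi32, 1> memref<1x1x8x16xi32, 2>)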

nirvedhmeshram commented 9 months ago

@erwei-xilinx , please let me know what your thoughts are on this approach.

erwei-xilinx commented 9 months ago

@powderluv @nirvedhmeshram I will need more context about this issue. The air.dma_memcpy_nd op does have fields which correspond to wraps: the op is essentially a combination of one memcpy op plus two subviews (one for the src memref and one for the dst). Each subview in the op comes with a sizes field, which is essentially the wraps. The subview also comes with strides and offsets to represent an inner tile.
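For reference, a minimal sketch of the op's structure (operand names, shapes, and constants here are invented; the field layout follows the ops quoted later in this thread):

    // air.dma_memcpy_nd (dst[offsets] [sizes, i.e. wraps] [strides],
    //                    src[offsets] [sizes, i.e. wraps] [strides])
    // Here an 8x16 inner tile is read out of an 8x128 buffer (src strides
    // [128, 1]) and written contiguously into an 8x16 buffer (dst strides [16, 1]).
    air.dma_memcpy_nd (%dst[%c0, %c0] [%c8, %c16] [%c16, %c1], %src[%c0, %c0] [%c8, %c16] [%c128, %c1]) : (memref<8x16xi32, 1>, memref<8x128xi32>)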

Currently we do have a workflow which can perform a one-to-one mapping from iree_linalg_ext.pack/unpack to air.dma_memcpy_nd. This workflow consists of two passes: the first, called iree-amdaie-decompose-pack-unpack-to-air, currently lives in the iree-amd-aie repo; the second, called air-copy-to-dma, is under mlir-air.

From air.dma_memcpy_nd down it is more complicated. aie.dma doesn't exist, and the mlir-aie dialect represents data movers using DMA BD (buffer descriptor) programs which map to physical AIE DMAs. The pass -air-to-aie in mlir-air currently handles the lowering from the logical air.dma_memcpy_nd to physical AIE DMA BDs.

erwei-xilinx commented 9 months ago

air.dma_memcpy_nd, as a logical representation of data movement, is currently agnostic to AIE architecture versions: the op is compatible with both the AIE1 and AIE2 architectures. Architecture-specific hardware constraints are enforced in the downstream passes which lower the logical data movement to physical resources.

We can squash the whole process and put the pack-to-AIE flow within one single pass, but the mapping may not be one-to-one if the target is one single DMA BD. Due to AIE2 BD constraints (a wrap cannot go beyond [0, 1023], the number of dims cannot go beyond 4 for shim/memtile DMAs and 3 for tile DMAs, only the highest dim's wrap can do repeat, etc.), one logical pack/unpack data movement may need to lower to a finite-state machine consisting of multiple BDs. For example, a single dimension with a wrap of 2048 exceeds the [0, 1023] range and has to be split, e.g. into two wraps of 1024, using either an extra BD dimension or a second BD.

MaheshRavishankar commented 9 months ago

This makes sense... I think the main idea is to fold together the logic of iree-amdaie-decompose-pack-unpack-to-air and what happens in the air-copy-to-dma pass. They shouldn't need to be split into two parts. @nirvedhmeshram, you and I may need to find some time to look at the IR and go a bit deeper on this issue.

erwei-xilinx commented 9 months ago

I think unifying iree-amdaie-decompose-pack-unpack-to-air and air-copy-to-dma makes sense to me, now that MLIR-AIR is being integrated into IREE. The original rationale behind this separation was that MLIR-AIR could use the memref.transpose/subview/expand_shape ops as a frontend in addition to iree_linalg_ext.pack/unpack.

Here are some unit tests on the iree-amd-aie side for iree-amdaie-decompose-pack-unpack-to-air, and here are some on the mlir-air side for air-copy-to-dma.

nirvedhmeshram commented 8 months ago

@erwei-xilinx could you please explain func1 in the decompose-pack lowering here? I am wondering why we needed an expand_shape rather than keeping the same buffer dimensions from the original pack op. Secondly, when I try to apply the condensation pattern from air-copy-to-dma (note I didn't apply the entire pass) on the resulting IR, I get a crash.

erwei-xilinx commented 8 months ago

@nirvedhmeshram The memref.expand_shape op expands the shape of the memref, increasing its rank from 4 to 6; the memref.transpose op then performs a permutation, exchanging dim 2 with dim 4. I don't think memref.transpose can permute a rank-4 memref into a rank-6 memref.
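As a rough sketch of that two-step structure (shapes, memory space, and the result layout are invented for illustration; the expand_shape syntax is the pre-output_shape form in use at the time of this thread):

    // Raise the rank from 4 to 6 by splitting each 32-dim into a 4x8 tile.
    %e = memref.expand_shape %m [[0], [1], [2, 3], [4, 5]] : memref<1x1x32x32xi32, 1> into memref<1x1x4x8x4x8xi32, 1>
    // Exchange dim 2 with dim 4; the result layout is the identity strides
    // [1024, 1024, 256, 32, 8, 1] permuted accordingly.
    %t = memref.transpose %e (d0, d1, d2, d3, d4, d5) -> (d0, d1, d4, d3, d2, d5) : memref<1x1x4x8x4x8xi32, 1> to memref<1x1x4x8x4x8xi32, strided<[1024, 1024, 8, 32, 256, 1]>, 1>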

The output of that test is expected to fail in -air-copy-to-dma, because it is just a snapshot of a series of memref layout transformation ops, where I deleted the "layout" field of the input memref type to memref.expand_shape for simplicity. If you look at func6, which is the complete GEMM example, and search for memref.expand_shape, you will see the original form of that func1 test, and how the memref.expand_shape input inherits its layout from its predecessors.

If the chain of memref layout transformations is incomplete, then -air-copy-to-dma cannot successfully fold the memref ops into one memref type. The crash comes from there. We should probably emit a proper error message there...

nirvedhmeshram commented 8 months ago

@erwei-xilinx I was wondering about how offset propagation is done when we rank-reduce in the current lowering. Let's take this DMA operation post-condensation: https://gist.github.com/nirvedhmeshram/8e7508a34a5d40b55c1a980c6f93496b#file-condensed_dma-mlir-L33

If you compare it with the starting point here: https://github.com/nod-ai/iree-amd-aie/blob/main/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/pack_to_air.mlir#L144

Since the offset step can be calculated as size x stride, in the final DMA case the offset step that will be applied is 2x8 = 16, but in the original pack case the offset step is 1x128 = 128. So is this technically correct, or is it something our codegen is currently unaffected by because these kinds of offsets are always 0?

erwei-xilinx commented 8 months ago

@nirvedhmeshram I think at the moment it's the latter, if I understand the question correctly. The pack lowering to AIR hasn't been exercised extensively with offsets that are not zero. Once lowered to AIR, we have some examples showing how offsets are propagated.

nirvedhmeshram commented 8 months ago

@erwei-xilinx thanks for clarifying. Also, after sending that message I realized I might not be thinking of offsets correctly: are they just multiplied with the stride to get the starting element, or are they multiplied with the stride AND the size? In either case I think the lowering is not ending up with the correct offset, but understanding the semantics will help us do the correct offset lowering in the new pattern.

MaheshRavishankar commented 8 months ago

> @erwei-xilinx thanks for clarifying. Also, after sending that message I realized I might not be thinking of offsets correctly: are they just multiplied with the stride to get the starting element, or are they multiplied with the stride AND the size? In either case I think the lowering is not ending up with the correct offset, but understanding the semantics will help us do the correct offset lowering in the new pattern.

Are we talking about subview or the DMA op?

nirvedhmeshram commented 8 months ago

> > @erwei-xilinx thanks for clarifying. Also, after sending that message I realized I might not be thinking of offsets correctly: are they just multiplied with the stride to get the starting element, or are they multiplied with the stride AND the size? In either case I think the lowering is not ending up with the correct offset, but understanding the semantics will help us do the correct offset lowering in the new pattern.

> Are we talking about subview or the DMA op?

I think we are agreeing that the DMA op is not offsetting by the same number of elements as the subview op in the pack representation, in the two IRs I linked.

nirvedhmeshram commented 8 months ago

> > @erwei-xilinx thanks for clarifying. Also, after sending that message I realized I might not be thinking of offsets correctly: are they just multiplied with the stride to get the starting element, or are they multiplied with the stride AND the size? In either case I think the lowering is not ending up with the correct offset, but understanding the semantics will help us do the correct offset lowering in the new pattern.

> Are we talking about subview or the DMA op?

I was asking about the DMA, but I could not find documentation for subview either, although I would assume it is just multiplied with the stride in that case.

erwei-xilinx commented 8 months ago

@nirvedhmeshram I think it might be useful to capture any offset errors with a unit test, which would help show whether the pack is not generating the memref.subview correctly, or the memref.subview is not generating the DMA correctly.

MaheshRavishankar commented 8 months ago

> > > @erwei-xilinx thanks for clarifying. Also, after sending that message I realized I might not be thinking of offsets correctly: are they just multiplied with the stride to get the starting element, or are they multiplied with the stride AND the size? In either case I think the lowering is not ending up with the correct offset, but understanding the semantics will help us do the correct offset lowering in the new pattern.

> > Are we talking about subview or the DMA op?

> I was asking about the DMA, but I could not find documentation for subview either, although I would assume it is just multiplied with the stride in that case.

It's hard to describe in chat. Let's chat sometime tomorrow about it.

nirvedhmeshram commented 8 months ago

I have a pattern for pack-to-DMA here: https://github.com/nod-ai/iree-amd-aie/pull/178

nirvedhmeshram commented 8 months ago

Here is how the IR looks: https://gist.github.com/nirvedhmeshram/188a0770133a0e03c40d234776a57080. The first thing to verify is whether this is the correct lowering. @erwei-xilinx @stephenneuendorffer @MaheshRavishankar

nirvedhmeshram commented 8 months ago

also @jtuyls @yzhang93

jtuyls commented 8 months ago

Hi @nirvedhmeshram, I have a couple of questions:

1) In func0, why do we have [%c128, %c16, %c16, %c1] here on the src strides? I would have expected [%c128, %c128, %c16, %c1]. Because size is 1 for that dimension, it shouldn't matter, but it looks weird to me that 16 is generated instead of 128.

air.dma_memcpy_nd (%alloc[%c0, %c0, %c0, %c0] [%c1, %c1, %c8, %c16] [%c128, %c128, %c16, %c1], %alloc_0[%c0, %c0, %c0, %c0] [%c1, %c1, %c8, %c16] [%c128, %c16, %c16, %c1]) : (memref<1x1x8x16xi32, 1>, memref<8x16xi32, 1>)

2) Right now, the pack operation seems to be completely offloaded to the src side of the air.dma_memcpy_nd, and the dest is always a linear write. Is this on purpose, or for simplicity? This does have some implications, for example exceeding BD dimensionality or running out of BDs more quickly due to the limited dimensionality of the BDs. Also, in general there is a choice to be made here with multiple solutions, and this can affect performance (maybe a consideration for down the road). A different way of converting pack would be to split the transformation across src and dest, for example by handling the tile part on the src side and the transpose part on the dest side. Not sure whether this matters, as this could potentially be rearranged/optimized down the line as well. Any thoughts from you, @erwei-xilinx?

nirvedhmeshram commented 8 months ago

> Hi @nirvedhmeshram, I have a couple of questions:
>
> 1. In func0, why do we have [%c128, %c16, %c16, %c1] here on the src strides? I would have expected [%c128, %c128, %c16, %c1]. Because size is 1 for that dimension, it shouldn't matter, but it looks weird to me that 16 is generated instead of 128.
>
>    air.dma_memcpy_nd (%alloc[%c0, %c0, %c0, %c0] [%c1, %c1, %c8, %c16] [%c128, %c128, %c16, %c1], %alloc_0[%c0, %c0, %c0, %c0] [%c1, %c1, %c8, %c16] [%c128, %c16, %c16, %c1]) : (memref<1x1x8x16xi32, 1>, memref<8x16xi32, 1>)
>
> 2. Right now, the pack operation seems to be completely offloaded to the src side of the air.dma_memcpy_nd, and the dest is always a linear write. Is this on purpose, or for simplicity? This does have some implications, for example exceeding BD dimensionality or running out of BDs more quickly due to the limited dimensionality of the BDs. Also, in general there is a choice to be made here with multiple solutions, and this can affect performance (maybe a consideration for down the road). A different way of converting pack would be to split the transformation across src and dest, for example by handling the tile part on the src side and the transpose part on the dest side. Not sure whether this matters, as this could potentially be rearranged/optimized down the line as well. Any thoughts from you, @erwei-xilinx?

1. Great observation about the strides; it took me some time to digest, and I definitely want to have a discussion about it. We are setting the strides for accessing the packed layout defined by memref<1x1x8x16>; let's call these dimensions <AxBxtiledAxtiledB>. The interesting thing I learned is that the tiledA dimension is still strided as outer to the B dimension, so the memory layout is: iterate A (originally stride 16; due to tiling the stride becomes original stride x tile size = 16x8 = 128), iterate tiledA (stride = original dim A stride = 16), iterate B (originally stride 1; due to tiling the stride becomes original stride x tile size = 1x16 = 16), iterate tiledB (stride = original dim B stride = 1).

So the new strides for <AxBxtiledAxtiledB> are <128x16x16x1> (the first 16 being the tiled stride of B, the second the original stride of A).

2. I haven't yet thought much about this problem, but I agree with you that there are hardware implications, so it's important to explore different code generation methods here. We should be able to do that at the linalg level and enable support down the stack. Let's see what @erwei-xilinx and others think about it.
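To spell out the arithmetic (a re-derivation of the func0 op quoted above, given the memref<8x16xi32, 1> source with strides [16, 1] and inner tiles [8, 16]):

    // src dims <A, B, tiledA, tiledB>, sizes [%c1, %c1, %c8, %c16]:
    //   outer A stride = original A stride * tileA = 16 * 8  = 128
    //   outer B stride = original B stride * tileB =  1 * 16 =  16
    //   tiledA stride  = original A stride         =  16
    //   tiledB stride  = original B stride         =   1
    air.dma_memcpy_nd (%alloc[%c0, %c0, %c0, %c0] [%c1, %c1, %c8, %c16] [%c128, %c128, %c16, %c1], %alloc_0[%c0, %c0, %c0, %c0] [%c1, %c1, %c8, %c16] [%c128, %c16, %c16, %c1]) : (memref<1x1x8x16xi32, 1>, memref<8x16xi32, 1>)
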
erwei-xilinx commented 8 months ago

@nirvedhmeshram and @jtuyls Re 2: the current pack lowering in AIR generates wraps and strides which operate only on sources, not destinations. Although this doesn't mean that the destination-side DMAs are not doing any data rearrangement (tiling generates memref.subviews on destination-side DMA BDs), there could still be extra hardware benefit in offloading some data rearrangement to the destination side. AIR currently isn't able to do that.

In AIR, pack and unpack operations are not the only operations that generate wraps and strides for DMA BDs. The memref.subviews generated from tiling can also rely on DMA BDs to do data rearrangement. Given that the pack-to-dma lowering is now direct, there may be a need for an additional mechanism to update the wraps and strides if there's an extra memref.subview?

nirvedhmeshram commented 8 months ago

> In AIR, pack and unpack operations are not the only operations that generate wraps and strides for DMA BDs. The memref.subviews generated from tiling can also rely on DMA BDs to do data rearrangement. Given that the pack-to-dma lowering is now direct, there may be a need for an additional mechanism to update the wraps and strides if there's an extra memref.subview?

If the subview is consumed by a pack/unpack operation, the direct lowering should fold it in; if there are other consumers of the subview, then yes, we potentially need a new lowering.

erwei-xilinx commented 8 months ago

@nirvedhmeshram I did a quick experiment with a Transform Dialect tiling script, and I see that at the moment the memref.subview ops (or tensor.extract_slice) do not get automatically folded into pack. See the code snippet below:

                %7 = affine.apply #map1()[%arg2]
                %8 = affine.apply #map2()[%arg4]
                %subview_4 = memref.subview %alloc[%7, %8] [32, 32] [1, 1] : memref<64x2048xi32, 1> to memref<32x32xi32, strided<[2048, 1], offset: ?>, 1>
                %alloc_5 = memref.alloc() : memref<4x8x4x8xi32, 2>
                iree_linalg_ext.pack %subview_4 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_5 : (memref<32x32xi32, strided<[2048, 1], offset: ?>, 1> memref<4x8x4x8xi32, 2>)

I think your point makes sense, and it should be legal to fold them into pack ops.
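As a sketch of what that folding might produce for the snippet above (hypothetical output, not from the PR; it assumes the usual %cN constants are in scope and that each offset is multiplied only by its corresponding stride, per the offset-semantics discussion earlier in this thread):

    // The subview's offsets (%7, %8) land on the src dims whose strides are the
    // parent's original strides (2048 and 1); the wraps/strides encode the pack
    // (outer_dims_perm = [1, 0], inner_tiles = [4, 8]) on the parent buffer.
    air.dma_memcpy_nd (%alloc_5[%c0, %c0, %c0, %c0] [%c4, %c8, %c4, %c8] [%c256, %c32, %c8, %c1], %alloc[%c0, %c0, %7, %8] [%c4, %c8, %c4, %c8] [%c8, %c8192, %c2048, %c1]) : (memref<4x8x4x8xi32, 2>, memref<64x2048xi32, 1>)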

nirvedhmeshram commented 8 months ago

@erwei-xilinx I was thinking the folding will happen when doing the pack-to-dma lowering; I already have an example in the IR linked in this message: https://github.com/nod-ai/iree-amd-aie/issues/148#issuecomment-1959952187

erwei-xilinx commented 8 months ago

I see. I understand now. Thanks!

jtuyls commented 8 months ago

> Great observation about the strides; it took me some time to digest, and I definitely want to have a discussion about it. We are setting the strides for accessing the packed layout defined by memref<1x1x8x16>; let's call these dimensions <AxBxtiledAxtiledB>. The interesting thing I learned is that the tiledA dimension is still strided as outer to the B dimension, so the memory layout is ...

Thanks, this makes sense now for the case where the tiling size of the last dimension == the last dimension's size. It would be useful for me, though, to add another example/test like func1 but without outer_dims_perm, to see how the transformation works on a pack with tiling sizes different from the initial sizes but no permutation happening. This could go directly into the PR though.

nirvedhmeshram commented 8 months ago

@jtuyls I have updated the gist to include the case you suggested and also added unpack examples: https://gist.github.com/nirvedhmeshram/188a0770133a0e03c40d234776a57080

Going to do a cleanup, add tests, and mark the PR ready for review soon as well.