powderluv opened this issue 9 months ago
@erwei-xilinx , please let me know what your thoughts are on this approach.
@powderluv @nirvedhmeshram I will need more context about this issue. The air.dma_memcpy_nd op does have fields which correspond to wraps: the op is essentially a combination of one memcpy op plus two subviews (one for the src memref and one for the dst). Each subview in the op comes with a sizes field, which is essentially the wraps. The subview also comes with strides and offsets to represent an inner tile.
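For intuition, here is a minimal Python sketch (an illustration, not the actual AIR implementation) of how an offsets/sizes/strides triple describes one side of such a copy: the sizes are the wraps, and element addresses are generated by nested loops.

```python
from itertools import product

def dma_addresses(offsets, sizes, strides):
    """Yield flat element addresses for one side of an n-d memcpy,
    described by per-dimension offsets, sizes (wraps), and strides."""
    base = sum(o * s for o, s in zip(offsets, strides))
    for idx in product(*(range(n) for n in sizes)):
        yield base + sum(i * s for i, s in zip(idx, strides))

# Example: read a 2x4 inner tile out of a row-major buffer with 8
# columns, starting at row 1 (the values here are illustrative).
tile = list(dma_addresses(offsets=[1, 0], sizes=[2, 4], strides=[8, 1]))
# tile == [8, 9, 10, 11, 16, 17, 18, 19]
```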
Currently we do have a workflow which can perform a one-to-one mapping from iree_linalg_ext.pack/unpack to air.dma_memcpy_nd. This workflow consists of two passes: the first, called iree-amdaie-decompose-pack-unpack-to-air, currently lives in the iree-amd-aie repo; the second, called air-copy-to-dma, lives in mlir-air.
From air.dma_memcpy_nd down it is more complicated. An aie.dma op doesn't exist; the mlir-aie dialect represents data movers using DMA BD (buffer descriptor) programs which map to physical AIE DMAs. The pass -air-to-aie in mlir-air currently handles the lowering from logical air.dma_memcpy_nd to physical AIE DMA BDs.
air.dma_memcpy_nd, as a logical representation of data movement, is currently agnostic with respect to AIE architecture versions: the op is compatible with both AIE1 and AIE2 architectures. Architecture-specific hardware constraints are enforced in the downstream passes which lower the logical data movement onto physical resources.
We can squash the whole process and put the pack-to-aie flow within one single pass, but it may not be one-to-one if the latter is a single DMA BD: due to AIE2 BD constraints (a wrap cannot go beyond [0, 1023], the number of dims cannot exceed 4 for shim/memtile DMAs and 3 for tile DMAs, only the highest-dim wrap can repeat, etc.), one logical pack/unpack data movement may need to lower to a finite-state machine consisting of multiple BDs.
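For concreteness, here is an illustrative Python sketch (not the real pass) of the legality check those constraints imply; the limit values are taken from the comment above, and the function name is hypothetical:

```python
# AIE2 BD limits as stated above (illustrative constants).
MAX_WRAP = 1023                                # each wrap must be in [0, 1023]
MAX_DIMS = {"shim": 4, "memtile": 4, "tile": 3}

def fits_single_bd(wraps, tile_kind):
    """True if one logical access pattern fits in a single AIE2 BD;
    otherwise the lowering must split it across multiple BDs."""
    if len(wraps) > MAX_DIMS[tile_kind]:
        return False
    return all(0 <= w <= MAX_WRAP for w in wraps)

fits_single_bd([2, 2, 8, 16], "shim")   # True: 4 dims, small wraps
fits_single_bd([2, 2, 8, 16], "tile")   # False: tile DMAs allow only 3 dims
```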
This makes sense... I think the main idea is to fold the logic of iree-amdaie-decompose-pack-unpack-to-air and what happens in air-copy-to-dma into one operation. They shouldn't need to be split into two parts. @nirvedhmeshram, you and I may need to find some time to look at the IR and go a bit deeper on this issue.
I think unifying iree-amdaie-decompose-pack-unpack-to-air and air-copy-to-dma makes sense to me, now that MLIR-AIR is integrating into IREE. The original rationale behind this separation was that MLIR-AIR could use the memref.transpose/subview/expand_shape ops as a frontend in addition to iree_linalg_ext.pack/unpack.
Here are some unit tests on the iree-amd-aie side for iree-amdaie-decompose-pack-unpack-to-air, and here are some on the mlir-air side for air-copy-to-dma.
@erwei-xilinx could you please explain func1 in the decompose-pack lowering here? I am wondering why we needed an expand_shape rather than keeping the same buffer dimensions from the original pack op. Secondly, when I try to apply the condensation pattern from air-copy-to-dma (note I didn't apply the entire pass) on the resulting IR, I get a crash.
@nirvedhmeshram The memref.expand_shape op expands the shape of the memref, increasing its rank from 4 to 6; the memref.transpose op then performs a permutation, exchanging dim 2 with dim 4. I don't think memref.transpose can permute a rank-4 memref into a rank-6 memref.
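A pure-Python sketch of why both ops are needed (the shapes below are illustrative, not the ones in the test): a shape expansion changes the rank by splitting dims, while a transpose only permutes an existing rank.

```python
def expand_shape(shape, splits):
    """Split selected dims, e.g. dim 2 into (2, 4); raises the rank."""
    out = []
    for i, d in enumerate(shape):
        out.extend(splits.get(i, (d,)))
    return tuple(out)

def transpose(shape, perm):
    """Permute dims; the rank never changes."""
    return tuple(shape[p] for p in perm)

rank4 = (1, 1, 8, 16)
rank6 = expand_shape(rank4, {2: (2, 4), 3: (4, 4)})  # rank 4 -> 6
final = transpose(rank6, (0, 1, 4, 3, 2, 5))         # exchange dims 2 and 4
# rank6 == (1, 1, 2, 4, 4, 4); final == (1, 1, 4, 4, 2, 4)
```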
The output of that test is expected to fail in -air-copy-to-dma, because it is just a snapshot of a series of memref layout transformation ops, where I deleted the "layout" field of the input memref type to memref.expand_shape for simplicity. If you look at func6, which is the complete GEMM example, and search for memref.expand_shape, you will see the original form of that func1 test, and how the memref.expand_shape input inherits its layout from predecessors.
If the chain of memref layout transformations is incomplete, then -air-copy-to-dma cannot successfully fold the memref ops into one memref type. The crash comes from there. We should probably emit an error message there...
@erwei-xilinx I was wondering about something in how the offset propagation is done when we rank-reduce in the current lowering. Let's take this DMA operation post-condensation: https://gist.github.com/nirvedhmeshram/8e7508a34a5d40b55c1a980c6f93496b#file-condensed_dma-mlir-L33
Compare it with the starting point here: https://github.com/nod-ai/iree-amd-aie/blob/main/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/pack_to_air.mlir#L144
Since the offset step can be calculated as size x stride, in the final DMA case the offset step that will be applied is 2x8=16, but in the original pack case the offset step is 1x128=128. So is this technically incorrect, or is it just something our codegen is currently unaffected by because these kinds of offsets are always 0?
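Spelling out the arithmetic in the question (a worked check, using the numbers quoted above):

```python
def offset_step(size, stride):
    """Elements jumped per iteration of the dimension above this one."""
    return size * stride

condensed_dma = offset_step(2, 8)      # 2 x 8   = 16 elements
original_pack = offset_step(1, 128)    # 1 x 128 = 128 elements
# 16 != 128: the mismatch between the two IRs being asked about.
```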
@nirvedhmeshram I think at the moment it's the latter, if I understand the question correctly. The pack lowering to AIR hasn't been exercised extensively with non-zero offsets. Once lowered to AIR we have some examples showing how offsets are propagated.
@erwei-xilinx thanks for clarifying. Also, after sending that message I realized I might not be thinking of offsets correctly: are they just multiplied by the stride to get the starting element, or are they multiplied by the stride AND the size? In either case I think the lowering is not ending up with the correct offset, but understanding the semantics will help us do the correct offset lowering in the new pattern.
Are we talking about subview or the DMA op?
I think we are agreeing that the DMA op is not offsetting by the same number of elements as the subview op in the pack representation in the two IRs I linked.
I was asking about the DMA, but I could not find documentation for subview either, although I would assume the offset is just multiplied by the stride in that case.
@nirvedhmeshram I think it might be useful to capture any offset errors with a unit test, which would help show whether pack is not generating the memref.subview correctly, or the memref.subview is not generating the DMA correctly.
It's hard to describe in chat. Let's chat sometime tomorrow about it.
I have a pattern for pack-to-dma here: https://github.com/nod-ai/iree-amd-aie/pull/178
Here is how the IR looks: https://gist.github.com/nirvedhmeshram/188a0770133a0e03c40d234776a57080. The first thing to verify is whether this is the correct lowering. @erwei-xilinx @stephenneuendorffer @MaheshRavishankar
also @jtuyls @yzhang93
Hi @nirvedhmeshram, I have a couple of questions:
1) In func0, why do we have [%c128, %c16, %c16, %c1] here on the src strides? I would have expected [%c128, %c128, %c16, %c1]. Because the size is 1 for that dimension it shouldn't matter, but it looks weird to me that 16 is generated instead of 128.
air.dma_memcpy_nd (%alloc[%c0, %c0, %c0, %c0] [%c1, %c1, %c8, %c16] [%c128, %c128, %c16, %c1], %alloc_0[%c0, %c0, %c0, %c0] [%c1, %c1, %c8, %c16] [%c128, %c16, %c16, %c1]) : (memref<1x1x8x16xi32, 1>, memref<8x16xi32, 1>)
2) Right now the pack operation seems to be completely offloaded to the src side of the air.dma_memcpy_nd, and the dest is always a linear write? Is this on purpose or for simplicity? This does have some implications, for example exceeding BD dimensionality or running out of BDs more quickly due to the limited dimensionality of the BDs. Also, in general there is a choice to be made here with multiple solutions, and this can affect performance (maybe a consideration for down the road). A different way of converting pack would be to split the transformation across src and dest, for example by handling the tile part on the src side and the transpose part on the dest side. Not sure whether this matters, as it could potentially be rearranged/optimized down the line as well. Any thoughts from you, @erwei-xilinx?
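The design choice in question 2 can be illustrated in Python (a sketch with made-up shapes, not the actual lowering): a 4x4 transpose can carry the strided pattern entirely on the src side with a linear dst, or entirely on the dst side with a linear src, and both move exactly the same (src, dst) element pairs.

```python
from itertools import product

def addrs(sizes, strides):
    """Flat element addresses visited by nested loops over sizes/strides."""
    return [sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(n) for n in sizes))]

# Option A: column-major read on src, linear write on dst.
pairs_a = set(zip(addrs([4, 4], [1, 4]), addrs([16], [1])))
# Option B: linear read on src, column-major write on dst.
pairs_b = set(zip(addrs([16], [1]), addrs([4, 4], [1, 4])))
# pairs_a == pairs_b: same data movement, pattern on opposite sides,
# but each option consumes BD dimensions on a different DMA.
```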
Great observation about the strides, it took me some time to digest and I definitely want to have a discussion about it. So we are setting the strides for accessing the packed layout defined by memref<1x1x8x16>, and let's call these dimensions <AxBxtiledAxtiledB>. Here the interesting thing I learned is that the tiledA dimension is still strided as outer to the B dimension, so the memory layout is:
iterate A (originally stride 16; due to tiling, stride = original stride x tile size = 16x8 = 128)
iterate tiledA (stride = original outer dim (A) stride = 16)
iterate B (originally stride 1; due to tiling, stride = original stride x tile size = 1x16 = 16)
iterate tiledB (stride = original inner dim (B) stride = 1)
so the new strides for <AxBxtiledAxtiledB> = <128x16x16x1>
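That derivation can be written down as a small helper (illustrative Python, not the actual pattern): outer dims get the original stride times the tile size, while inner (tiled) dims keep the original stride.

```python
def packed_strides(orig_strides, tile_sizes):
    """Strides for an <A x B x tiledA x tiledB> packed view of a 2-d
    row-major source. orig_strides = (strideA, strideB)."""
    sa, sb = orig_strides
    ta, tb = tile_sizes
    return (sa * ta, sb * tb, sa, sb)

# The 8x16 -> 1x1x8x16 example in this thread: source strides (16, 1)
# and inner tiles (8, 16) give <128x16x16x1>.
packed_strides((16, 1), (8, 16))  # (128, 16, 16, 1)
```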
@nirvedhmeshram and @jtuyls Re 2: the current pack lowering in AIR generates wraps and strides which operate only on sources, not destinations. Although this doesn't mean that the destination-side DMAs are not doing any data rearrangement (tiling generates memref.subviews on destination-side DMA BDs), there could still be extra hardware benefits to offloading some data rearrangement to the destination side. AIR currently isn't able to do that.
In AIR, pack and unpack operations are not the only operations that generate wraps and strides for DMA BDs. The memref.subviews generated from tiling could also rely on DMA BDs to do data rearrangement. Given that the pack-to-dma lowering is now direct, there may be a need for an additional mechanism to update the wraps and strides if there's an extra memref.subview?
If the subview is consumed by a pack/unpack operation, the direct lowering should fold it in; if there are other consumers of the subview then yes, we potentially need a new lowering.
@nirvedhmeshram I did a quick experiment with a Transform Dialect tiling script, and see that at the moment the memref.subview ops (or tensor.extract_slice) do not get automatically folded into pack. See the code snippet below:
%7 = affine.apply #map1()[%arg2]
%8 = affine.apply #map2()[%arg4]
%subview_4 = memref.subview %alloc[%7, %8] [32, 32] [1, 1] : memref<64x2048xi32, 1> to memref<32x32xi32, strided<[2048, 1], offset: ?>, 1>
%alloc_5 = memref.alloc() : memref<4x8x4x8xi32, 2>
iree_linalg_ext.pack %subview_4 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_5 : (memref<32x32xi32, strided<[2048, 1], offset: ?>, 1> memref<4x8x4x8xi32, 2>)
I think your point makes sense, and it should be legal to fold them into pack ops.
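One way to see what folding the subview in would mean for the lowering: the subview contributes a flat base offset of sum(offset_i * stride_i) into its parent buffer, which the pack-to-dma pattern could add into the DMA's src offsets. A sketch, with illustrative values standing in for the dynamic %7 and %8:

```python
def subview_base_offset(offsets, parent_strides):
    """Flat element offset a memref.subview contributes in its parent."""
    return sum(o * s for o, s in zip(offsets, parent_strides))

# The snippet above takes a 32x32 subview of a 64x2048 buffer
# (row-major strides 2048, 1). Supposing %7 = 32 and %8 = 64:
subview_base_offset((32, 64), (2048, 1))  # 32*2048 + 64 = 65600
```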
@erwei-xilinx I was thinking the folding will happen when doing the pack-to-dma lowering; I already have an example in the IR linked in this message: https://github.com/nod-ai/iree-amd-aie/issues/148#issuecomment-1959952187
I see. I understand now. Thanks!
> Great observation about the strides, it took me some time to digest and definitely want to have a discussion about it. So we are setting the strides for accessing the packed layout defined by memref<1x1x8x16> and let's call these dimensions <AxBxtiledAxtiledB>; here the interesting thing I learned is that the tiledA dimension is still strided as outer to the B dimension so the memory layout is ...
Thanks, this makes sense now for the case where the tiling size of the last dimension == the last dimension size. It would be useful for me, though, to add another example/test like func1 but without outer_dims_perm, to see how the transformation works on a pack with tiling sizes different from the initial sizes but no permutation happening. This could go directly into the PR though.
@jtuyls I have updated the gist to include the case you suggested and also added unpack examples: https://gist.github.com/nirvedhmeshram/188a0770133a0e03c40d234776a57080
Going to do a cleanup, add tests, and mark the PR ready for review soon as well.
Currently the iree_linalg_ext.pack operations which map well to the AIE DMA operations are decomposed to air.dma_memcpy_nd, which doesn't seem to model them as well since it doesn't have a concept of wrap or inner tile. We should add a new op in the air dialect that maps one-to-one: iree_linalg_ext.pack -> air.dma_op -> aie.dma_op