Abhishek-Varma opened this issue 2 weeks ago
@Abhishek-Varma I've rebased your fix commits on top of Jorn's latest branch, which is ahead of your branch by several important commits. I also fixed another problem in distribute-core-and-objectfifo-pass, and now the lowered IR looks correct to me after this pass. Also, with the change it no longer fails after AccessToAcquireRelease, but fails at a later pass, create-logical-objectfifo-link.
Please refer to this new branch: https://github.com/nod-ai/iree-amd-aie/tree/objectfifo_vectorization And the latest IR dump: matmul_objectfifo_vectorize.txt
Hi @yzhang93 - I didn't get any failure in create-logical-objectfifo-link.
The failure with the https://github.com/nod-ai/iree-amd-aie/tree/objectfifo_vectorization branch occurs at the lower-to-aie pass instead (the final pass in the pipeline).
And that failure turned out to be a red herring; the main issue is in the distribute-core-and-objectfifo pass, caused by the following structure we get from the fold-memref-alias pass :-
%a = memref.subview %alloc
// PROLOGUE
aie.core {
linalg.fill (%cst, %a)
scf.for // vectorized matmul loop nest 1
scf.for // vectorized matmul loop nest 2
scf.for // vectorized matmul loop nest 3
%b = memref.subview %alloc
linalg.generic ... outs(%b)
}
// MAIN
aie.core {
scf.for
scf.for // vectorized matmul loop nest 1
scf.for // vectorized matmul loop nest 2
scf.for // vectorized matmul loop nest 3
%b = memref.subview %alloc
linalg.generic ... outs(%b)
}
// EPILOGUE
aie.core {
scf.for // vectorized matmul loop nest 1
scf.for // vectorized matmul loop nest 2
scf.for // vectorized matmul loop nest 3
%b = memref.subview %alloc
linalg.generic ... outs(%b)
}
%a is memref<2x2x8x8x4x4xi32, 2 : i32> to memref<1x1x8x8x4x4xi32, 2> (only considering shape).
%b is memref<2x2x8x8x4x4xi32, 2 : i32> to memref<1x1x1x1x4x4xi32, 2> (again, only considering shape).
We get the following error message from the lower-to-aie pass because the AIE objectfifo link op fails the verification :-
<unknown>:0: error: Total size of input objFifos in ObjectFifoLinkOp must be equal to size of output objFifo
<unknown>:0: note: see current operation: "aie.objectfifo.link"() <{fifoIns = [@obj6, @obj7, @obj8, @obj9], fifoOuts = [@obj10]}> : () -> ()
And the reason for that is that the current logic in distribute-core-and-objectfifo incorrectly replaces %a (which is memref<1x1x8x8x4x4xi32, 2>) with a new alloc op (which is memref<1x1x1x1x4x4xi32, 2>).
Therefore we need to change the distributeLocalMemory function to accommodate this (CC: @MaheshRavishankar @jtuyls ).
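A quick size check makes the verifier failure concrete (this assumes @obj6-@obj9 are the four per-core output buffers and @obj10 is the shared 2x2 buffer, which is my reading of the error above):
expected:  4 x (1*1*8*8*4*4) = 4 x 1024 = 4096 elements  ==  2*2*8*8*4*4 = 4096 elements
incorrect: 4 x (1*1*1*1*4*4) = 4 x 16   = 64 elements    !=  4096 elements
So once %a is replaced by the smaller alloc, the input objectfifo sizes can no longer add up to the output objectfifo size.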
Here is the gist of the e2e IR's current state : e2e IR vectorization.
I'm hoping that with the above suggested fix we might be able to get through to AIE Dialect.
I tried two ways to solve the problem, but haven't gotten it to work.
I was able to replace %subview = memref.subview %alloc_3[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xi32, 2 : i32> to memref<1x1x8x8x4x4xi32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>, but had a problem with %subview_13 = memref.subview %alloc_3[%arg2, %arg3, %arg5, %arg4, 0, 0] [1, 1, 1, 1, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xi32, 2 : i32> to memref<1x1x1x1x4x4xi32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>.
I think instead of the single subview above, we probably still need two subview ops, i.e., the subview ops as they were before the FoldMemRefAliasOps pass:
%subview_13 = memref.subview %alloc_3[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xi32, 2 : i32> to memref<1x1x8x8x4x4xi32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
%subview_16 = memref.subview %subview_13[0, 0, %arg5, %arg4, 0, 0] [1, 1, 1, 1, 4, 4] [1, 1, 1, 1, 1, 1] : memref<1x1x8x8x4x4xi32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32> to memref<1x1x1x1x4x4xi32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
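For what it's worth, the relationship between the two forms is just offset composition; adding the two subviews' offsets dimension-wise reproduces exactly the folded subview that is causing trouble:
[%arg2, %arg3, 0, 0, 0, 0]  +  [0, 0, %arg5, %arg4, 0, 0]  =  [%arg2, %arg3, %arg5, %arg4, 0, 0]
with the sizes [1, 1, 1, 1, 4, 4] taken from the inner subview.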
So I came to the second way to approach it: change FoldMemRefAliasOps to avoid folding these two subview ops. With that I got the IR dump at https://gist.github.com/yzhang93/48b5a4fb60dda465158f9dabbab14d91, and it now hits a different error when running the DistributeCoresAndObjectFifos pass. I'm not fully sure which way is more reasonable and can eventually fix the problem.
Hi @yzhang93 - so I worked on adding a fix from scratch within the distribute-core-and-objectfifo pass itself and was able to make it work.
There were two other issues afterwards in the same pass :-
i. The objectfifo needs to have tiles attached, using findUsersInCoreAndAddTiles as a helper method.
ii. The other was inserting logical objectfifo accesses for the linalg op's operands, using the insertLogicalObjectFifoAccess method.
I triaged and found that I had already solved those two specific issues in the avarma_matmul_elem branch, but saw that a few parts of the code pertaining to the fixes were deleted/modified as part of your objectfifo_vectorization branch - I'm not sure why though. Could you help explain? Because the fixes I added look good to me and are getting us the expected IR as discussed - so I'm trying to understand the rationale.
I've pushed all changes on branch avarma_matmul_vectorization_objectfifo instead.
Here is the current state of the e2e IR log now after the above fixes : e2e IR matmul + vectorization + objectfifo
CC: @MaheshRavishankar @jtuyls
I triaged and found that I have already solved those two specific issues in avarma_matmul_elem branch but saw that a few part of the code snippets pertaining to the fixes were deleted/modified as part of your objectfifo_vectorization branch - I'm not sure why though.
As discussed offline, with your previous branch it didn't generate the correct IR. There were three problems: 1) the access op doesn't have the correct memory pattern; 2) the inputs don't broadcast correctly; 3) it generates an incorrect allocation as pointed out in the comment above https://github.com/nod-ai/iree-amd-aie/issues/431#issuecomment-2176505316.
My branch objectfifo_vectorization was able to solve the first two issues but not the third one.
Now with your new branch, the third problem has been fixed; however, the first two remain. I'll take another look today to see if we can solve all three problems at the same time.
So @yzhang93 and I discussed this offline.
Context:
The distribute-core-and-objectfifo pass is structured in the following way :-
....
A: distributeLocalMemory (this involves the AllocOp fix)
...
B: insertLogicalObjectFifoAccess (this involves the issue pointed above in `point ii`)
...
C: assignLocalTiles (the one invoking the helper method findUsersInCoreAndAddTiles as mentioned above in `point i`)
...
The structure of the current IR has the alloc op replacements correct - but we also need the following (when comparing with Matmul without vectorization) :-
We also need to have amdaie.logicalobjectfifo.access(%SSA, Read/Write) in the IR, which the pass currently sets all as amdaie.logicalobjectfifo.access(%SSA, None).
We also need to have broadcasted allocs which currently (I just saw while adding this update) we get as :-
%13 = amdaie.logicalobjectfifo.from_memref %alloc, {%tile_17} : memref<1x1x8x4x8x4xi32, 2 : i32> -> !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>
%14 = amdaie.logicalobjectfifo.from_memref %alloc, {%tile_18} : memref<1x1x8x4x8x4xi32, 2 : i32> -> !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>
%15 = amdaie.logicalobjectfifo.from_memref %alloc, {%tile_19} : memref<1x1x8x4x8x4xi32, 2 : i32> -> !amdaie.logicalobjectfifo<memref<1x1x8x4x8x4xi32, 2 : i32>>
%16 = amdaie.logicalobjectfifo.from_memref %alloc, {%tile_20}
In case of Matmul it is %13 = amdaie.logicalobjectfifo.from_memref %alloc, {%tile_17, %tile_18} (@yzhang93 correct me if I'm wrong).
There were two other issues afterwards in the same pass :- i. Objectfifo needs to be attached tiles using findUsersInCoreAndAddTiles as a helper method. ii. The other was for inserting logical objectfifo access for linalg op's operands using insertLogicalObjectFifoAccess method.
I triaged and found that I have already solved those two specific issues in avarma_matmul_elem branch but saw that a few part of the code snippets pertaining to the fixes were deleted/modified as part of your objectfifo_vectorization branch - I'm not sure why though. Could you help explain? Because the fixes I added looks good to me and is getting us the expected IR as discussed - so I'm trying to understand the rationale.
So the fixes which @yzhang93 added were part of B and C to match it with that of the Matmul. But after fixing A it crashed.
Therefore we now need to address B and C in the context of the AllocOps getting replaced.
CC: @MaheshRavishankar @jtuyls
@Abhishek-Varma Based on your new branch, I've fixed the other two issues in the DistributeCoresAndObjectFifos pass. So now the three issues we observed earlier should all be fixed.
Please check the current IR dump here https://gist.github.com/yzhang93/d5cbb97cf2790c6328fd3daef2a34814.
After the fix, now the lowering with vectorization fails at AMDAIELowerToAIE with error
<stdin>:20:16: error: different memory spaces specified for base memref type 'memref<1x1x4x8x4x8xi32, 1>' and subview memref type 'memref<1x1x1x1x4x8xi32, strided<[1024, 1024, 256, 32, 8, 1], offset: ?>, 2 : i32>'
This is simply because AMDAIELowerToAIE doesn't have a function to rewrite the memref.subview ops which are the operands of linalg.generic ops. And now the memory space of the subview ops should be rewritten to 1 according to https://github.com/nod-ai/iree-amd-aie/blob/avarma_matmul_vectorization_objectfifo/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIELowerToAIE.cpp#L147.
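A minimal sketch of what the rewrite could look like (hypothetical value names %buf / %sv_bad / %sv_ok; this only illustrates that the subview result type has to pick up the new memory space of its source):
// Invalid after the source buffer moves to memory space 1 (this is what the verifier reports above):
%sv_bad = memref.subview %buf[0, 0, %i, %j, 0, 0] [1, 1, 1, 1, 4, 8] [1, 1, 1, 1, 1, 1] : memref<1x1x4x8x4x8xi32, 1> to memref<1x1x1x1x4x8xi32, strided<[1024, 1024, 256, 32, 8, 1], offset: ?>, 2 : i32>
// Fixed: recompute the result type from the new source type so the memory spaces agree:
%sv_ok = memref.subview %buf[0, 0, %i, %j, 0, 0] [1, 1, 1, 1, 4, 8] [1, 1, 1, 1, 1, 1] : memref<1x1x4x8x4x8xi32, 1> to memref<1x1x1x1x4x8xi32, strided<[1024, 1024, 256, 32, 8, 1], offset: ?>, 1>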
Could you please help solve this problem today? Thanks.
I've added the fix for handling memref::SubViewOp in lower-to-aie, and the changes are in the avarma_matmul_vectorization_objectfifo branch.
We are now able to get to the final AIE dialect IR.
Here is the current state of the IR : e2e IR Matmul + Vectorization + Objectfifo. I tried skimming through the Matmul without vectorization IR - overall structure looks okay to me.
CC: @MaheshRavishankar @jtuyls @yzhang93
@Abhishek-Varma With your fix in LowerToAIE, it generates bad code as below:
scf.for %arg1 = %c0 to %c8 step %c1 {
scf.for %arg2 = %c0 to %c8 step %c1 {
scf.for %arg3 = %c0 to %c4 step %c1 {
linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%reinterpret_cast_11, %reinterpret_cast_10 : memref<1x1x4x8x4x8xi32, 1>, memref<1x1x8x4x8x4xi32, 1>) outs(%reinterpret_cast : memref<1x1x8x8x4x4xi32, 1>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: i32, %in_14: i32, %out: i32):
%13 = arith.muli %in, %in_14 : i32
%14 = arith.addi %out, %13 : i32
linalg.yield %14 : i32
}
}
}
}
The subview ops were gone and the linalg.generic operands were not using the vectorization instruction sizes. I pushed another commit in the same branch which has fixed the above issue.
Now it can compile and generate a vmfb, but there is a numerical issue. For example, with the inputs
iree-run-module --device=xrt --module=pack_peel.vmfb \
--input=128x256xi32=1 --input=256x128xi32=2 --function=matmul_i32
All the values in the output matrix should be 256, but what I got has zeros every 32 elements.
128x128xi32=[256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...][...]
I've found the bug, and pushed a fix to the avarma_matmul_vectorization_objectfifo branch.
The issue was that when the subview op's source is replaced from memref<2x2x8x8x4x4xi32, 2 : i32> to memref<1x1x8x8x4x4xi32, 2 : i32>, the offsets of the first two dimensions were not properly changed.
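Roughly, the fix amounts to this (hypothetical names %alloc_core / %sv_before / %sv_after; a sketch of my understanding, not the exact code):
// Before the fix: the leading offsets still index as if into the old 2x2 shared layout, which is out of range for the new per-core alloc.
%sv_before = memref.subview %alloc_core[%arg2, %arg3, %arg5, %arg4, 0, 0] [1, 1, 1, 1, 4, 4] [1, 1, 1, 1, 1, 1] : memref<1x1x8x8x4x4xi32, 2 : i32> to memref<1x1x1x1x4x4xi32, strided<[1024, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
// After the fix: the per-core alloc is already the [%arg2, %arg3] slice, so those offsets become 0.
%sv_after = memref.subview %alloc_core[0, 0, %arg5, %arg4, 0, 0] [1, 1, 1, 1, 4, 4] [1, 1, 1, 1, 1, 1] : memref<1x1x8x8x4x4xi32, 2 : i32> to memref<1x1x1x1x4x4xi32, strided<[1024, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>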
After all the fixes, the objectfifo + vectorization pipeline now works for the matmul example and generates correct results.
Great job folks!!! This puts us on a great path! Looking forward to changing all matmuls to go down this path
@Abhishek-Varma and I will continue using this issue as a tracker for debugging and fixing issues for bf16 vectorization.
With nothing changed from the previous branch, it crashed at DistributeCoresAndObjectFifos. But when I looked at the IR, the IR was not generated correctly before that. In short, the current passes don't have support for the vector dialect and its operations.
The code within a core for the i32 type, for example:
%22 = amdaie.core(%tile) {
amdaie.logicalobjectfifo.consume(%19)
amdaie.logicalobjectfifo.consume(%20)
linalg.fill ins(%c0_i32 : i32) outs(%subview : memref<1x1x8x8x4x4xi32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>)
scf.for %arg4 = %c0 to %c8 step %c1 {
scf.for %arg5 = %c0 to %c8 step %c1 {
scf.for %arg6 = %c0 to %c4 step %c1 {
%subview_5 = memref.subview %alloc_0[0, 0, %arg6, %arg4, 0, 0] [1, 1, 1, 1, 4, 8] [1, 1, 1, 1, 1, 1] : memref<1x1x4x8x4x8xi32, 2 : i32> to memref<1x1x1x1x4x8xi32, strided<[1024, 1024, 256, 32, 8, 1], offset: ?>, 2 : i32>
%subview_6 = memref.subview %alloc[0, 0, %arg5, %arg6, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : memref<1x1x8x4x8x4xi32, 2 : i32> to memref<1x1x1x1x8x4xi32, strided<[1024, 1024, 128, 32, 4, 1], offset: ?>, 2 : i32>
%subview_7 = memref.subview %alloc_3[%arg2, %arg3, %arg5, %arg4, 0, 0] [1, 1, 1, 1, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xi32, 2 : i32> to memref<1x1x1x1x4x4xi32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%subview_5, %subview_6 : memref<1x1x1x1x4x8xi32, strided<[1024, 1024, 256, 32, 8, 1], offset: ?>, 2 : i32>, memref<1x1x1x1x8x4xi32, strided<[1024, 1024, 128, 32, 4, 1], offset: ?>, 2 : i32>) outs(%subview_7 : memref<1x1x1x1x4x4xi32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: i32, %in_8: i32, %out: i32):
%23 = arith.muli %in, %in_8 : i32
%24 = arith.addi %out, %23 : i32
linalg.yield %24 : i32
}
}
}
}
amdaie.end
}
Now for bf16 it becomes:
%22 = amdaie.core(%tile) {
amdaie.logicalobjectfifo.consume(%19)
amdaie.logicalobjectfifo.consume(%20)
linalg.fill ins(%cst : bf16) outs(%subview : memref<1x1x16x16x4x4xbf16, strided<[8192, 4096, 256, 16, 4, 1], offset: ?>, 2 : i32>)
scf.for %arg4 = %c0 to %c16 step %c1 {
scf.for %arg5 = %c0 to %c16 step %c1 {
scf.for %arg6 = %c0 to %c8 step %c1 {
%23 = vector.transfer_read %alloc_0[%c0, %c0, %arg6, %arg4, %c0, %c0], %cst {in_bounds = [true, true, true, true, true, true]} : memref<1x1x8x16x4x8xbf16, 2 : i32>, vector<1x1x1x1x4x8xbf16>
%24 = vector.transfer_read %alloc[%c0, %c0, %arg5, %arg6, %c0, %c0], %cst {in_bounds = [true, true, true, true, true, true]} : memref<1x1x16x8x8x4xbf16, 2 : i32>, vector<1x1x1x1x8x4xbf16>
%25 = vector.transfer_read %alloc_3[%arg2, %arg3, %arg5, %arg4, %c0, %c0], %cst {in_bounds = [true, true, true, true, true, true]} : memref<2x2x16x16x4x4xbf16, 2 : i32>, vector<1x1x1x1x4x4xbf16>
%26 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"], kind = #vector.kind<add>} %23, %24, %25 : vector<1x1x1x1x4x8xbf16>, vector<1x1x1x1x8x4xbf16> into vector<1x1x1x1x4x4xbf16>
vector.transfer_write %26, %alloc_3[%arg2, %arg3, %arg5, %arg4, %c0, %c0] {in_bounds = [true, true, true, true, true, true]} : vector<1x1x1x1x4x4xbf16>, memref<2x2x16x16x4x4xbf16, 2 : i32>
}
}
}
amdaie.end
}
I haven't finished adding support for all places that have vector.contract, but @Abhishek-Varma could take my WIP commit as a starting point for adding similar support elsewhere.
Fix commit : Fix distribute-core-and-objectfifo + initial fix to lower-to-aie.
This is on the same branch : avarma_matmul_vectorization_objectfifo.
Current state : e2e IR log
Fixes added :-
i. vector.transfer_write had to be dealt with while distributing local memory - distribute-core-and-objectfifo (see the sketch below).
ii. vector ops for inserting logical objectfifos - this entailed ensuring that the vector.transfer_read and vector.transfer_write from the SAME objectfifo are indeed using the same logicalobjectfifo.access op - distribute-core-and-objectfifo.
iii. bf16 memref types at func.func input arguments - lower-to-aie.
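A sketch of what the first fix does for vector.transfer_write, based on my reading of the description above (hypothetical value names; the buffer types follow the bf16 snippet earlier in the thread):
// Before distribution: the write targets the shared 2x2 buffer with per-core leading indices.
vector.transfer_write %v, %alloc_3[%arg2, %arg3, %arg5, %arg4, %c0, %c0] {in_bounds = [true, true, true, true, true, true]} : vector<1x1x1x1x4x4xbf16>, memref<2x2x16x16x4x4xbf16, 2 : i32>
// After distribution: the write targets the per-core alloc, so the leading indices fold to 0.
vector.transfer_write %v, %alloc_core[%c0, %c0, %arg5, %arg4, %c0, %c0] {in_bounds = [true, true, true, true, true, true]} : vector<1x1x1x1x4x4xbf16>, memref<1x1x16x16x4x4xbf16, 2 : i32>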
CC: @MaheshRavishankar @jtuyls @yzhang93
Current IR state : new e2e IR state
Previous IR state : old e2e IR state
The difference between the two is the func.func (....) part :-
OLD IR
func.func @matmul_i8_i32_dispatch_0_matmul_128x128x256_bf16(%arg0: memref<128x256xi32>,
%arg1: memref<256x128xi32>,
%arg2: memref<128x128xi32>) {
aiex.npu.dma_memcpy_nd(0, 0, %arg0[1, 0, 0, 0][1, 2, 64, 64][1, 16384, 256]) {id = 0 : i64, issue_token = true, metadata = @obj0} : memref<128x256xi32>
aiex.npu.dma_wait {symbol = @obj0}
NEW IR
func.func @matmul_i8_i32_dispatch_0_matmul_128x128x256_bf16(%arg0: memref<128x128xi32>,
%arg1: memref<256x64xi32>,
%arg2: memref<128x64xi32>) {
aiex.npu.dma_memcpy_nd(0, 0, %arg0[1, 0, 0, 0][1, 2, 64, 32][1, 8192, 128]) {id = 0 : i64, issue_token = true, metadata = @obj0} : memref<128x128xi32>
aiex.npu.dma_wait {symbol = @obj0}
Changes done within lower-to-aie :-
i. LHS, RHS and OUTPUT args' last dimension was halved based on the element type bit width.
ii. The sizes/strides/offsets of aie.npu.dma_cpy_nd were halved if not 1 - this should be done ONLY if the element type is bf16 (but currently I'm enforcing it anyway - something to handle later when the work is PR-ready).
Changes are in the same branch : avarma_matmul_vectorization_objectfifo
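To make the halving concrete, here is the arithmetic as I understand it (using the 128x256xbf16 LHS from this thread):
128 x 256 bf16 elements = 128 * 256 * 2 bytes = 65536 bytes = 16384 i32 elements, i.e. memref<128x256xbf16> viewed as memref<128x128xi32>
so the innermost dimension 256 -> 128, and DMA fields expressed in element units are divided by 2 as well (e.g. size 64 -> 32, strides 16384 -> 8192 and 256 -> 128, matching the OLD vs NEW snippets above).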
Tried going through createEmulateNarrowTypes but that seems to be working through the subspan HAL binding ops, something which lower-to-aie erases after binding the input/objectfifos.
CC: @MaheshRavishankar @jtuyls @yzhang93
Hi.
1. I first tried looking into the emulate-narrow-types pass.
a. As @yzhang93 rightly mentioned, it works only for integers, so I added a few changes to deal with floats while upcasting to integer.
b. The pass basically changes ALL memref/vector to the intended upcast data type - but that's NOT what we want (please correct me if I'm wrong).
c. I then switched on just the logic for the HAL binding subspan op - it does create a linearized i32 version of the respective bf16 types, but it didn't work because there's NO way (that I'm aware of) to replace <axbxbf16> with <dxi32>. Based on that, point 1.b above makes sense.
2. I thought of looking into the NON-objectfifo bf16 IR instead - because clearly we aren't dealing with bf16 in AIE codegen for the first time.
a. Here is the bf16 IR for the case which works (non-objectfifo) : e2e IR for non-objectfifo bf16 matmul.
b. Based on that I tried looking into the AIRRtToNpu pass and found how they're adding the upcast after linearizing (CC: @MaheshRavishankar ).
c. I used that logic within lower-to-aie and was able to get the current e2e IR log.
Since the func.func change we want looked okay to me (2.c), I tried running that to generate a .vmfb (wanted to see if it works at all) - it failed at the LowerVectorToAIEVec pass : e2e IR log during .vmfb generation
I can try looking into LowerVectorToAIEVec but I think I might just be chasing a red herring here.
As per my understanding of 1.b - ALL the structure that's NOT control code in the IR will remain the SAME - we just need to update the func.func and the npu.dma_cpy_nd which make up the control code - and this is what 2.b does.
CC: @MaheshRavishankar @jtuyls @yzhang93
- Since the func.func change we want looked okay to me (2.c) I tried running that to generate .vmfb (wanted to see if at all it works) - it failed at LowerVectorToAIEVec pass : e2e IR log during .vmfb generation
- I can try looking into LowerVectorToAIEVec but I think I might just be chasing a red-herring here.
It seems to crash on converting the vector.contract op to an aievec::matmul op. Could you try using an f32 output instead of bf16, as bf16 output is not supported in the aievec::matmul op: https://github.com/Xilinx/mlir-aie/blob/d850560c77799af96c6361a79e34cb0a8e842c50/include/aie/Dialect/AIEVec/IR/AIEVecOps.td#L869
I tried bf16 input and f32 output/accumulator as the input dispatch.
func.func @matmul_bf16_f32(%lhs: tensor<128x256xbf16>, %rhs: tensor<256x128xbf16>) -> tensor<128x128xf32>
{
%cst = arith.constant 0.0 : f32
%0 = tensor.empty() : tensor<128x128xf32>
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<128x128xf32>) -> tensor<128x128xf32>
%res = linalg.matmul ins(%lhs, %rhs: tensor<128x256xbf16>, tensor<256x128xbf16>)
outs(%1: tensor<128x128xf32>) -> tensor<128x128xf32>
return %res : tensor<128x128xf32>
}
It seems to have yet another issue because this is the loop nest after vectorization :-
scf.for %arg5 = %c0 to %c16 step %c1 {
scf.for %arg6 = %c0 to %c16 step %c1 {
scf.for %arg7 = %c0 to %c8 step %c1 {
%27 = vector.transfer_read %alloc_1[%c0, %c0, %arg7, %arg5, %c0, %c0], %cst {in_bounds = [true, true, true, true, true, true]} : memref<1x1x8x16x4x8xbf16, 2 : i32>, vector<1x1x1x1x4x8xbf16>
%28 = vector.transfer_read %alloc[%c0, %c0, %arg6, %arg7, %c0, %c0], %cst {in_bounds = [true, true, true, true, true, true]} : memref<1x1x16x8x8x4xbf16, 2 : i32>, vector<1x1x1x1x8x4xbf16>
%29 = vector.transfer_read %alloc_4[%arg3, %arg4, %arg6, %arg5, %c0, %c0], %cst_0 {in_bounds = [true, true, true, true, true, true]} : memref<2x2x16x16x4x4xf32, 2 : i32>, vector<1x1x1x1x4x4xf32>
%30 = arith.extf %27 : vector<1x1x1x1x4x8xbf16> to vector<1x1x1x1x4x8xf32>
%31 = arith.extf %28 : vector<1x1x1x1x8x4xbf16> to vector<1x1x1x1x8x4xf32>
%32 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"], kind = #vector.kind<add>} %30, %31, %29 : vector<1x1x1x1x4x8xf32>, vector<1x1x1x1x8x4xf32> into vector<1x1x1x1x4x4xf32>
vector.transfer_write %32, %alloc_4[%arg3, %arg4, %arg6, %arg5, %c0, %c0] {in_bounds = [true, true, true, true, true, true]} : vector<1x1x1x1x4x4xf32>, memref<2x2x16x16x4x4xf32, 2 : i32>
}
}
}
So now we have f32 inputs and f32 outputs apparently. Although it crashed in the distribute-core-and-objectfifo pass (and I understand why), this would cause an issue again lower in the stack at the LowerVectorToAIEVec pass, since from AIEVecOps.td it seems to not support even this configuration.
On a further look I see that this is not because of the vectorization pass, because the input linalg.generic to the vectorization part is :-
%19 = linalg.generic ins(%pack_15, %pack_17 : tensor<1x1x8x16x4x8xbf16>, tensor<1x1x16x8x8x4xbf16>)
outs(%18 : tensor<1x1x16x16x4x4xf32>) {
^bb0(%in: bf16, %in_19: bf16, %out: f32):
%20 = arith.extf %in : bf16 to f32
%21 = arith.extf %in_19 : bf16 to f32
%22 = arith.mulf %20, %21 : f32
%23 = arith.addf %out, %22 : f32
linalg.yield %23 : f32
} -> tensor<1x1x16x16x4x4xf32>
So the body of the linalg.generic contains those bf16->f32 conversions of the inputs.
Ouch, any ideas on getting rid of that upcast?
Another temporary pass for starters. :-P But we shouldn't do that.
There are e2e tests in CI that use bf16 inputs and f32 output for non-objectfifo pack-peel (and vectorization is switched on by default). So, I'll need to inspect that e2e IR to comment here more, though.
But just so that we aren't chasing any red herrings here, I take it that the func.func change in this e2e IR (shared in the thread above) looks okay now?
It does the linearization + the adjustments for bf16 (2 bytes) -> i32 (4 bytes) and is linking to the corresponding bf16 objectfifos too.
But just so that we aren't chasing any red-herrings here, I take it that the func.func change in this e2e IR (I shared in the above thread) looks okay now ?
The offsets still look weird:
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 1][1, 2, 64, 64][1, 16384, 256]) {id = 0 : i64, metadata = @obj0} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj0}
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 1][1, 2, 64, 64][1, 64, 128]) {id = 0 : i64, metadata = @obj1} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj1}
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 65][1, 2, 64, 64][1, 16384, 256]) {id = 0 : i64, metadata = @obj0} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj0}
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 8193][1, 2, 64, 64][1, 64, 128]) {id = 0 : i64, metadata = @obj1} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj1}
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 129][1, 2, 64, 64][1, 16384, 256]) {id = 0 : i64, metadata = @obj0} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj0}
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 16385][1, 2, 64, 64][1, 64, 128]) {id = 0 : i64, metadata = @obj1} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj1}
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 193][1, 2, 64, 64][1, 16384, 256]) {id = 0 : i64, metadata = @obj0} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj0}
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 24577][1, 2, 64, 64][1, 64, 128]) {id = 0 : i64, metadata = @obj1} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj1}
aiex.npu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 2][1, 1, 128, 128][1, 1, 128]) {id = 0 : i64, metadata = @obj10} : memref<8192xi32>
How do we get odd values (1/65/8193) etc? It should just be the offset before / 2?
It seems to have yet another issue because this is the loop nest after vectorization (the scf.for nest with vector.transfer_read / arith.extf / vector.contract shown above).
This is not an issue, and AIEVec can handle it correctly. It combines the arith.extf and vector.contract ops and generates an aievec.matmul op with bf16 inputs and f32 output.
You can refer to this test https://github.com/Xilinx/mlir-aie/blob/54efffaa12dd4f0cb3cebbb7dbfa51bf78dc74f8/test/Conversion/VectorToAIEVec/test-contract.mlir#L106
Hi.
- Added fix for arith.extf op due to new accumulator type - distribute-core-and-objectfifo
- After discussing with @jtuyls I've added a fix for the offsets in the current revision - lower-to-aie
- Yesterday's func.func didn't have the size/stride metadata adjusted as well - so I added that too - lower-to-aie
Here's the current func.func :-
func.func @matmul_i8_i32_dispatch_0_matmul_128x128x256_bf16xbf16xf32(%arg0: memref<16384xi32>, %arg1: memref<16384xi32>, %arg2: memref<16384xi32>) {
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][0, 2, 64, 32][1, 8192, 128]) {id = 0 : i64, metadata = @obj0} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj0}
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][0, 2, 64, 32][1, 32, 64]) {id = 0 : i64, metadata = @obj1} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj1}
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 64][0, 2, 64, 32][1, 8192, 128]) {id = 0 : i64, metadata = @obj0} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj0}
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 8192][0, 2, 64, 32][1, 32, 64]) {id = 0 : i64, metadata = @obj1} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj1}
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 128][0, 2, 64, 32][1, 8192, 128]) {id = 0 : i64, metadata = @obj0} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj0}
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 16384][0, 2, 64, 32][1, 32, 64]) {id = 0 : i64, metadata = @obj1} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj1}
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 192][0, 2, 64, 32][1, 8192, 128]) {id = 0 : i64, metadata = @obj0} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj0}
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 24576][0, 2, 64, 32][1, 32, 64]) {id = 0 : i64, metadata = @obj1} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj1}
aiex.npu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 0][0, 0, 128, 64][1, 1, 64]) {id = 0 : i64, metadata = @obj10} : memref<16384xi32>
aiex.npu.dma_wait {symbol = @obj10}
return
}
NOTE: In the above snippet one more thing to observe is the OUTPUT: since we're using an f32 accumulator now, the linearizing logic bailed out (because the bitwidth is 32) and was creating a func.func (<16384xi32>, <16384xi32>, <128x128xf32>), which seemed wrong to me. Therefore I enforced linearization even if the type is 32-bit && is NOT an IntegerType, to get the above snippet.
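For reference, the arithmetic for the output (my own restatement): the 128x128xf32 result is 128 * 128 = 16384 elements, and f32 has the same 32-bit width as i32, so linearization is 1:1 (memref<16384xi32>) and nothing needs to be halved for this argument.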
On trying to generate a .vmfb from the above I got the following in AIEAssignBufferAddresses :-
'aie.tile' op allocated buffers exceeded available memory
Here is the IR log : e2e IR log
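A rough back-of-the-envelope estimate of why this overflows (my own numbers, assuming ~64 KB of data memory per AIE core tile and the per-core buffer shapes from the bf16/f32 snippets above):
LHS slice: 1x1x8x16x4x8 bf16  = 4096 * 2 B =  8 KB
RHS slice: 1x1x16x8x8x4 bf16  = 4096 * 2 B =  8 KB
OUT slice: 1x1x16x16x4x4 f32  = 4096 * 4 B = 16 KB
That is ~32 KB single-buffered; with double-buffered objectfifos (if that is the depth used here) it already reaches the tile's data memory before anything else is allocated, which would explain the failure and why shrinking the level-0 tile/pack sizes below helps.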
Adjusted the tiling/packing size of level 0 to -> <tile_sizes = [[64, 64], .....> and packedSizes = [32, 32, 32] ..., but this failed in AIEObjectFifoStatefulTransform (CC: @jtuyls ).
Here is the IR log of the updated tiling/packing level 0 : e2e IR new tiling/packing
Changes currently in the branch : avarma_test_emulator
CC: @MaheshRavishankar @jtuyls @yzhang93
Adjusted the tiling/packing size of level 0 to -> <tile_sizes = [[64, 64], .....> and packedSizes = [32, 32, 32] ..., but this failed in AIEObjectFifoStatefulTransform (CC: @jtuyls ). Here is the IR log of the updated tiling/packing level 0 : e2e IR new tiling/packing
@Abhishek-Varma I tried to replicate this error by executing the amdaie-objectFifo-stateful-transform pass on the IR right before it and for me the pass succeeded. Could you try running this as well with latest iree-amd-aie?
Command:
${IREE_BUILD_DIR}/tools/iree-opt matmul_vec.mlir --mlir-print-ir-before-all --amdaie-objectFifo-stateful-transform
with matmul_vec.mlir containing the IR dump right before AIEObjectFifoStatefulTransform in the snippet you shared: https://gist.githubusercontent.com/Abhishek-Varma/90fbe66ec4aabb5a3da410885615c5f3/raw/184b9eac151d4c20a54a4a73d30904f29c6fa597/input.mlir
Gist: https://gist.github.com/jtuyls/0d46284d9d3c5780bd298ad7de4d88a3
We already have an e2e Matmul working with the in-flight Objectfifo backend. The current in-flight branch being maintained by @jtuyls is https://github.com/nod-ai/iree-amd-aie/tree/jornt_cpp_pipeline.
Currently we're trying to support the same but with the vectorization switched on. I'm maintaining that on top of @jtuyls 's branch : https://github.com/nod-ai/iree-amd-aie/tree/avarma_matmul_elem
Issues :-
i. insert-core pass - fixed by : Fix of InsertCore for vectorization
ii. An issue in IREEComprehensiveBufferize - but fixed by commenting out --canonicalize
iii. distribute-core-and-objectfifo-pass - fixed by : Fix of DistributeCoreAndObjectFifo for vectorization
The current IR log state : e2e Matmul + Vectorization + Objectfifo