yzhang93 closed this 2 weeks ago
Currently 5 bf16 CI tests failed because of `[AIE ERROR] _XAie_LoadProgMemSection():231: Overflow of program memory.`
All i32 and i8 tests passed.
@Abhishek-Varma Did you see a similar error? How did you solve it?
Yes, I'm seeing this issue for larger-shape Matmul + Truncf bf16.
From what I could speculate yesterday, in the Matmul + Truncf case it seemed to be due to vector unrolling of `arith.truncf`. But in this CI I see that it's now happening for Batch Matmul too.
Trying to play with Peano flags today to see if it helps.
Since both of us are facing the same issue, I tried looking into the CI failure (the current speculation was that `arith.truncf`'s vector unrolling is the culprit, so I wanted to check what the delta here is, in case it helps solve both of our cases).
I compared the e2e IR for the case with the PM (program memory) issue against the one that doesn't use this patch, to see what the delta is.
The only difference (besides the new tile size) that I could see was in the total iteration count of the MAIN section's `scf.for`: for the case with the PM issue the total iteration count is 2, whereas for the one without it, it is 6.
And due to this, `AMDAIEAcquireReleaseToUseLock`, when trying to unroll, seems to change the IR structure from:

```
core {
  // PROLOGUE
  vector.contract
  // MAIN
  scf.for
    vector.contract
  // EPILOGUE
  vector.contract
}
```

to:

```
core {
  // PROLOGUE
  vector.contract
  // MAIN
  vector.contract
  vector.contract
  // EPILOGUE
  vector.contract
}
```
But I don't think this is the cause of the PM exhaustion, going by my attempt to understand the same issue for Truncf as well.
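For intuition on why a short trip count can still blow up program memory: LLVM-style unrollers typically fully unroll a loop only when the unrolled size (body size times trip count) fits under a size threshold, so a trip-count-2 loop with a large body may get flattened into straight-line code while a trip-count-6 loop stays rolled. The numbers below are made up purely for illustration; this is not Peano's actual cost model:

```shell
# Toy model (hypothetical numbers, NOT Peano's real heuristic):
# a full unroll duplicates the loop body trip-count times, so the
# unrolled copies all land in program memory.
body_size=120   # assumed instruction count of one vector.contract body
threshold=300   # assumed full-unroll size threshold
for trip in 2 6; do
  total=$((body_size * trip))
  if [ "$total" -le "$threshold" ]; then
    echo "trip=$trip: fully unrolled, inlined size $total"
  else
    echo "trip=$trip: kept as a loop, code size stays $body_size"
  fi
done
```

Under these assumed numbers the trip-count-2 loop gets fully unrolled (every copy occupies program memory) while the trip-count-6 loop stays rolled.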
I tried changing the Peano flags too:

```
-O2 \
--inline-threshold=10 \
-S input.ll \
--disable-builtin=memset \
-o input.opt.ll \
-vectorize-loops=false \
-vectorize-slp=false \
--two-entry-phi-node-folding-threshold=10 \
-mandatory-inlining-before-opt=false \
-basic-aa-full-phi-analysis=true \
-basic-aa-max-lookup-search-depth=10
```
but none of them seemed to help with either this or the Truncf PM issue. :/
I even tried re-looking at the changes done in https://github.com/nod-ai/iree-amd-aie/pull/822 (in case any of that was a culprit common to us), but those changes don't seem to touch at least this PR's delta.
@jtuyls suggested adding `--disable-loop-unrolling` to the Peano flags, and it worked!
Here is the e2e IR of one of the failing tests with your patch - I verified the output too.
It also solved the PM issue for Truncf!
Thanks, Jorn!
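For reference, a sketch of where the fix goes: it simply appends `--disable-loop-unrolling` to the flag set quoted earlier. The `opt` driver name and file names here are assumptions; adjust for your local Peano setup:

```shell
# Sketch: the earlier Peano opt flag set with --disable-loop-unrolling
# appended. "opt" stands in for Peano's LLVM opt driver (assumption).
opt -O2 \
  --inline-threshold=10 \
  -S input.ll \
  --disable-builtin=memset \
  -o input.opt.ll \
  -vectorize-loops=false \
  -vectorize-slp=false \
  --two-entry-phi-node-folding-threshold=10 \
  -mandatory-inlining-before-opt=false \
  -basic-aa-full-phi-analysis=true \
  -basic-aa-max-lookup-search-depth=10 \
  --disable-loop-unrolling
```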
@Abhishek-Varma Thanks for looking into the issue! So what's the way forward? Do we disable that flag in Peano, or add some optimizations?
Maybe we can check whether it has any performance impact? At some point this might matter a lot, but I'm not sure it does at this point. We could make it an optional flag for now? And potentially we can unroll/partially unroll on our side?
Unfortunately, it does have a performance impact. For a large size such as 4096x4096x2048, the total execution time doubles when loop unrolling is disabled in Peano :( (The comparison is without any of the tile-size changes in this PR.)
@Abhishek-Varma After rebasing on your outlining patches the PM issues are solved. Thanks for that.
However, now all the ukernel tests fail. Please take a look and let me know if we'd want to keep the original tile sizes for ukernels.
@yzhang93 With the `mm.cc` delta in https://github.com/nod-ai/iree-amd-aie/pull/878 (basically the `mm.cc` changes added on top of this PR), the compilation failure in the CI is resolved, but now the `.vmfb` generates incorrect results. It'd be nice to have this working with the new L1 sizes themselves if it's an issue of incorrect results. CC @erwei-xilinx, could you please help take a look? Previously we were generating a block of `32 x 32 x 32` (M x N x K), but now we're generating `32 x 32 x 64`. I looked into the raw ukernel, but the constraints it expects seem to be satisfied.
NOTE: The incorrect result is produced even without this commit on the PR branch I shared.
I added a comment in https://github.com/nod-ai/iree-amd-aie/pull/878. I don't know if that's the reason for this numerical failure.
Thank you @erwei-xilinx ! The numerical issue is resolved.
@yzhang93 Can you please cherry-pick the `mm.cc` delta from the attached PR link into this PR's branch? This should unblock you.
Thanks @Abhishek-Varma and @erwei-xilinx!
Increased the K tile size to make more use of L1 memory. Note this solution is temporary; I'm thinking about a more sophisticated way to select tile sizes, which will need a major refactor.