nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Increase the K tile size in L1 for matmul ops #846

Closed yzhang93 closed 2 weeks ago

yzhang93 commented 1 month ago

Increase the K tile size to make better use of L1 memory. Note this solution is temporary; I'm thinking about a more sophisticated way to select tile sizes, which will need a major refactor.
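
For reference, a minimal sketch of the kind of L1 budget check such a selection could be based on; all names and numbers below are illustrative assumptions, not the actual pass logic:

    #include <cstdint>

    // Pick the largest power-of-two K tile whose A, B and C tiles still fit in
    // the L1 budget, assuming fixed M/N tile sizes and double buffering for the
    // two inputs (assumptions for illustration only).
    int64_t largestKTile(int64_t l1Bytes, int64_t mTile, int64_t nTile,
                         int64_t inBytes /*2 for bf16*/, int64_t outBytes /*4 for f32*/) {
      int64_t best = 0;
      for (int64_t k = 16; k <= 4096; k *= 2) {
        int64_t aBytes = 2 * mTile * k * inBytes;   // A tile, double buffered
        int64_t bBytes = 2 * k * nTile * inBytes;   // B tile, double buffered
        int64_t cBytes = mTile * nTile * outBytes;  // accumulator tile
        if (aBytes + bBytes + cBytes <= l1Bytes)
          best = k;
      }
      return best;
    }

With, say, a 64 KiB L1 and 32x32 M/N tiles in bf16 with f32 accumulation, this would allow a K tile of 128 under these assumed buffering requirements.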

yzhang93 commented 1 month ago

Currently 5 bf16 CI tests fail because of [AIE ERROR] _XAie_LoadProgMemSection():231: Overflow of program memory. All i32 and i8 tests pass.

@Abhishek-Varma Did you see a similar error? How did you solve it?

Abhishek-Varma commented 1 month ago

> Currently 5 bf16 CI tests fail because of [AIE ERROR] _XAie_LoadProgMemSection():231: Overflow of program memory. All i32 and i8 tests pass.
>
> @Abhishek-Varma Did you see a similar error? How did you solve it?

Yes, I'm seeing this issue for larger-shape Matmul + Truncf bf16.

From what I could speculate yesterday, in the Matmul + Truncf case it seemed to be due to vector unrolling of arith.truncf. But in this CI I see that it's now happening for Batch Matmul too. I'll try playing with the Peano flags today to see if that helps.

Abhishek-Varma commented 1 month ago

Since both of us are facing the same issue, I tried looking into the CI failure (the current speculation is that arith.truncf's vector unrolling is the culprit, so I wanted to check what the delta here is in case it helps solve both of our cases).

I compared the e2e IR for the case with the PM (program memory) issue against the one that doesn't use this patch, to see what the delta is.

The only difference (besides the new tile sizes) that I could see was in the total iteration count of the MAIN section's scf.for: for the case with the PM issue the total iteration count is 2, whereas for the one without it the count is 6. Because of this, AMDAIEAcquireReleaseToUseLock, when it tries to unroll, changes the IR structure from:

core {
    // PROLOGUE
       vector.contract 
    // MAIN
    scf.for 
         vector.contract
    // EPILOGUE
       vector.contract
}

to:

core {
    // PROLOGUE
       vector.contract 
    // MAIN
       vector.contract
       vector.contract
    // EPILOGUE
       vector.contract
}
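
For intuition, here is a hypothetical set of numbers that would produce exactly those trip counts (the actual shapes in this CI may differ): with a per-core reduction size of K = 256 and the L1 K tile going from 32 to 64, the number of K slices drops from 8 to 4:

    K = 256, L1 K tile = 32  ->  8 slices = prologue + 6 MAIN iterations + epilogue
    K = 256, L1 K tile = 64  ->  4 slices = prologue + 2 MAIN iterations + epilogue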

But I don't think this is the cause of the PM exhaustion, especially when trying to understand the same issue for Truncf as well.

I tried changing the Peano flags too:

    -O2 \
    --inline-threshold=10 \
    -S input.ll \
    --disable-builtin=memset \
    -o input.opt.ll \
    -vectorize-loops=false \
    -vectorize-slp=false \
    --two-entry-phi-node-folding-threshold=10 \
    -mandatory-inlining-before-opt=false \
    -basic-aa-full-phi-analysis=true \
    -basic-aa-max-lookup-search-depth=10

but none of them helped with either this issue or the Truncf PM issue. :/

I even re-checked the changes made in https://github.com/nod-ai/iree-amd-aie/pull/822 (in case any of them were a culprit common to both of us), but those changes don't seem to touch this PR's delta at least.

Abhishek-Varma commented 1 month ago

@jtuyls suggested adding --disable-loop-unrolling to the Peano flags and it worked!

Here is the e2e IR of one of the tests that was failing with your patch; I verified the output too.

It also solved the PM issue for Truncf!

Thanks Jorn!
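
For reference, presumably this amounts to appending the flag to the opt invocation listed above:

    -O2 \
    --inline-threshold=10 \
    -S input.ll \
    --disable-builtin=memset \
    -o input.opt.ll \
    -vectorize-loops=false \
    -vectorize-slp=false \
    --two-entry-phi-node-folding-threshold=10 \
    -mandatory-inlining-before-opt=false \
    -basic-aa-full-phi-analysis=true \
    -basic-aa-max-lookup-search-depth=10 \
    --disable-loop-unrolling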

yzhang93 commented 1 month ago

> @jtuyls suggested adding --disable-loop-unrolling to the Peano flags and it worked!
>
> Here is the e2e IR of one of the tests that was failing with your patch; I verified the output too.
>
> It also solved the PM issue for Truncf!
>
> Thanks Jorn!

@Abhishek-Varma Thanks for looking into the issue! So what's the way forward? Do we disable that flag in Peano or add some optimizations?

jtuyls commented 1 month ago

> @Abhishek-Varma Thanks for looking into the issue! So what's the way forward? Do we disable that flag in Peano or add some optimizations?

Maybe we can check whether it has any performance impact? At some point this might matter a lot, but I'm not sure it does yet. We could make it an optional flag for now? And potentially we can unroll/partially unroll on our side?
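
As a rough illustration of the "unroll/partially unroll on our side" idea (a generic sketch, not an existing pass in this repo): unrolling by a small fixed factor bounds code growth to that factor, whereas full unrolling emits one copy of the body per iteration and can blow past program memory.

    // body(i) stands in for the work of one K slice (e.g. one vector.contract).
    void runKLoop(int tripCount, void (*body)(int)) {
      constexpr int kUnrollFactor = 2;  // assumed, tunable
      int i = 0;
      // Unrolled main loop: kUnrollFactor copies of the body per iteration.
      for (; i + kUnrollFactor <= tripCount; i += kUnrollFactor) {
        body(i);
        body(i + 1);
      }
      // Residual iterations when tripCount is not a multiple of the factor.
      for (; i < tripCount; ++i)
        body(i);
    }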

yzhang93 commented 1 month ago

>> @Abhishek-Varma Thanks for looking into the issue! So what's the way forward? Do we disable that flag in Peano or add some optimizations?
>
> Maybe we can check whether it has any performance impact? At some point this might matter a lot, but I'm not sure it does yet. We could make it an optional flag for now? And potentially we can unroll/partially unroll on our side?

Unfortunately, it does have a performance impact. For a large size such as 4096x4096x2048, the total execution time doubles when loop unrolling is disabled in Peano :( (The comparison is without any of the tile size changes in this PR.)

yzhang93 commented 2 weeks ago

@Abhishek-Varma After rebasing on your outlining patches, the PM issues are solved. Thanks for that.

However, now all the ukernel tests fail. Please take a look and let me know if we should keep the original tile sizes for ukernels.

Abhishek-Varma commented 2 weeks ago

> @Abhishek-Varma After rebasing on your outlining patches, the PM issues are solved. Thanks for that.
>
> However, now all the ukernel tests fail. Please take a look and let me know if we should keep the original tile sizes for ukernels.

@yzhang93 With the mm.cc delta in https://github.com/nod-ai/iree-amd-aie/pull/878 (basically the mm.cc changes added on top of this PR), the compilation failure in the CI is resolved, but now the .vmfb generates incorrect results. It would be nice to have this working with the new L1 sizes if it is only an issue of incorrect results.

CC: @erwei-xilinx, could you please help take a look? Previously we were generating a block of 32 x 32 x 32 (M x N x K), but now we're generating 32 x 32 x 64. I tried looking into the raw ukernel and the constraints it expects seem to be satisfied.

NOTE: The incorrect result is produced even without this commit on the PR branch I shared.
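
For context on the block-shape change, here is a hypothetical sketch of the kind of compile-time constraints an AIE matmul ukernel template might place on its block shape (this is not the actual mm.cc; the divisibility values are assumptions):

    // One ukernel block computes C[M x N] += A[M x K] * B[K x N]; with the new
    // L1 tiling the block goes from M=32, N=32, K=32 to M=32, N=32, K=64.
    template <typename Tin, int M, int N, int K>
    void matmul_block(const Tin *a, const Tin *b, float *c) {
      static_assert(M % 4 == 0 && N % 4 == 0 && K % 8 == 0,
                    "block shape must be a multiple of the vector intrinsic "
                    "shape (assumed values, not the real mm.cc constraints)");
      // Reference (scalar) semantics of one block.
      for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
          for (int k = 0; k < K; ++k)
            c[m * N + n] += static_cast<float>(a[m * K + k]) *
                            static_cast<float>(b[k * N + n]);
    }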

erwei-xilinx commented 2 weeks ago

> @yzhang93 With the mm.cc delta in #878 (basically the mm.cc changes added on top of this PR), the compilation failure in the CI is resolved, but now the .vmfb generates incorrect results. It would be nice to have this working with the new L1 sizes if it is only an issue of incorrect results. CC: @erwei-xilinx, could you please help take a look? Previously we were generating a block of 32 x 32 x 32 (M x N x K), but now we're generating 32 x 32 x 64. I tried looking into the raw ukernel and the constraints it expects seem to be satisfied. NOTE: The incorrect result is produced even without this commit on the PR branch I shared.

I added a comment in https://github.com/nod-ai/iree-amd-aie/pull/878. I don't know if that's the reason for the numerical failure.

Abhishek-Varma commented 2 weeks ago

>> @yzhang93 With the mm.cc delta in #878 (basically the mm.cc changes added on top of this PR), the compilation failure in the CI is resolved, but now the .vmfb generates incorrect results. It would be nice to have this working with the new L1 sizes if it is only an issue of incorrect results. CC: @erwei-xilinx, could you please help take a look? Previously we were generating a block of 32 x 32 x 32 (M x N x K), but now we're generating 32 x 32 x 64. I tried looking into the raw ukernel and the constraints it expects seem to be satisfied. NOTE: The incorrect result is produced even without this commit on the PR branch I shared.
>
> I added a comment in #878. I don't know if that's the reason for the numerical failure.

Thank you @erwei-xilinx! The numerical issue is resolved.

@yzhang93 Can you please cherry-pick the mm.cc delta from the attached PR link into this PR's branch? That should unblock you.

yzhang93 commented 2 weeks ago

>>> @yzhang93 With the mm.cc delta in #878 (basically the mm.cc changes added on top of this PR), the compilation failure in the CI is resolved, but now the .vmfb generates incorrect results. It would be nice to have this working with the new L1 sizes if it is only an issue of incorrect results. CC: @erwei-xilinx, could you please help take a look? Previously we were generating a block of 32 x 32 x 32 (M x N x K), but now we're generating 32 x 32 x 64. I tried looking into the raw ukernel and the constraints it expects seem to be satisfied. NOTE: The incorrect result is produced even without this commit on the PR branch I shared.
>>
>> I added a comment in #878. I don't know if that's the reason for the numerical failure.
>
> Thank you @erwei-xilinx! The numerical issue is resolved.
>
> @yzhang93 Can you please cherry-pick the mm.cc delta from the attached PR link into this PR's branch? That should unblock you.

Thanks @Abhishek-Varma and @erwei-xilinx!