nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0
45 stars 23 forks source link

[AMD-AIE] Deprecate distribute-core-and-objectfifo pass #452

Closed Abhishek-Varma closed 1 week ago

Abhishek-Varma commented 1 week ago

This PR is an attempt to deprecate --iree-amdaie-distribute-core-and-objectfifo pass.

Reason: A lot of IR modification takes place within the same pass in an atomic way. All the individual phases deserve their own set of passes. Therefore we'd ideally like to have a clean distinction between the individual phases that reflects in mlir-print-ir-before/after-all that helps us to debug and by that extension develop/maintain the passes relatively better.

With this pass we will have the following new passes now (each in their separate commit along with incremental changes made to eventually deprecate iree-amdaie-distribute-core-and-objectfifo pass) :-

  1. iree-amdaie-local-forall-to-for
  2. iree-amdaie-distribute-local-memory
  3. iree-amdaie-unroll-local-loops
  4. iree-amdaie-insert-logical-objectfifo-access
  5. iree-amdaie-assign-tile-locations
  6. iree-amdaie-distribute-shared-memory

Besides the above, changes were made to matmul_peeled_objectfifo.mlir lit test to incorporate above set of passes (in-order) instead.

Also, changes were made to CMakeLists.txt of Transforms and Transforms/test - this just helps any of us to not have to nit-pick on adding new passes/lit test files every time - just create a new pass and the lit test - it should work. (I currently have added this as 2 separate commits but after consensus will squash into one).

NOTE for reviewers:

  1. It'd be nice to review the lit tests of each passes.
  2. I haven't added a "descriptive" commit message for each pass' commit - will do that towards the end when we have consensus on each pass' name.
  3. Based on point 2 - I've not yet arranged the passes within Passes.td in alphabetical order.
  4. The above changes collectively have been tested on the e2e Matmul to reach to AIE dialect successfully - will test the .vmfb after addressing comments on this PR and get in the other fixes for pure Matmul + Vectorization + Objectfifo incrementally (each fix in their own separate PR).
Abhishek-Varma commented 1 week ago

I don't think we should pull this pass apart as the IR in between each of those stages isn't stable, for example doing a cse in between would break it. These new passes are really only useful if executed back-to-back in the exact order iree-amdaie-distribute-core-and-objectfifo had, which I think demonstrates that they are coupled too tightly to be pulled apart.

You're correct - but I don't think we would want to invoke a --cse between these individual passes though while we simultaneously work out a better way to rewrite the pass as well as carry out fixes in the individual passes.

Indeed the six passes, for now, will have to be invoked back-to-back (as demonstrated in the matmul_peeled_objectfifo.mlir as well as I tried the e2e via the WIP C++ pipeline locally).

The main motive is to help making debugging relatively easier - especially w.r.t iree-amdaie-distribute-local-memory, iree-amdaie-insert-logical-objectfifo-access and iree-amdaie-assign-tile-locations - because currently any crash at the bigger pass basically means manually looking at intermediate IRs (the --debug is quite noisy therefore the main goal was to help with the mlir-print-ir-before/after-all as we shouldn't have to look at unrolling of the loops, etc that's taking place within distribute-core-and-objectfifo).

The iree-amdaie-distribute-core-and-objectfifo is indeed overly complicated but I think this is mainly caused by the global memref usage across (possibly) peeled stages of the computation and the need to keep this global memref information around while also distributing some of them on both local and shared memory (different cores and memtiles). Once we figure out how to best do that, I think the pass can be split into (roughly):

  1. Unroll local for loops
  2. Distribute memrefs
  3. Assign AIE tile locations

Since we plan on segregating distribute-core-and-objectfifo eventually, I believe we can perhaps use this PR to stage it well as it isolates the phases involved (albeit tightly coupled) and then work our way back to make it concise to be just the three set of passes you mentioned ?

Let me know your thoughts.

jtuyls commented 1 week ago

The main motive is to help making debugging relatively easier ... (the --debug is quite noisy therefore the main goal was to help with the mlir-print-ir-before/after-all as we shouldn't have to look at unrolling of the loops, etc that's taking place within distribute-core-and-objectfifo

For non-trivial changes/debugging you will have to look at multiple of these stages and how they influence each other as they are tightly coupled so I don't think separating the passes will help much and imo it's not a reason enough to pull the pass apart. For example seeing an issue in iree-amdaie-distribute-shared-memory likely means having to change/fix the transformations above.

Since we plan on segregating distribute-core-and-objectfifo eventually, I believe we can perhaps use this PR to stage it well as it isolates the phases involved (albeit tightly coupled) and then work our way back to make it concise to be just the three set of passes you mentioned ?

I don't see how pulling the pass apart will help with the rewrite as I think the rewrite should go in an entirely different direction and I hope most logic of these passes will just have to be deleted. I think natural segregation happens when the intermediate IR is stable and we should not do it before we get there.

Abhishek-Varma commented 1 week ago

@jtuyls and I had a discussion offline.

So since the current pass is stable, pulling it out currently would make it unstable w.r.t the intermediate IRs as the IR at each stage is dependent on the previous stage.

The pass will definitely need a rewrite though and only then it should be pulled out into the three passes as Jorn mentioned in the above thread.

I'm closing this PR.

yzhang93 commented 1 week ago

@jtuyls and I had a discussion offline.

So since the current pass is stable, pulling it out currently would make it unstable w.r.t the intermediate IRs as the IR at each stage is dependent on the previous stage.

The pass will definitely need a rewrite though and only then it should be pulled out into the three passes as Jorn mentioned in the above thread.

I'm closing this PR.

Yeah, it makes sense to me. Sorry, I clicked the wrong button and reopened the PR.

Can you start to send a PR for the fix we had for vectorization? I think we can first get in the changes for InsertCores and DistributeCores passes.