Hi @newling !
Thank you so much for this amazing find!
I've addressed the comments in https://github.com/nod-ai/iree-amd-aie/commits/avarma_james_func_outline/:
- 8ccc95196 does the selective enabling of function outlining.
- 3b72c54ef updates the lit tests for insert-loops-for-vectorization.

Both commits can be cherry-picked onto this PR's branch newling:regression_fixes (I couldn't do it myself as I don't have access to it). It would be nice to get the bug fix from https://github.com/nod-ai/iree-amd-aie/pull/888 in first, though.
NOTE: For the selective function outlining, I considered adding a pass flag, but that didn't seem like the right thing to do: given an arbitrary dispatch, it isn't generic enough. Since outlining is invoked after bufferization, the use-def chain is not easy to traverse because everything is just a memref, so `getDefiningOp()` is rendered useless (a minimal sketch of the problem follows below). I even considered adding a change in KernelDispatch to figure this out, but that turned out to be quite involved.
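To illustrate why `getDefiningOp()` doesn't help here, a minimal hypothetical sketch (the shapes and function name are made up, not taken from the actual dispatch): after bufferization, ops write into buffers in place and produce no SSA results, so a consumer's operand only leads back to the allocation, not to the producing computation.

```mlir
#map = affine_map<(d0, d1) -> (d0, d1)>
func.func @matmul_truncf(%a: memref<32x32xf32>, %b: memref<32x32xf32>,
                         %out: memref<32x32xbf16>) {
  %acc = memref.alloc() : memref<32x32xf32>
  // The matmul writes its result into %acc in place; it has no SSA result.
  linalg.matmul ins(%a, %b : memref<32x32xf32>, memref<32x32xf32>)
                outs(%acc : memref<32x32xf32>)
  // The truncf consumer only sees the buffer %acc: getDefiningOp() on %acc
  // returns the memref.alloc, not the matmul, so the producer cannot be
  // recovered by walking the use-def chain.
  linalg.generic {indexing_maps = [#map, #map],
                  iterator_types = ["parallel", "parallel"]}
      ins(%acc : memref<32x32xf32>) outs(%out : memref<32x32xbf16>) {
  ^bb0(%in: f32, %o: bf16):
    %t = arith.truncf %in : f32 to bf16
    linalg.yield %t : bf16
  }
  return
}
```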
Therefore I'm just being explicit about the approach I've taken for this selective outlining of "Matmul + Truncf"; you may re-enable it in the pipeline as it was earlier.
@Abhishek-Varma Maybe now is a good time to generalize the logic for truncf to any unary elementwise operation?
@Abhishek-Varma and can you please make a new standalone PR for https://github.com/nod-ai/iree-amd-aie/commit/3b72c54ef654d631f9aa8d087b3d5cf768ee87e0 ?
This PR has 2 changes:
(1) remove coalescing in the vectorization pass;
(2) do not perform outlining.
The coalescing and outlining that (1) and (2) remove were introduced in https://github.com/nod-ai/iree-amd-aie/pull/856.
The effects of these 2 changes, together and individually, on the current Top of Main (ToM) for a matmul with m=n=512, k=4096:

| Configuration | Time [ms] |
| --- | --- |
| ToM | 49 |
| ToM - (1) | 39 |
| ToM - (2) | 41 |
| ToM - (1) - (2) | 17 |
This PR needs work: some lit tests need to be updated, and I think Abhishek mentioned that he wanted to keep outlining enabled for matmul+truncf (?)
What's up with coalescing causing a regression?
Here's my interpretation: coalescing the loops results in the single coalesced index being decomposed back into per-loop indices by a single affine op.
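Something like the following minimal sketch (hypothetical, assuming `affine.delinearize_index` with made-up basis sizes):

```mlir
func.func @delinearize(%iv: index) -> (index, index, index) {
  %c4 = arith.constant 4 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  // The single coalesced induction variable %iv is split back into three
  // loop indices; these correspond to the %43#0, %43#1, %43#2 mentioned below.
  %43:3 = affine.delinearize_index %iv into (%c4, %c8, %c16) : index, index, index
  return %43#0, %43#1, %43#2 : index, index, index
}
```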
With %43#0, %43#1, and %43#2 being used as offsets in the `vector.transfer_read` ops. This affine op eventually gets expanded by `lower-affine` into about 20 arith ops, which later lower to scalar LLVM ops. The full compute block is included below. This is the biggest difference I can see that the change to vectorization in https://github.com/nod-ai/iree-amd-aie/pull/856 introduces.
What's up with outlining causing a regression?
This is pretty obvious, I guess: function call overhead.
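For concreteness, a rough hypothetical sketch of the shape of the issue (the callee name, shapes, and trip count are made up): the outlined compute is invoked through a `func.call` inside a hot loop, so the call overhead is paid on every iteration, and the callee body is opaque to optimization across the call boundary.

```mlir
func.func private @outlined_matmul(memref<4x8xbf16>, memref<8x4xbf16>, memref<4x4xf32>)

func.func @caller(%lhs: memref<4x8xbf16>, %rhs: memref<8x4xbf16>,
                  %acc: memref<4x4xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c64 = arith.constant 64 : index
  scf.for %i = %c0 to %c64 step %c1 {
    // Per-iteration call overhead that an inlined body would not pay.
    func.call @outlined_matmul(%lhs, %rhs, %acc)
        : (memref<4x8xbf16>, memref<8x4xbf16>, memref<4x4xf32>) -> ()
  }
  return
}
```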