silvasean opened 1 year ago
Makes sense. We shouldn't be fusing that. Instead of https://github.com/openxla/iree/blob/1fd449b7b55f87d335ec67666499bb09aedf10f9/compiler/src/iree/compiler/Dialect/Flow/Transforms/FusionOfTensorOps.cpp#L86, we should fuse into the consumer only if the indexing map in the use is a permutation (apart from the broadcast case).
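To sketch the condition above: fusion into the consumer is profitable when the consumer reads the producer's result through a permutation of the loop dimensions (each producer element computed once) or a pure broadcast (each element reused, not recomputed), but not through a non-injective map. This is a minimal Python stand-in, not the actual MLIR `AffineMap` API — indexing maps are modeled as lists of loop-dimension indices, and the helper names are hypothetical:

```python
def is_permutation(indexing_map, num_loops):
    # A bijection over the loop dims: every loop dim appears exactly once.
    return sorted(indexing_map) == list(range(num_loops))

def is_broadcast(indexing_map, num_loops):
    # A projection with no repeats: fewer dims than loops, each used once,
    # so the producer's elements are reused across the extra loop dims.
    return (len(indexing_map) < num_loops
            and len(set(indexing_map)) == len(indexing_map))

def should_fuse_into_consumer(indexing_map, num_loops):
    # Fuse only in the permutation or broadcast cases; anything else
    # (e.g. a repeated dim) would recompute producer elements.
    return (is_permutation(indexing_map, num_loops)
            or is_broadcast(indexing_map, num_loops))
```

For example, a transposed use `[1, 0]` over two loops fuses, a broadcast use `[0]` over two loops fuses, but a use like `[0, 0]` does not.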
Do you have the input IR? This should be a simple fix.
Yep, the input IR is here: https://gist.github.com/silvasean/e18c111db26699a6f18acc6037d5a00a
This is fixed by #13308, but landing that is blocked by #13189.
Thanks @MaheshRavishankar. That looks like it avoids some of the "undesirable" fusions and improves the runtime to ~600ms from the ~6 seconds it was before. It looks like there is still 20x to go to reach XLA:GPU's 30ms baseline.
I'll be happy to land this... but it is blocked on downstream issues. I have tagged all of them here....
Quick update: still blocked by #13189
The fusion issue is fixed by #13308. The performance issue should be fixed at ToT (by https://github.com/openxla/iree/pull/13468). Please verify and close.
With those fixes, this meso-benchmark now takes 65ms on IREE which is still over 2x off from the 30ms from XLA:GPU, so I think it makes sense to keep this issue open as there is more work to be done to reach parity.
Could we re-title it? Also, I'm dropping myself from the assignee list and moving it to Sean.
Changed title. I want to emphasize that the work to be done here is to optimize it to parity, rather than any particular fix. Next step is probably for me to dive in and do a first-order performance analysis to identify remaining gaps.
TODO: Need to re-evaluate how big of a fraction of the e2e workload this is now after these fixes have landed.
What happened?
The IR below probably has a couple of things going wrong in it that we will need to peel apart piece by piece. This benchmark takes 30ms on XLA:GPU, but on IREE just the einsum dispatch takes over 6 seconds, so the overall computation is >200x slower.
This is extracted from Python code, which might be easier to read for some folks (link). In particular, there is an einsum which seems to be the source of some of the problems (this config uses the "outgoing" einsum equation).
The IR snippet is here, and in particular the einsum is here. If you want to run this snippet, you can use a command line like:
With this flagfile and ir34.linalg.mlir taken from the link above.
The first thing I notice when I look at the slow dispatch that is taking 6 seconds is that it consists of this `linalg.generic`:
This looks like a really poor fusion decision. Looking at the dispatch graph dumped by `--iree-flow-dump-dispatch-graph` (dot), it looks like we have fused various things into the input of the einsum: bias-adds from previous linear layers, sigmoids (the `math.exp`'s), and some elementwise multiplications. This results in significant recomputation.

That's probably enough to get started on working on this. I will post updates as I dig into other aspects here.
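To make the recomputation cost concrete, here is a toy NumPy model (not the actual IR from this issue): an elementwise sigmoid producer feeding a matmul-style einsum. Fusing the producer into the einsum operand effectively recomputes it inside the consumer's reduction loop, multiplying the number of `exp` evaluations by the reduction size times the number of output elements; the shapes and the call counter below are illustrative only:

```python
import numpy as np

calls = {"exp": 0}

def sigmoid(x):
    calls["exp"] += 1  # count how often the "expensive" producer runs
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 3))

# Unfused: materialize the sigmoid once, then feed it to the einsum.
calls["exp"] = 0
a = sigmoid(x)
unfused = np.einsum("ij,jk->ik", a, w)
unfused_calls = calls["exp"]  # producer runs once

# "Fused" (naively): recompute the producer inside each output element's
# reduction, mimicking what fusing elementwise ops into the einsum
# operand does on the GPU.
calls["exp"] = 0
fused = np.empty((4, 3))
for i in range(4):
    for k in range(3):
        fused[i, k] = sum(sigmoid(x[i, j]) * w[j, k] for j in range(8))
fused_calls = calls["exp"]  # producer runs 4 * 3 * 8 = 96 times
```

Both versions compute the same result, but the fused form evaluates the producer 96 times instead of once, which is the kind of blow-up that turns a 30ms workload into seconds.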
Steps to reproduce your issue
See above
What component(s) does this issue relate to?
Compiler
Version information
iree.git @ ab37989652aed11f7f46498c09b9ac515c83eaa3
Additional context
No response