Closed newling closed 3 days ago
This is actually pretty simple to do, and it currently compiles fine (if I don't try and unpeel the epilogue). But there is numerical error, which I haven't managed to understand yet.
> This is pretty simple because it's hardcoded to a specific case:
> - Looking for prolog - main - epilog. What if peeled 2, 3, 4 times etc?
Easily extends to that case.
> - Assumes the prolog, main and epilog are the same ops, that can be unpeeled. This is not generic as it doesn't prove that the generic op before the main is actually the prolog of the main (accesses could be different from main). The same issue exists for the epilog.
I am aware of these issues, there is no need to shoot this down while it is still a proof-of-concept. There are tools in MLIR to check for block equivalence (which Abhishek's function outlining already uses). FWIW the current function outlining isn't general either; it's hardcoded for matmul.
> Also, could you argue why we need this? We discussed this before and we said we didn't need it as it would also not really solve the PM issue, just make it a slightly smaller problem and there are other simpler and more robust ways to tackle the PM issues.
> See https://github.com/nod-ai/iree-amd-aie/pull/899#discussion_r1841016406
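For anyone skimming, a toy model of the prolog / main / epilog structure under discussion, and of what "unpeeling" it back into a single loop means, might look like this. This is a Python/NumPy sketch standing in for the actual MLIR; the tile count and shapes are invented for illustration:

```python
import numpy as np

def peeled(A, B):
    # Peeled form: the prolog is the first reduction tile, fused with the
    # zero-fill (the accumulator is initialized from its result); then a
    # main loop over the middle tiles; then a peeled epilog for the last.
    k = A.shape[1] // 4                              # 4 tiles, illustrative
    C = A[:, 0:k] @ B[0:k, :]                        # prolog (fill fused)
    for t in range(1, 3):                            # main loop
        C += A[:, t * k:(t + 1) * k] @ B[t * k:(t + 1) * k, :]
    C += A[:, 3 * k:4 * k] @ B[3 * k:4 * k, :]       # epilog
    return C

def unpeeled(A, B):
    # Unpeeled form: an explicit zero-fill and a single loop whose trip
    # count covers prolog + main + epilog iterations.
    k = A.shape[1] // 4
    C = np.zeros((A.shape[0], B.shape[1]))
    for t in range(4):
        C += A[:, t * k:(t + 1) * k] @ B[t * k:(t + 1) * k, :]
    return C
```

Both forms compute the same reduction over the same tiles, so up to floating-point reassociation the two programs should agree.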
> - Assumes the prolog, main and epilog are the same ops, that can be unpeeled. This is not generic as it doesn't prove that the generic op before the main is actually the prolog of the main (accesses could be different from main). The same issue exists for the epilog.
> I am aware of these issues, there is no need to shoot this down while it is still a proof-of-concept. There are tools in MLIR to check for block equivalence (which Abhishek's function outlining already uses). FWIW the current function outlining isn't general either; it's hardcoded for matmul.
I think this is a conceptual issue so worth bringing up before we spend too much time on this. If the purpose of the POC is to measure the PM reduction potential, then maybe, but that's not what's described in this PR. If it's meant to actually be merged as a pass, it needs to be really generic or it will lead to all kinds of issues. I think making this fully generic is not worth the effort, certainly not at this time, because I don't see much benefit, see below:
> Also, could you argue why we need this? We discussed this before and we said we didn't need it as it would also not really solve the PM issue, just make it a slightly smaller problem and there are other simpler and more robust ways to tackle the PM issues.
> See #899 (comment)
I am assuming you are referring to:
> Finally I should mention what the advantage of this approach is over function outlining: there will be no performance hit caused by function calls.
I think this is a limited benefit at this point and I don't think the function call overhead will soon be the bottleneck. One data point on that is that the ukernel approach (which has function calls) performs significantly better than the vectorization approach w/o function calls.
> If the purpose of the POC is to measure the PM reduction potential
It's a POC in the sense that people said in the initial meeting where we discussed this that it couldn't be done, because of something to do with not being able to fuse the fill. But it can be done, that's what this POC is a "proof" of.
> If it's meant to actually be merged as a pass, it needs to be really generic or it will lead to all kinds of issues.
I'm sorry but you are applying a double standard here -- many of the passes in the tiling pipeline are designed only to work together for a peeled matmul. But that's secondary -- I've explicitly stated in the summary that this will be generalized once people get the basic idea of what the pass does.
> I think making this fully generic is not worth the effort, certainly not at this time, because I don't see much benefit, see below:
The benefit is the same as function outlining (do you agree?). So if by "not much benefit" you mean over and above function outlining, then yes you are right, definitely not additive benefits. The motivation for trying this again was the apparent overhead of function outlining.
> I think this is a limited benefit at this point and I don't think the function call overhead will soon be the bottleneck. One data point on that is that the ukernel approach (which has function calls) performs significantly better than the vectorization approach w/o function calls.
This is a good point. I wonder why the function call overhead of ukernel is lower than with our function outlining. All I can think -- do we definitely know that we've used ukernel with multiple call sites, and seen good performance?
> It's a POC in the sense that people said in the initial meeting where we discussed this that it couldn't be done, because of something to do with not being able to fuse the fill. But it can be done, that's what this POC is a "proof" of.
People in the initial meeting were thinking of doing this in general, and that it is complicated/hard (no one said impossible) and not worth the effort. This PR doesn't prove that it is easy, because it makes very big assumptions so that it becomes a simple problem.
> I'm sorry but you are applying a double standard here -- many of the passes in the tiling pipeline are designed only to work together for a peeled matmul. But that's secondary -- I've explicitly stated in the summary that this will be generalized once people get the basic idea of what the pass does.
Well, please list those passes that can only work on a peeled matmul and will break otherwise.
This is not the same. There are passes that only work in limited circumstances; however, those should check for, and only be activated in, exactly those circumstances. The logic in this PR, on the contrary, does no such check and will produce incorrect output for IR it doesn't expect. That's a very different issue, not a double standard.
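For what it's worth, the kind of guard being asked for (only fire when the op in front of the loop really is a peeled copy of the loop body) can be caricatured in a few lines of Python. In real MLIR this would be done with the operation/block equivalence utilities mentioned earlier in the thread, not the hypothetical tuple comparison below:

```python
# Toy "IR": an op is modeled as (opcode, operand_count). A real check would
# compare full op structure and access patterns, not just these two fields.
def looks_like_peeled_copy(candidate, body_op):
    opcode_a, arity_a = candidate
    opcode_b, arity_b = body_op
    return opcode_a == opcode_b and arity_a == arity_b

def try_unpeel(pre_loop_op, loop_body_op):
    # Bail out (return None) instead of transforming when the op before
    # the loop is not structurally a peeled iteration of the loop body;
    # this is what keeps the pass from emitting incorrect IR.
    if not looks_like_peeled_copy(pre_loop_op, loop_body_op):
        return None
    return ("unpeeled_loop", loop_body_op)
```

The point of the sketch is only the bail-out path: an unexpected op in front of the loop leaves the IR untouched rather than being silently folded in.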
> The benefit is the same as function outlining (do you agree?)
For peeled matmul, more or less (unpeeling probably has slightly less code, although minimal).
But not necessarily in general, as function outlining can have more benefit for, for example, peeled matmul + relu + matmul.
> So if by "not much benefit" you mean over and above function outlining, then yes you are right, definitely not additive benefits. The motivation for trying this again was the apparent overhead of function outlining.
This is what's most important. Since function outlining is much simpler, we shouldn't need to look at unpeeling anytime soon. Again, fine if it's just a POC to measure/try something, but not useful imo to keep looking at.
> This is a good point. I wonder why the function call overhead of ukernel is lower than with our function outlining. All I can think -- do we definitely know that we've used ukernel with multiple call sites, and seen good performance?
There is a table in the doc that compares the ukernel approach with vectorization, showing a significant difference. Unless you think the ukernel is inlined after all, which it shouldn't be, because 1) there is no `inline` on those functions, and 2) it's compiled separately through chess, so I don't see how that would work. Also, going by prior work, there shouldn't really be that much overhead from function calls, and that's really the last of the last mile.
Ok, still a few points I don't fully agree with, but I do totally agree that if function calls from outlining can have negligible overhead, then that's what we should work towards. And so I'll stop debating inconsequential things!
> There is a table in the doc that compares the ukernel approach with vectorization
Cool, I'll get the ukernel into 'Performance benchmarks' in CI (https://github.com/nod-ai/iree-amd-aie/actions/runs/11864873206/job/33069601465?pr=906)
Closing for now: if we can eliminate the overhead of the function call with function outlining, then this pass is redundant.
If you're looking at this I recommend taking a look at the .mlir file -- it should be clear what I'm trying to do in this pass, which matmul I'm trying to fuse into the loop.
This is a WIP, something more rigorous is definitely needed!
Before unpeeling: before_unpeel.txt
After unpeeling: after_unpeel.txt
IR (module scope) before all with unpeeling: with_unpeel_before_all.txt
All that changes is that the very first matmul, which is before the loop, goes into the loop (and the loop count goes up).