Closed adam-smnk closed 3 months ago
Why is this restricted to max and not just any element-wise op?
I think this could easily be relaxed to all named ops. Processing generics is also doable but a hassle. All in all, I just didn't want to bother with it for now.
Fill folding, together with broadcast folding, finally improves the named ops benchmarks.
The remaining slowdown is caused by one more temporary buffer allocation. It is most likely due to the folded max generic, which is not in in-place form (its outs is a tensor.empty). The next step is to rewrite the max generic the same way we do with linalg-convert-add-in-place for generic adds; see the sketch below.
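For illustration, a rough before/after sketch of that rewrite, assuming the in-place form works like the add case (shapes, value names, and the exact maps are made up; arith.maximumf is assumed as the scalar max op):

```mlir
#map = affine_map<(d0, d1) -> (d0, d1)>

// Current form: outs is a fresh tensor.empty, which bufferizes
// to a new temporary allocation.
%empty = tensor.empty() : tensor<4x8xf32>
%res = linalg.generic {indexing_maps = [#map, #map],
                       iterator_types = ["parallel", "parallel"]}
    ins(%arg0 : tensor<4x8xf32>) outs(%empty : tensor<4x8xf32>) {
^bb0(%in: f32, %o: f32):
  %m = arith.maximumf %in, %cst : f32
  linalg.yield %m : f32
} -> tensor<4x8xf32>

// Hypothetical in-place form: the input becomes the destination,
// so bufferization can update the buffer in place with no new allocation.
%res2 = linalg.generic {indexing_maps = [#map],
                        iterator_types = ["parallel", "parallel"]}
    outs(%arg0 : tensor<4x8xf32>) {
^bb0(%o: f32):
  %m = arith.maximumf %o, %cst : f32
  linalg.yield %m : f32
} -> tensor<4x8xf32>
```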
Looks like the two folders could be merged; it just requires more testing to see whether I missed any edge cases. I'll follow up on that a bit later.
Adds a pattern that folds linalg.fill into linalg.max and emits a combined linalg.generic.
A constant-filled buffer is replaced by a single constant used directly in the max operation on the elements of the other operand. This eliminates a potential temporary buffer allocation and its value initialization.
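A rough sketch of the fold on illustrative IR (shapes, names, and the constant value are hypothetical, not taken from the actual tests):

```mlir
// Before: the constant is materialized into a filled buffer
// that only feeds the max.
%cst = arith.constant 0.0 : f32
%empty = tensor.empty() : tensor<4x8xf32>
%fill = linalg.fill ins(%cst : f32) outs(%empty : tensor<4x8xf32>) -> tensor<4x8xf32>
%res = linalg.max ins(%arg0, %fill : tensor<4x8xf32>, tensor<4x8xf32>)
                  outs(%out : tensor<4x8xf32>) -> tensor<4x8xf32>

// After: the fill is gone; the constant is used directly in the
// payload of a combined linalg.generic.
#map = affine_map<(d0, d1) -> (d0, d1)>
%cst2 = arith.constant 0.0 : f32
%res2 = linalg.generic {indexing_maps = [#map, #map],
                        iterator_types = ["parallel", "parallel"]}
    ins(%arg0 : tensor<4x8xf32>) outs(%out : tensor<4x8xf32>) {
^bb0(%in: f32, %o: f32):
  %m = arith.maximumf %in, %cst2 : f32
  linalg.yield %m : f32
} -> tensor<4x8xf32>
```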