victor-eds closed this issue 3 days ago
To be done in this iteration. Haven't started.
Code ready to push. Evaluating whether this is needed after all. Will push PR or close this issue as won't fix this week.
Not needed: earlier passes (such as the optimize-reduction pass) can be modified so that optimal layouts are propagated.
`-tritonintelgpu-optimize-elementwise-parallelism` has a small limitation: `scf.for` block arguments are not optimized. If a "broadcasted" tensor acts as a block argument, it has a very high impact on register pressure. Optimize `scf.for` block arguments in a similar way to the operands of elementwise operations.
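
To make the register-pressure concern concrete, here is a rough back-of-the-envelope sketch. It is a toy model, not code from the pass: the function name `elements_per_thread`, the `replication` parameter, and all numbers are assumptions chosen for illustration. It assumes a tensor whose layout replicates each element across `replication` threads, so only `num_threads // replication` threads own distinct data.

```python
def elements_per_thread(num_elements: int, num_threads: int, replication: int) -> int:
    """Toy estimate of how many tensor elements each thread must keep live.

    Hypothetical model: `replication` threads hold a copy of every element,
    so the tensor is split across only num_threads // replication owners.
    """
    if replication <= 0 or num_threads % replication != 0:
        raise ValueError("replication must be a positive divisor of num_threads")
    distinct_owners = num_threads // replication
    # Ceiling division: each owner holds its share, rounded up.
    return -(-num_elements // distinct_owners)


# Illustrative numbers only: a 256-element tensor carried through a loop
# by 128 threads.
replicated = elements_per_thread(256, 128, replication=16)  # 32
optimized = elements_per_thread(256, 128, replication=1)    # 2
```

Under these assumed numbers, a loop-carried tensor with 16-way replication keeps 32 values live per thread for the whole loop, versus 2 if it is carried in a layout without replication. Because a block argument stays live across every iteration, the cost is paid for the full duration of the loop rather than for a single operation.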