victor-eds closed this 3 days ago
WIP. Should be ready soonish.
Code ready to push. Evaluating whether this is needed after all. This week I will either push a PR or close this issue as won't fix.
Not needed. The elementwise (EW) optimization pass isn't needed, as we can modify earlier passes so that optimal layouts are propagated.
`-tritonintelgpu-optimize-elementwise-parallelism` introduces "unbroadcast" layout conversions. These should be total NOPs, as they simply involve dropping some values held by multiple threads, e.g., going from:
to:
Note that for a `tensor<16xf32>`, this would mean going from:
To:
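As a rough illustration of why an "unbroadcast" conversion should need no data movement, here is a toy Python model. It is not Triton code: the thread count, the two thread-to-element mappings, and all function names are hypothetical, chosen only to show the idea of dropping duplicated values for a 16-element tensor. They are not the actual layouts this issue refers to.

```python
# Toy model, NOT Triton code. A "layout" maps each thread id to the list of
# tensor element indices whose values that thread holds in registers.
# All sizes and mappings here are assumptions made for illustration.

NUM_THREADS = 32  # assumed number of threads
NUM_ELEMS = 16    # models a tensor<16xf32>


def broadcast_layout():
    """Hypothetical source layout: thread t holds element t % 16,
    so every element is duplicated across two threads."""
    return {t: [t % NUM_ELEMS] for t in range(NUM_THREADS)}


def unbroadcast_layout():
    """Hypothetical destination layout: threads 0..15 hold one element
    each; threads 16..31 hold nothing."""
    return {t: ([t] if t < NUM_ELEMS else []) for t in range(NUM_THREADS)}


def is_nop_conversion(src, dst):
    """True if every thread already holds, in src, all the elements it
    must hold in dst, i.e. no cross-thread data movement is required."""
    return all(set(dst[t]) <= set(src[t]) for t in dst)


def convert_locally(registers, dst):
    """Apply the conversion using only thread-local reads: each thread
    keeps the values it needs and drops the rest."""
    return {t: {e: registers[t][e] for e in dst[t]} for t in dst}


data = [float(i) for i in range(NUM_ELEMS)]
src, dst = broadcast_layout(), unbroadcast_layout()

# Per-thread register contents under the source layout.
src_regs = {t: {e: data[e] for e in src[t]} for t in src}
dst_regs = convert_locally(src_regs, dst)
```

In this model, going from the broadcast layout to the unbroadcast one only discards duplicates, so `is_nop_conversion(src, dst)` holds. The opposite direction does not, because threads 16..31 would need values they do not hold.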