Until now, the compiler created a single tilize block call covering the total number of tiles to be generated. The LLK, however, needs to work on a single row of tiles at a time so that it knows how far to stride.
This fix changes the lowering to the TTMetal dialect so that it emits the correct number of tilize/untilize calls, popping from and pushing to the given CBs in between them.
Once we have the SCF dialect we'll be able to emit loops, but for now these calls are unrolled.
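To illustrate the change, here is a minimal Python sketch of the unrolled call sequence. It assumes the tensor is `rows_in_tiles` x `cols_in_tiles` tiles and that a pop/push pair follows every call, including the last one. The op names mirror tt-metal's compute-kernel calls (`tilize_block`, `untilize_block`, `cb_pop_front`, `cb_push_back`), but the `Op` record, the `unroll_block_calls` helper, and the argument layout are hypothetical simplifications, not the actual TTMetal dialect output.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One emitted operation in a hypothetical, simplified op stream."""
    name: str
    args: tuple


def unroll_block_calls(kind: str, rows_in_tiles: int, cols_in_tiles: int,
                       in_cb: int, out_cb: int) -> list[Op]:
    """Emit one tilize/untilize call per row of tiles (hypothetical helper).

    Previously a single call covered rows_in_tiles * cols_in_tiles tiles;
    here each call covers exactly one row, so the LLK can derive its stride.
    """
    if kind not in ("tilize_block", "untilize_block"):
        raise ValueError(f"unsupported kind: {kind}")
    if rows_in_tiles <= 0 or cols_in_tiles <= 0:
        raise ValueError("shape must be positive in both dimensions")
    ops: list[Op] = []
    # Unrolled: one group of ops per row, since no loops are emitted yet.
    for _ in range(rows_in_tiles):
        # Convert exactly one row of tiles from in_cb into out_cb.
        ops.append(Op(kind, (in_cb, cols_in_tiles, out_cb)))
        # Free the consumed input row and publish the produced output row.
        ops.append(Op("cb_pop_front", (in_cb, cols_in_tiles)))
        ops.append(Op("cb_push_back", (out_cb, cols_in_tiles)))
    return ops


if __name__ == "__main__":
    for op in unroll_block_calls("tilize_block", 2, 4, in_cb=0, out_cb=16):
        print(op.name, op.args)
```

For a 2 x 4 tile tensor this yields two `tilize_block` calls of 4 tiles each, each followed by a pop of 4 tiles from CB 0 and a push of 4 tiles to CB 16, instead of one call over 8 tiles. With SCF available, the Python `for` loop here is what would become a real loop in the IR.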