nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Add control code scf.forall to scf.for pass #916

Closed · jtuyls closed 3 days ago

jtuyls commented 3 days ago

Adds a pass that converts control code scf.forall ops into scf.for ops. This enables more loop subsumption opportunities: an scf.forall must be subsumed into a DMA operation either fully or not at all, whereas after conversion to scf.for the loop can be partially subsumed.
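
As a rough illustration of the kind of rewrite this pass performs (the loop bounds and body below are hypothetical, not taken from the pass's tests):

```mlir
// Hypothetical control code before the pass: an scf.forall can only be
// subsumed into a DMA operation as a whole.
scf.forall (%i, %j) in (2, 2) {
  // ... control code operating on (%i, %j) ...
}

// After the pass: an equivalent scf.for nest, where e.g. only the inner
// loop can be folded into a DMA addressing pattern (partial subsumption).
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
scf.for %i = %c0 to %c2 step %c1 {
  scf.for %j = %c0 to %c2 step %c1 {
    // ... same control code operating on (%i, %j) ...
  }
}
```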

With this pass added, the 4096x512x512 matmul on 2x2 lowers to 6532 words of control code, down from 52228. For this configuration with a ukernel, the latency is now 10 ms, compared to 19 ms previously.

jtuyls commented 3 days ago

> Question though: might the order of the scf.for loops matter? For example, if the final test only had affine.apply #map(%l), might it be better to have that for loop be the outermost loop? I'm not sure if there's an easy way to do this analysis and the subsequent loop inversion (and not sure if 'inversion' is the correct name).

Yes, it certainly does matter for different shapes, but I think this should be done at the tiling level instead.
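
For reference, the transformation being asked about is commonly called loop interchange. A minimal sketch of what it would look like here (the bounds and #map are invented for illustration):

```mlir
#map = affine_map<(d0) -> (d0 * 64)>

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c4 = arith.constant 4 : index

// Hypothetical original nest: the loop feeding affine.apply is innermost,
// so the apply is recomputed on every iteration of %k.
scf.for %k = %c0 to %c4 step %c1 {
  scf.for %l = %c0 to %c2 step %c1 {
    %off = affine.apply #map(%l)
    // ... DMA control code using %off ...
  }
}

// After interchange: the same loop is outermost, %off is computed once per
// %l, and the inner %k loop becomes a candidate for DMA subsumption.
scf.for %l = %c0 to %c2 step %c1 {
  %off = affine.apply #map(%l)
  scf.for %k = %c0 to %c4 step %c1 {
    // ... DMA control code using %off ...
  }
}
```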

yzhang93 commented 3 days ago

> > Question though: might the order of the scf.for loops matter? For example, if the final test only had affine.apply #map(%l), might it be better to have that for loop be the outermost loop? I'm not sure if there's an easy way to do this analysis and the subsequent loop inversion (and not sure if 'inversion' is the correct name).
>
> Yes, it certainly does matter for different shapes, but I think this should be done at the tiling level instead.

Yes, we can control the order of the loops when generating the tiling strategy. The question is how to define such a strategy. Previously we were thinking the simple approach would be to compare the M and N sizes and make the loop over the smaller dimension the outer loop. But from the latest results, it looks like 4096x512x512 performs better than 512x4096x512.