Closed yzhang93 closed 1 month ago
I would prefer a separate pass for this. The main reason is to allow for better optimization. The heurisitic "do transpose on side with fewer dims" isn't necessarily best. For example, after canonicalization, the side with fewer dims might change (contiguous dimensions can be eliminated). A secondary reason is to keep ConvertToDma as simple as possible (although this PR doesn't add much complexity).
What I have in mind is a pass
--rebalance-dma-cpy-nd
which basically converts from one form to the other. I haven't thought about how hard that is.
That said, this change is nice and small so I'm happy to accept this.
I would prefer a separate pass for this. The main reason is to allow for better optimization. The heurisitic "do transpose on side with fewer dims" isn't necessarily best. For example, after canonicalization, the side with fewer dims might change (contiguous dimensions can be eliminated). A secondary reason is to keep ConvertToDma as simple as possible (although this PR doesn't add much complexity).
What I have in mind is a pass
--rebalance-dma-cpy-nd
which basically converts from one form to the other. I haven't thought about how hard that is.
The goal of this PR is not trying to balance the dma dimensions. I think there's no reason to keep all transposed dimensions on the source side as the original codes does, because it is just one of the four combinations for dma generation (packTransposeOnSource, packTransposeOnTarget, unpackTransposeOnSource, unpackTransposeOnTarget). I don't see any benefit to convert the dma with all transposed dimensions on source side and then use another pass to make the source side continuous and move the strided pattern on the target side (as what I did previously in https://github.com/nod-ai/iree-amd-aie/pull/792).
The case I have in mind is a dma copy between 2 memrefs, one which is contiguous and one which is not. It might be better then to do the transposes on the side which is contiguous, because the side that is not contiguous will already need DMAs, because the dimensions can't collapse.
I'm confident I can write down examples of packs and unpacks where the transpose must be done on either source or target side to ensure you don't run out of dimensions... and probably cases with a mix where 1 dimension is transposed on source and 1 is on target. I just think that we might need a separate pass with better analysis after this PR lands. But for now, I'm ok with this if it gets us some of the way there.
I'm confident I can write down examples of packs and unpacks where the transpose must be done on either source or target side to ensure you don't run out of dimensions... and probably cases with a mix where 1 dimension is transposed on source and 1 is on target. I just think that we might need a separate pass with better analysis after this PR lands. But for now, I'm ok with this if it gets us some of the way there.
I am not convinced that we should create a separate pass for rebalancing. Ideally, we do it here in pack to DMA conversion as the logic seems the most straightforward at this point. The alternative of a separate pass gets quite complex as @yzhang93 pointed out: https://github.com/nod-ai/iree-amd-aie/pull/792. To accommodate different 'strategies' within this pack to dma conversion pass we can work with different options, like what @yzhang93 did, and I think that will get us quite far already.
Pack/unpack ops change the data layout and thus after converting to dma ops, the dma addressing dimensions are expanded/collapsed and transposed. Previously, all the dimension transpositions are on the source side of dma ops. This PR extends the usage to have an option for transposition happen on the target side.
In applications, we could make choices of transposition on source or target for pack or unpack ops based on performance and hardware dma requirements, etc. The motivation comes from this discussion, and this PR moves the dma optimization logic to an early pass where the dma ops are converted.
Note the default options are not changed in this PR (will enable it in a separate PR with other changes for dma optimization), but I have tested all four combinations locally to make sure the dma generations are correct and work e2e. The change of options can be added for example as