nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

[vectorization] Apply tiling only if element types vectorizable #476

Closed: newling closed this PR 3 months ago

newling commented 3 months ago

Vectorization in iree-amd-aie consists of 2 passes:

1) Tile linalg.generic ops in all leading dimensions, so that batched matmuls, and matmuls which have been packed into higher dimensions with multiple reduction dimensions, are replaced by unbatched 'atomic' matmuls inside scf.for loops (see the sketch after this list).

2) Lower the 'atomic' matmuls to the vector dialect (vector.contract and other vector dialect casts/copies).
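
As a rough illustration of pass 1, here is a minimal sketch with invented shapes, where a plain linalg.batch_matmul stands in for the packed linalg.generic the pass actually operates on:

```mlir
// Before: a batched matmul (invented shapes).
func.func @before(%A: memref<4x32x64xbf16>, %B: memref<4x64x32xbf16>,
                  %C: memref<4x32x32xf32>) {
  linalg.batch_matmul
      ins(%A, %B : memref<4x32x64xbf16>, memref<4x64x32xbf16>)
      outs(%C : memref<4x32x32xf32>)
  return
}

// After: the leading (batch) dimension is tiled to size 1, leaving an
// unbatched 'atomic' matmul inside an scf.for loop.
func.func @after(%A: memref<4x32x64xbf16>, %B: memref<4x64x32xbf16>,
                 %C: memref<4x32x32xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  scf.for %b = %c0 to %c4 step %c1 {
    %sA = memref.subview %A[%b, 0, 0] [1, 32, 64] [1, 1, 1]
        : memref<4x32x64xbf16> to memref<32x64xbf16, strided<[64, 1], offset: ?>>
    %sB = memref.subview %B[%b, 0, 0] [1, 64, 32] [1, 1, 1]
        : memref<4x64x32xbf16> to memref<64x32xbf16, strided<[32, 1], offset: ?>>
    %sC = memref.subview %C[%b, 0, 0] [1, 32, 32] [1, 1, 1]
        : memref<4x32x32xf32> to memref<32x32xf32, strided<[32, 1], offset: ?>>
    linalg.matmul
        ins(%sA, %sB : memref<32x64xbf16, strided<[64, 1], offset: ?>>,
                       memref<64x32xbf16, strided<[32, 1], offset: ?>>)
        outs(%sC : memref<32x32xf32, strided<[32, 1], offset: ?>>)
  }
  return
}
```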

Before this PR, pass 1 applied to all linalg.generics, irrespective of their element type. However, pass 2 only applies to linalg.generics whose operand element types have hardware support on AIE for vectorization. linalg.generics with other types, such as matmuls with i32 operands and result, are not converted to the vector dialect; they are instead unrolled to scalar code by a later pass (linalg-to-loops).
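
For element types the hardware can vectorize, the result of pass 2 looks roughly like the following minimal sketch (invented 4x8x4 shapes, assuming bf16 operands accumulating into f32 is one of the supported combinations):

```mlir
// Sketch of pass 2: the atomic matmul is rewritten as vector transfers
// plus a mixed-precision vector.contract.
#mapA = affine_map<(d0, d1, d2) -> (d0, d2)>
#mapB = affine_map<(d0, d1, d2) -> (d2, d1)>
#mapC = affine_map<(d0, d1, d2) -> (d0, d1)>

func.func @atomic_matmul(%A: memref<4x8xbf16>, %B: memref<8x4xbf16>,
                         %C: memref<4x4xf32>) {
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : bf16
  %padf = arith.constant 0.0 : f32
  // Read the operands and the accumulator into vectors.
  %va = vector.transfer_read %A[%c0, %c0], %pad {in_bounds = [true, true]}
      : memref<4x8xbf16>, vector<4x8xbf16>
  %vb = vector.transfer_read %B[%c0, %c0], %pad {in_bounds = [true, true]}
      : memref<8x4xbf16>, vector<8x4xbf16>
  %vc = vector.transfer_read %C[%c0, %c0], %padf {in_bounds = [true, true]}
      : memref<4x4xf32>, vector<4x4xf32>
  // The matmul itself becomes a vector.contract.
  %r = vector.contract {
      indexing_maps = [#mapA, #mapB, #mapC],
      iterator_types = ["parallel", "parallel", "reduction"],
      kind = #vector.kind<add>}
      %va, %vb, %vc
      : vector<4x8xbf16>, vector<8x4xbf16> into vector<4x4xf32>
  vector.transfer_write %r, %C[%c0, %c0] {in_bounds = [true, true]}
      : vector<4x4xf32>, memref<4x4xf32>
  return
}
```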

So, before this PR, pass 1 might transform linalg.generics in preparation for vectorization even though pass 2 then leaves them unvectorized. This is not incorrect, but it changes the IR unnecessarily. More importantly, there has been a request to make pass 1 more conservative and leave unvectorizable ops alone, to make a later object-fifo-related pass easier (@yzhang93 @Abhishek-Varma).

That is what this PR does: it makes the loop tiling apply only when the element types are vectorizable on AIE.

newling commented 3 months ago

> LGTM. Any idea what is failing?

I've been trying to understand the failure for 30 minutes with no success; it's very mysterious to me.

newling commented 3 months ago

If this is supported in llvm-aie (peano), https://github.com/Xilinx/llvm-aie/issues/102, then I think this path will work.

nirvedhmeshram commented 3 months ago

> If this is supported in llvm-aie (peano) Xilinx/llvm-aie#102 then I think this path will work.

I think we should still look into, from our side, why we have <2 x s32> when we were expecting to generate scalar code. Do we know where that happens? If not, @newling, could you please produce the IR dump with the standard *-dump-ir-after-all flags? In addition, we may also need to pass the --print-after-all flag to opt over here, if it's not our MLIR lowering doing it.

newling commented 3 months ago

> I think we should still look into, from our side, why we have <2 x s32> when we were expecting to generate scalar code. Do we know where that happens?

I think it's because linalg-to-loops only scalarizes the reduction dimensions of the packed linalg.generic.

So insert-loops-for-vectorization, before this PR, was scalarizing (i.e. tiling to size 1) the outer m- and n-dimensions, but with this PR that no longer happens. See below (left is before this PR, right is after this PR).

[image: IR comparison, left = before this PR, right = after this PR]

newling commented 3 months ago

I take that back: it is being completely scalarized both before and after this PR. It's just that, before, there is an intermediate memref.subview.
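
To make the difference concrete, here is a minimal sketch of the two fully scalarized forms (invented shapes, with a simple copy standing in for the scalarized matmul body):

```mlir
// Before this PR: the outer dimensions were first tiled to size 1, so the
// scalar accesses go through an intermediate rank-reduced memref.subview.
func.func @before(%src: memref<4x4xi32>, %dst: memref<4x4xi32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  scf.for %i = %c0 to %c4 step %c1 {
    %row = memref.subview %src[%i, 0] [1, 4] [1, 1]
        : memref<4x4xi32> to memref<4xi32, strided<[1], offset: ?>>
    scf.for %j = %c0 to %c4 step %c1 {
      %v = memref.load %row[%j] : memref<4xi32, strided<[1], offset: ?>>
      memref.store %v, %dst[%i, %j] : memref<4x4xi32>
    }
  }
  return
}

// After this PR: linalg-to-loops scalarizes the op directly, indexing the
// original memref with no intermediate memref.subview.
func.func @after(%src: memref<4x4xi32>, %dst: memref<4x4xi32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  scf.for %i = %c0 to %c4 step %c1 {
    scf.for %j = %c0 to %c4 step %c1 {
      %v = memref.load %src[%i, %j] : memref<4x4xi32>
      memref.store %v, %dst[%i, %j] : memref<4x4xi32>
    }
  }
  return
}
```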

newling commented 3 months ago

One more data point: this issue only arises with matmul transpose. Transpose vs non-transpose below:

[image: IR for the matmul-transpose case vs the matmul case]

(The peano error arises in the matmul-transpose case, not the matmul case).

newling commented 3 months ago

Just to summarize my view of the situation: