nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

[vectorization] Apply tiling only if element types vectorizable #476

Closed: newling closed this PR 3 months ago

newling commented 3 months ago

Vectorization in iree-amd-aie consists of 2 passes:

1) Tile linalg.generic ops in all leading dimensions, so that batched matmuls, and matmuls which have been packed into higher dimensions with multiple reduction dimensions, are replaced by unbatched 'atomic' matmuls inside scf.for loops (see the sketch after this list).

2) Lower the 'atomic' matmuls to the vector dialect (vector.contract and other vector dialect casts/copies).
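
As a rough illustration of pass 1, here is a minimal sketch with invented shapes, where a plain linalg.batch_matmul stands in for the packed linalg.generic the pass actually operates on:

```mlir
// Before: a batched matmul (invented shapes).
func.func @before(%A: memref<4x32x64xbf16>, %B: memref<4x64x32xbf16>,
                  %C: memref<4x32x32xf32>) {
  linalg.batch_matmul
      ins(%A, %B : memref<4x32x64xbf16>, memref<4x64x32xbf16>)
      outs(%C : memref<4x32x32xf32>)
  return
}

// After: the leading (batch) dimension is tiled to size 1, leaving an
// unbatched 'atomic' matmul inside an scf.for loop.
func.func @after(%A: memref<4x32x64xbf16>, %B: memref<4x64x32xbf16>,
                 %C: memref<4x32x32xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  scf.for %b = %c0 to %c4 step %c1 {
    %sA = memref.subview %A[%b, 0, 0] [1, 32, 64] [1, 1, 1]
        : memref<4x32x64xbf16> to memref<32x64xbf16, strided<[64, 1], offset: ?>>
    %sB = memref.subview %B[%b, 0, 0] [1, 64, 32] [1, 1, 1]
        : memref<4x64x32xbf16> to memref<64x32xbf16, strided<[32, 1], offset: ?>>
    %sC = memref.subview %C[%b, 0, 0] [1, 32, 32] [1, 1, 1]
        : memref<4x32x32xf32> to memref<32x32xf32, strided<[32, 1], offset: ?>>
    linalg.matmul
        ins(%sA, %sB : memref<32x64xbf16, strided<[64, 1], offset: ?>>,
                       memref<64x32xbf16, strided<[32, 1], offset: ?>>)
        outs(%sC : memref<32x32xf32, strided<[32, 1], offset: ?>>)
  }
  return
}
```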

Before this PR, pass 1 applied to all linalg.generics, irrespective of their element type. However, pass 2 only applies to linalg.generics whose operand element types have hardware support on AIE for vectorization. linalg.generics with other types, such as matmuls with i32 operands and result, are not converted to the vector dialect; they are instead unrolled to scalar code by a later pass (linalg-to-loops).
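
For element types the hardware can vectorize, the result of pass 2 looks roughly like the following minimal sketch (invented 4x8x4 shapes, assuming bf16 operands accumulating into f32 is one of the supported combinations):

```mlir
// Sketch of pass 2: the atomic matmul is rewritten as vector transfers
// plus a mixed-precision vector.contract.
#mapA = affine_map<(d0, d1, d2) -> (d0, d2)>
#mapB = affine_map<(d0, d1, d2) -> (d2, d1)>
#mapC = affine_map<(d0, d1, d2) -> (d0, d1)>

func.func @atomic_matmul(%A: memref<4x8xbf16>, %B: memref<8x4xbf16>,
                         %C: memref<4x4xf32>) {
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : bf16
  %padf = arith.constant 0.0 : f32
  // Read the operands and the accumulator into vectors.
  %va = vector.transfer_read %A[%c0, %c0], %pad {in_bounds = [true, true]}
      : memref<4x8xbf16>, vector<4x8xbf16>
  %vb = vector.transfer_read %B[%c0, %c0], %pad {in_bounds = [true, true]}
      : memref<8x4xbf16>, vector<8x4xbf16>
  %vc = vector.transfer_read %C[%c0, %c0], %padf {in_bounds = [true, true]}
      : memref<4x4xf32>, vector<4x4xf32>
  // The matmul itself becomes a vector.contract.
  %r = vector.contract {
      indexing_maps = [#mapA, #mapB, #mapC],
      iterator_types = ["parallel", "parallel", "reduction"],
      kind = #vector.kind<add>}
      %va, %vb, %vc
      : vector<4x8xbf16>, vector<8x4xbf16> into vector<4x4xf32>
  vector.transfer_write %r, %C[%c0, %c0] {in_bounds = [true, true]}
      : vector<4x4xf32>, memref<4x4xf32>
  return
}
```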

So, before this PR, pass 1 might transform linalg.generics in preparation for vectorization even though pass 2 then leaves them unvectorized. This is not incorrect, but it changes the IR unnecessarily. More importantly, there has been a request to make pass 1 more conservative and leave unvectorizable ops alone, to make a later object-fifo-related pass easier (@yzhang93 @Abhishek-Varma).

That is what this PR does: it makes the loop tiling apply only when the element types are vectorizable on AIE.

newling commented 3 months ago

> LGTM. Any idea what is failing?

I've been trying to understand the failure for 30 minutes with no success; it's very mysterious to me.

newling commented 3 months ago

If this is supported in llvm-aie (peano), https://github.com/Xilinx/llvm-aie/issues/102, then I think this path will work.

nirvedhmeshram commented 3 months ago

> If this is supported in llvm-aie (peano) Xilinx/llvm-aie#102 then I think this path will work.

I think we should still look into, from our side, why we have <2 x s32> when we were expecting to generate scalar code. Do we know where that happens? If not, @newling, could you please produce the IR dump with the standard *-dump-ir-after-all flags? In addition, we may also need to pass the --print-after-all flag to opt over here, if it's not our MLIR lowering doing it.

newling commented 3 months ago

> I think we should still look into, from our side, why we have <2 x s32> when we were expecting to generate scalar code. Do we know where that happens?

I think it's because linalg-to-loops only scalarizes the reduction dimensions of the packed linalg.generic.

So insert-loops-for-vectorization, before this PR, was scalarizing (i.e. tiling to size 1) the outer m- and n-dimensions, but with this PR that no longer happens. See below (left is before this PR, right is after this PR).

[image: IR comparison, left = before this PR, right = after this PR]

newling commented 3 months ago

I take that back: it is being completely scalarized both before and after this PR. It's just that, before, there is an intermediate memref.subview.
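
To make the difference concrete, here is a minimal sketch of the two fully scalarized forms (invented shapes, with a simple copy standing in for the scalarized matmul body):

```mlir
// Before this PR: the outer dimensions were first tiled to size 1, so the
// scalar accesses go through an intermediate rank-reduced memref.subview.
func.func @before(%src: memref<4x4xi32>, %dst: memref<4x4xi32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  scf.for %i = %c0 to %c4 step %c1 {
    %row = memref.subview %src[%i, 0] [1, 4] [1, 1]
        : memref<4x4xi32> to memref<4xi32, strided<[1], offset: ?>>
    scf.for %j = %c0 to %c4 step %c1 {
      %v = memref.load %row[%j] : memref<4xi32, strided<[1], offset: ?>>
      memref.store %v, %dst[%i, %j] : memref<4x4xi32>
    }
  }
  return
}

// After this PR: linalg-to-loops scalarizes the op directly, indexing the
// original memref with no intermediate memref.subview.
func.func @after(%src: memref<4x4xi32>, %dst: memref<4x4xi32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  scf.for %i = %c0 to %c4 step %c1 {
    scf.for %j = %c0 to %c4 step %c1 {
      %v = memref.load %src[%i, %j] : memref<4x4xi32>
      memref.store %v, %dst[%i, %j] : memref<4x4xi32>
    }
  }
  return
}
```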

newling commented 3 months ago

One more data point: this issue only arises with matmul transpose. Transpose vs non-transpose below:

[image: IR for the matmul-transpose case vs the matmul case]

(The peano error arises in the matmul-transpose case, not the matmul case).

newling commented 3 months ago

Just to summarize my view of the situation: