bjacob opened this issue 1 year ago
In the print-ir-after-all log from the above gist, the problem seems to arise at this point:
// -----// IR Dump After GenericVectorization (iree-codegen-generic-vectorization) //----- //
func.func @matmul_accumulate_1x1xf16_times_1x1xf16_into_1x1xf32_dispatch_3_mmt4d_1x1x1x1x1x1_f16xf16xf32() {
%cst = arith.constant 0.000000e+00 : f32
%cst_0 = arith.constant 0.000000e+00 : f16
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%c64 = arith.constant 64 : index
%c128 = arith.constant 128 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x1x1x1xf16>>
%1 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c64) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x1x1x1xf16>>
%2 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c128) : !flow.dispatch.tensor<readwrite:tensor<1x1x1x1xf32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
scf.for %arg0 = %workgroup_id_y to %c1 step %workgroup_count_y {
scf.for %arg1 = %workgroup_id_x to %c1 step %workgroup_count_x {
%3 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0, 0, 0], sizes = [1, 1, 1, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x1x1x1xf16>> -> tensor<1x1x1x1xf16>
%4 = flow.dispatch.tensor.load %1, offsets = [%arg1, 0, 0, 0], sizes = [1, 1, 1, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x1x1x1xf16>> -> tensor<1x1x1x1xf16>
%5 = flow.dispatch.tensor.load %2, offsets = [%arg0, %arg1, 0, 0], sizes = [1, 1, 1, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readwrite:tensor<1x1x1x1xf32>> -> tensor<1x1x1x1xf32>
%6 = vector.transfer_read %3[%c0, %c0, %c0, %c0], %cst_0 {in_bounds = [true, true, true, true]} : tensor<1x1x1x1xf16>, vector<1x1x1x1xf16>
%7 = vector.transfer_read %4[%c0, %c0, %c0, %c0], %cst_0 {in_bounds = [true, true, true, true]} : tensor<1x1x1x1xf16>, vector<1x1x1x1xf16>
%8 = vector.transfer_read %5[%c0, %c0, %c0, %c0], %cst {in_bounds = [true, true, true, true]} : tensor<1x1x1x1xf32>, vector<1x1x1x1xf32>
%9 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d2, d3, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d1, d2, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d3, d4)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction"], kind = #vector.kind<add>} %6, %7, %8 : vector<1x1x1x1xf16>, vector<1x1x1x1xf16> into vector<1x1x1x1xf32>
%10 = vector.transfer_write %9, %5[%c0, %c0, %c0, %c0] {in_bounds = [true, true, true, true]} : vector<1x1x1x1xf32>, tensor<1x1x1x1xf32>
flow.dispatch.tensor.store %10, %2, offsets = [%arg0, %arg1, 0, 0], sizes = [1, 1, 1, 1], strides = [1, 1, 1, 1] : tensor<1x1x1x1xf32> -> !flow.dispatch.tensor<readwrite:tensor<1x1x1x1xf32>>
}
}
return
}
Here, vectorization has turned the mixed-types linalg.matmul into a mixed-types vector.contract. This is the likely root cause, because unlike linalg.matmul, vector.contract does not really support mixed types. It's a bit of a gray area because its effective semantics are not fully defined and are in practice defined by lowerings, but a bunch of existing code treats vector.contract as not supporting mixed types, and the error generated here seems to be about ConvertToLLVM not supporting mixed-types vector.contract.
The likely fix is to change GenericVectorization (iree-codegen-generic-vectorization) to generate arith.ext* ops on the operands of the vector.contract, so that all its operands share the same element type.
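Concretely, a hand-written sketch (not actual compiler output) of what the vectorized IR above could look like with that fix, reusing the SSA names and indexing maps from the dump:

// Assumed shape of the fixed vectorizer output: extend both f16 operands to f32,
// then contract entirely in f32 (%6, %7, %8 are the transfer_read results above).
%lhs_f32 = arith.extf %6 : vector<1x1x1x1xf16> to vector<1x1x1x1xf32>
%rhs_f32 = arith.extf %7 : vector<1x1x1x1xf16> to vector<1x1x1x1xf32>
%9 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d2, d3, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d1, d2, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d3, d4)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction"], kind = #vector.kind<add>} %lhs_f32, %rhs_f32, %8 : vector<1x1x1x1xf32>, vector<1x1x1x1xf32> into vector<1x1x1x1xf32>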
To add to @bjacob's comment, that fix is right, but I think that on the GPU side mixed types are supported in hardware. So, along with the fix to the vectorizer, an extra folding pattern should also be added on the GPU side to fold the ext -> vector.contract sequence back into the right instruction.
Basically, we should stop relying on mixed-types vector.contract. It would be best if all patches on this path are tested on CI in IREE before landing in LLVM, to find regressions beforehand.
on the GPU side mixed types are supported in hardware
Nit: CPU ISAs have mixed-types multiplication instructions, too. So the difference between CPU and GPU here is not about ISA (let alone hardware). It's a compiler implementation choice: whereas Codegen/CPU has chosen to treat mixed-types vector.contract as illegal, Codegen/GPU has (if I understand correctly) actually provided lowerings for it, and thus expects a mixed-types contraction to be represented as a single mixed-types vector.contract, not multiple ops (arith.ext -> vector.contract), and we should avoid causing any regression on GPU by changing that.
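For illustration, a hand-written before/after sketch (hypothetical shapes and SSA names, not taken from the issue) of the folding the GPU path would need once the vectorizer emits explicit extensions:

// Before folding: what the fixed vectorizer would emit (explicit extensions plus a single-type contract).
%lhs = arith.extf %lhs_f16 : vector<16x8xf16> to vector<16x8xf32>
%rhs = arith.extf %rhs_f16 : vector<8x16xf16> to vector<8x16xf32>
%res_before = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %lhs, %rhs, %acc : vector<16x8xf32>, vector<8x16xf32> into vector<16x16xf32>

// After the GPU-side folding: back to the single mixed-types contraction that the existing GPU lowerings expect to match.
%res_after = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %lhs_f16, %rhs_f16, %acc : vector<16x8xf16>, vector<8x16xf16> into vector<16x16xf32>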
We have been compiling i8, i8 -> i32 for a while, at least for RISC-V, but I don't remember what happened at the vector.contract level. I also remember some examples from SVE/SME? (@banach-space)
According to the doc, vector.contract supports different types for the accumulator:
// Vector contraction with mixed typed. lhs/rhs have different element
// types than accumulator/result.
%5 = vector.contract #contraction_trait %0, %1, %2
: vector<10xf16>, vector<10xf16> into f32
vector.contract is a relatively high-level operation that still preserves quite some structure, so using a wider accumulator seems like something that fits well at that level, to some extent. Not sure if the issue is at the LLVM conversion level, though, because this op should have been lowered into a lower-level vector representation (outerproduct, fmas, etc.) far before we lower to LLVM itself... and it doesn't look like that is the case:
%71 = llvm.insertvalue %70, %3[0] : !llvm.array<1 x array<1 x vector<1xf16>>>
%72 = builtin.unrealized_conversion_cast %71 : !llvm.array<1 x array<1 x vector<1xf16>>> to vector<1x1x1xf16>
%73 = llvm.insertvalue %53, %4[0] : !llvm.array<1 x vector<1xf16>>
%74 = builtin.unrealized_conversion_cast %73 : !llvm.array<1 x vector<1xf16>> to vector<1x1xf16>
%75 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1)>], iterator_types = ["reduction", "parallel", "reduction"], kind = #vector.kind<add>} %74, %72, %69 : vector<1x1xf16>, vector<1x1x1xf16> into vector<1xf32>
%76 = llvm.extractelement %75[%1 : i64] : vector<1xf32>
%77 = llvm.insertelement %76, %7[%9 : i64] : vector<1xf32>
%78 = llvm.extractelement %77[%9 : i64] : vector<1xf32>
%79 = llvm.getelementptr %26[32] : (!llvm.ptr) -> !llvm.ptr, f32
I think there's been back and forth on this as the status of mixed-types vector.contract wasn't clarified early, so at the moment we are left with different parts of the code taking a different stance on it. For example, the compile error that this issue is about suggests that ConvertToLLVM doesn't fully support mixed-types vector.contract. Some 2 years ago, I prepared an LLVM PR (back then on Phabricator) to make vector.contract lowerings more fully support mixed types, and that was turned down by @nicolasvasilache as not something that we actually wanted to support. I still think that Nicolas' decision then was the right one.
Trying to support mixed-types vector.contract runs into a tricky issue when one thinks about integer arithmetic support. As we use signless integers throughout, a mixed-types integer vector.contract can't know whether to interpret operands as signed or unsigned. So, with integer types, we have to have explicit arith.extui/extsi on the operands anyway, at which point we don't need mixed-types vector.contract anymore.
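For illustration (a hand-written sketch, not from the issue; %lhs_i8, %rhs_i8 and %acc are assumed to be defined elsewhere), an i8 x i8 -> i32 dot product where the explicit extension op is what carries the signedness information:

// The extension op, not the contraction, decides how the signless i8 bits are widened.
%lhs = arith.extsi %lhs_i8 : vector<4xi8> to vector<4xi32>  // signed interpretation
%rhs = arith.extui %rhs_i8 : vector<4xi8> to vector<4xi32>  // unsigned interpretation
%dot = vector.contract {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"], kind = #vector.kind<add>} %lhs, %rhs, %acc : vector<4xi32>, vector<4xi32> into i32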
All in all, while it certainly is possible to resolve this by treating mixed-type vector.contract as legal and trying to make ConvertToLLVM handle it (except perhaps for signless integers, where that seems under-defined), I believe that here the simpler resolution is to treat it as something to avoid generating in vectorization.
There is a default behavior for integers but it can be kind of surprising to the user:
If operands and the result have types of different bitwidths, operands are promoted to have the same bitwidth as the result before performing the contraction. For integer types, only signless integer types are supported, and the promotion happens via sign extension.
We are not generating vector.contract ops directly from the vectorizer. There are some patterns that are matching vector reduction ops and raising them to vector.contract, so we should fix those if we decide to go with that approach.
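For context, a sketch of the kind of rewrite such a raising pattern performs (hypothetical values; the exact op combinations matched in IREE/MLIR may differ): an elementwise multiply feeding a reduction is rewritten into a single contraction.

// Input: multiply then reduce.
%mul = arith.mulf %a, %b : vector<8xf32>
%red = vector.reduction <add>, %mul, %acc : vector<8xf32> into f32
// Raised form: one vector.contract expressing the same dot product.
%dot = vector.contract {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"], kind = #vector.kind<add>} %a, %b, %acc : vector<8xf32>, vector<8xf32> into f32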
I don't have a preference but I may see people using this inconsistently. I wouldn't be surprised if we had some direct conversions from mixed-type vector.contract ops to Nvidia mma ops (totally guessing here).
There is a default behavior for integers but it can be kind of surprising to the user ... I don't have a preference but I may see people using this inconsistently.
The only way to fix it is to add that to the verifier to ensure consistent usage. But that will break downstream users. We should send an RFC and make the change. The first order of business is to remove such behavior in IREE.
I wouldn't be surprised if we had some direct conversions from mixed-type vector.contract ops to Nvidia mma ops (totally guessing here).
We probably do, but that should be changed to a peephole that matches a DAG that includes the ext to fold it into the nvgpu op
I generally agree with the discussion here.
One thing to keep in mind however is that representing some things we care about in LLVM can get quite tricky. I even wonder whether some cases are possible at all.
For instance, I dug up the peephole that matches a DAG that includes the ext to fold it for pmaddubsw, https://github.com/llvm/llvm-project/blob/3651f377f632d9152f2bd8fc2f9302ec9f9bdd5e/llvm/test/CodeGen/X86/pmaddubsw.ll#L10, and it can get pretty hairy.
I also remember some examples from SVE/SME? (@banach-space)
For SME there are multiple widening outer-product instructions, e.g.:
It's important that we can target these from MLIR - I am not expecting to get much help from the backend in the form of e.g. peephole optimisations. There are some SVE examples of widening instructions too, but for 1-D scalable vectors we can expect more help from the backend (worth double-checking, though).
The likely fix is to change GenericVectorization (iree-codegen-generic-vectorization) to generate arith.ext* ops on the operands of the vector.contract, so that all its operands share the same element type.
This should be fine for SME - we could provide our own conversion patterns for SME to target the widening outer-product instructions/intrinsics.
Given the issue with representing signed/unsigned extension semantics at the vector level, to be able to target dot prod instructions with mixed data types, we probably need to start mirroring those in dialects in MLIR that are closer to hardware (like the NVGPU dialect, etc.)
we probably need to start mirroring those in dialects in MLIR that are closer to hardware (like the NVGPU dialect, etc.)
That's our plan for SME's outer product, but even then we will have to match patterns like this (hypothetical example):
%lhs = arith.extsi %lhs_i8 : vector<4xi8> to vector<4xi32>
%rhs = arith.extsi %rhs_i8 : vector<4xi8> to vector<4xi32>
%res = vector.outerproduct %lhs, %rhs, %acc : vector<4xi32>, vector<4xi32>
Also, do you mean mirroring higher level vector ops like vector.contract, or just lower level ones like vector.outerproduct?
As it exists today, I think that would be for all vector dialect ops. Without this, I am not sure how to represent the right sign extension semantics for these examples.
Wouldn't just adding a signed/unsigned attribute to the existing dictionary fix the problem without changing the existing behavior or introducing large matches or major changes? What would be the problem with that?
I am not sure how future-proof that is. It works for integer-to-integer conversion, but how would that account for float/int mixed types, etc.? Also, with things being different on different architectures, this folding probably needs to happen after the generic vectorization.
I don't think I see why it wouldn't work for the float/int cases, but I'm probably missing something... We could also introduce multiple variants of the contract op (signed, unsigned, fp, ...).
My point is that there seems to be some value in abstracting away some of the low-level details of the contract at this level (the same argument has been made for xfer ops, for example, and vector.contract is at a similar level of abstraction), esp. if it matches/facilitates the lowering to existing target instructions that are at the same level of abstraction (the example that Nicolas shared is scary :)). That's kind of what the progressive lowering story is about. The main problems I see are the representation gaps and loose semantics that the op currently has. We could work on those. For example, we may not want vector.contract to support inputs with different types, or to use an accumulator of type X and finally return type Y. Those are things that can easily be built around a vector.contract with tighter semantics.
However, if you prefer to leave the input extension outside of the op, I'm also ok with that but I think it's going to be a rocky but not necessarily better road.
Thanks for the discussion!
What I would like to see are concrete examples, bottom up, with serious coverage of the behaviors we want to hit. Until we have comparable test coverage to https://github.com/llvm/llvm-project/tree/3651f377f632d9152f2bd8fc2f9302ec9f9bdd5e/llvm/test/CodeGen/X86 for the things we care about, we are not doing it right.
What is the lowest level of MLIR dialects we actually want to generate that is guaranteed to get the right operations, and how do we iterate factoring this up into the right semantics for an op like vector.contract?
I'd suggest starting from Benoit's great compendium in https://gist.github.com/bjacob/62e7143229fd142b45c171c07e0a7f5a#the-table and write one test for each.
I started reading from the middle of the thread, but I think the general rule for ops like this (e.g., vector.reduce, vector.contract, etc.) that only accept signless integers is that the semantics must be fully defined by the op and its attributes. For example, vector.reduction <add> is fine because integer addition is signless, but something like min/max is not, and we can't rely on any zero/sign-extension of the operands to define what the reduction means. I.e., for the same, bit-identical, operands, it must produce the same result regardless of which ops these come from.
For reduction over signless operands we need signed and unsigned variants of min/max. I think this is currently a bug in the vector.reduction op definition, which reads:
vector.reduction (vector::ReductionOp)

Reduction operation.

Syntax:

operation ::= `vector.reduction` $kind `,` $vector (`,` $acc^)? (`fastmath` `` $fastmath^)? attr-dict `:` type($vector) `into` type($dest)

Reduces an 1-D vector “horizontally” into a scalar using the given operation (add/mul/min/max for int/fp and and/or/xor for int only). Reductions also allow an optional fused accumulator.
Ouch, we would also have to add the unsigned/signed min/max ops to the contraction kind if it's not already there.
To unblock this, I think we could just add the optional expansion attribute to the contraction traits dict and set it to signed/unsigned for integer inputs. It's the simplest fix, it preserves the existing behavior, and I can't think of a case where it wouldn't work. Then, we could think later about whether we want to explore other routes. WDYT?
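To make that concrete, a purely hypothetical syntax sketch (the expansion attribute does not exist in the vector dialect today; everything else is standard syntax):

// Hypothetical: expansion = "signed" would state that the signless i8 inputs are
// sign-extended to the i32 accumulator type before the contraction is performed.
%r = vector.contract {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"], kind = #vector.kind<add>, expansion = "signed"} %a, %b, %acc : vector<16xi8>, vector<16xi8> into i32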
To unblock this, I think we could just add the optional expansion attribute to the contraction traits dict and set it to signed/unsigned for integer inputs.
The signed/unsigned wouldn't work for float types... The fix really should be to deprecate min/max and add smin/smax/umin/umax attributes.
The signed/unsigned wouldn't work for float types...
Sorry, I'm still missing why not. Would you mind providing a concrete example? That would help.
The fix really should be to deprecate min/max and add smin/smax/umin/umax attributes.
Yes, that's what I said but we also have to fix the integer extension problem for the rest of the combiners.
Turns out that signed/unsigned min/max are already there, it's just that the documentation doesn't make it super clear: https://github.com/llvm/llvm-project/blob/72e6c1c70d5e07bbc8cb7cae2ed915108daf93aa/mlir/include/mlir/Dialect/Vector/IR/VectorAttributes.td#L22-L27
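For instance (a minimal sketch, not from the issue, assuming current upstream vector.reduction syntax), the signedness is carried by the combining kind itself, so the same signless bits reduce differently:

// dense<[-1, 1]> is the bit pattern [0xFF, 0x01]: {-1, 1} read as signed, {255, 1} read as unsigned.
%v = arith.constant dense<[-1, 1]> : vector<2xi8>
%smin = vector.reduction <minsi>, %v : vector<2xi8> into i8  // -1 (0xFF)
%umin = vector.reduction <minui>, %v : vector<2xi8> into i8  //  1 (0x01)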
Is this really solved or merely worked around? Reopening for now.
This is fixed:
This arises when trying to add e2e matmul tests with mixed-types such as f16 x f16 -> f32, enabling data-tiling, and not using microkernels (as the issue is in vectorization, which is circumvented by microkernels).

Testcase:
Compile:
Print-ir-after-all log stored in https://gist.github.com/bjacob/4ff73e95e3a21abc70f2ef29a1c12300
The issue is that vectorization creates a mixed-types vector.contract, which is then unhandled in ConvertToLLVM.