Thanks @burmako for bringing up this topic. Let me add a bit of context around it to further the discussion.
A few definitions which might be handy in the presentation:
Let us have a look at the convolution op, which can be used to implement one of the above techniques:
%result = stablehlo.convolution(%arg0, %arg1)
    dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :
    (tensor<?x5x5x3xf32>, tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
    -> tensor<?x3x3x38xf32>
We note that we also call this a hybrid op, where the operands are of different types: the activation %arg0 is non-quantized (f32), while the weight %arg1 is quantized (say qi8, where qi8 represents a quantized tensor type with 8-bit storage_type). There can be the following two interpretations of such a hybrid op, based on the fact that the op will have the semantics to bring %arg0/%arg1 to the same type, either (qi8, qi8) or (f32, f32):
- (f32, f32), also known as weight-only: The op, as part of its semantics, dequantizes the weight and performs a floating-point convolution, producing a floating-point result. We note that, in general, the above op can be emulated by explicitly dequantizing the weight and then doing a convolution between floating-point types (see the sketch after this list).
- (qi8, qi8), also known as DRQ (dynamic-range quantization): The op, as part of its semantics, calculates quantization parameters, quantizes the input activation, performs a convolution(qi8, qi8), and dequantizes the resulting accumulation.
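To make the weight-only interpretation concrete, here is a minimal sketch (reusing the shapes and quantization parameters from the example above, so treat it as illustrative rather than normative) of how the hybrid op can be emulated with explicit ops: dequantize the weight with stablehlo.uniform_dequantize, then perform an ordinary floating-point convolution:

// Dequantize the quantized weight to f32 (weight-only interpretation).
%weight_f32 = stablehlo.uniform_dequantize %arg1 :
    (tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>) -> tensor<3x3x3x38xf32>
// Floating-point convolution between the f32 activation and the dequantized
// weight, producing the same f32 result type as the hybrid op.
%result = stablehlo.convolution(%arg0, %weight_f32)
    dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :
    (tensor<?x5x5x3xf32>, tensor<3x3x3x38xf32>) -> tensor<?x3x3x38xf32>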
Next, let us talk about some of the pros and cons of expressing such a hybrid op in StableHLO. convolution(f32, qi8) seems ambiguous: the specification of each op supporting the hybrid scheme needs to spell out both variants. StableHLO, in its current form, does not support hybrid ops, but we are excited to gather community feedback to learn more about use cases for such ops and their associated trade-offs.
In the meantime, the stablehlo.custom operation could be used to support DRQ. Please let me know your comments and feedback.
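For illustration only, here is one way such a DRQ convolution might look if kept opaque behind StableHLO's custom_call mechanism; the call target name drq_convolution is made up for this sketch and is not part of StableHLO or any runtime:

// Hypothetical sketch: the entire DRQ convolution (compute quantization
// parameters, quantize the activation, convolve, dequantize) hidden behind
// a single custom call with an illustrative target name.
%result = stablehlo.custom_call @drq_convolution(%arg0, %arg1) :
    (tensor<?x5x5x3xf32>, tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
    -> tensor<?x3x3x38xf32>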
https://github.com/openxla/stablehlo/pull/1792 proposes semantic changes in StableHLO to support weight-only quantization for the convolution and dot_general ops.
Remaining tasks:
With #1792 merged, let us close this issue. We will open separate issues for the remaining items (https://github.com/openxla/stablehlo/issues/1575#issuecomment-1819957802) once we have more information around them.
At the moment, dot_general (as well as convolution, as proposed in #1477) doesn't support hybrid quantization, e.g. a float lhs and a quantized rhs. However, this is an important practical use case. How do we represent it?