openxla / stablehlo

Backward compatible ML compute opset inspired by HLO/MHLO
Apache License 2.0

Figure out the story for hybrid quantization #1575

Closed · burmako closed 7 months ago

burmako commented 1 year ago

At the moment, dot_general (as well as convolution, as proposed in #1477) doesn't support hybrid quantization, e.g. a float lhs and a quantized rhs. However, this is an important practical use case. How do we represent it?

sdasgup3 commented 1 year ago

Thanks @burmako for bringing up this topic. Let me add a bit of context around it to further the discussion.

A few definitions that might be handy for the discussion:

Quantization Techniques

Let us have a look at the convolution op, which can be used to implement one of the above techniques:

%result = stablehlo.convolution(%arg0, %arg1)
    dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :
    (tensor<?x5x5x3xf32>, tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
    -> tensor<?x3x3x38xf32>

We note that we also call this a hybrid op: its operands are of different types (the activation %arg0 is non-quantized, f32, while the weight %arg1 is quantized, say qi8, where qi8 denotes a quantized tensor type with 8-bit storage_type). There are two possible interpretations of such a hybrid op, depending on whether the op's semantics unify %arg0/%arg1 to the same type as (qi8, qi8) or as (f32, f32).

Op unifying the operand types to f32, also known as weight-only:

The op, as part of its semantics, dequantizes the weight and performs a floating-point convolution, producing a floating-point result. We note that, in general, such an op can be emulated by explicitly dequantizing the weight and then performing a convolution between floating-point types.
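To make this concrete, here is a minimal NumPy sketch of the weight-only reference semantics, using dot_general in its 2-D (plain matmul) form for brevity. The function name and the per-tensor scale/zero_point defaults (mirroring the 34.0:16 example above) are illustrative assumptions, not StableHLO APIs:

import numpy as np

# Weight-only sketch: dequantize the quantized weight, then compute
# entirely in floating point. scale/zero_point are the per-tensor
# quantization parameters of rhs_q (illustrative values).
def weight_only_matmul(lhs_f32, rhs_q, scale=34.0, zero_point=16):
    rhs_f32 = (rhs_q.astype(np.float32) - zero_point) * scale
    return lhs_f32 @ rhs_f32  # plain floating-point matmul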

Op unifying the operand types to qi8, also known as DRQ (dynamic-range quantization):

The op, as part of its semantics, calculates quantization parameters for the input activation, quantizes it, performs a (qi8, qi8) convolution, and dequantizes the resulting accumulation.
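For contrast, here is a minimal NumPy sketch of the DRQ reference semantics, under the assumption of symmetric per-tensor int8 quantization of the activation (again with illustrative names and defaults, not StableHLO APIs):

import numpy as np

# DRQ sketch: derive the activation's quantization parameters from its
# dynamic range at runtime, accumulate in integer arithmetic, then
# dequantize the accumulated result.
def drq_matmul(lhs_f32, rhs_q, rhs_scale=34.0, rhs_zero_point=16):
    lhs_scale = np.max(np.abs(lhs_f32)) / 127.0
    lhs_q = np.clip(np.round(lhs_f32 / lhs_scale), -127, 127).astype(np.int32)
    acc = lhs_q @ (rhs_q.astype(np.int32) - rhs_zero_point)  # (qi8, qi8) -> i32
    return acc * (lhs_scale * rhs_scale)  # dequantize the accumulation

The key difference from weight-only is that the activation scale is computed from the data at runtime, so the expensive inner product runs in integer arithmetic.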

Next, let us talk about some of the pros and cons of expressing such a hybrid op in StableHLO.

Pros

Cons

Current state with expressing hybrid ops in StableHLO

StableHLO, in its current form, does not support hybrid ops, but we are excited to gather community feedback to learn more about use cases for such ops and their associated trade-offs.

A few additional notes

Please let me know your comments and feedback.

sdasgup3 commented 11 months ago

https://github.com/openxla/stablehlo/pull/1792 proposes semantic changes in StableHLO to support weight-only quantization for the convolution and dot_general ops.

Remaining tasks:

  1. Figure out whether any ops other than dot_general and convolution need hybrid op support.
  2. Figure out the story for dynamic-range quantization.
sdasgup3 commented 7 months ago

With #1792 merged, let us close this issue. We will open separate issues for the remaining tasks in https://github.com/openxla/stablehlo/issues/1575#issuecomment-1819957802 once we have more information around them.