openxla / stablehlo

Backward compatible ML compute opset inspired by HLO/MHLO
Apache License 2.0

Figure out the story for hybrid quantization #1575

Closed · burmako closed 7 months ago

burmako commented 1 year ago

At the moment, dot_general (as well as convolution, as proposed in #1477) doesn't support hybrid quantization, e.g. a float lhs and a quantized rhs. However, this is an important practical use case. How do we represent it?

sdasgup3 commented 1 year ago

Thanks @burmako for bringing up this topic. Let me add a bit of context around it to further the discussion.

A few definitions that might be handy for the discussion:

Quantization Techniques

Let us have a look at the convolution op, which can be used to implement one of the above techniques:

%result = stablehlo.convolution(%arg0, %arg1)
    dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :
    (tensor<?x5x5x3xf32>, tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
    -> tensor<?x3x3x38xf32>

We note that we also call this a hybrid op: its operands are of different types (the activation %arg0 is non-quantized, f32, while the weight %arg1 is quantized, say qi8, where qi8 denotes a quantized tensor type with 8-bit storage_type). There are two possible interpretations of such a hybrid op, depending on whether the op's semantics unify %arg0/%arg1 to the same type as (qi8, qi8) or as (f32, f32).

Op unifying the operand types to f32, also known as weight-only:

The op, as part of its semantics, dequantizes the weight and performs a floating-point convolution, producing a floating-point result. We note that, in general, such an op can be emulated by explicitly dequantizing the weight and then performing a convolution between floating-point types.
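To make this concrete, here is a minimal NumPy sketch of the weight-only reference semantics, using dot_general in its 2-D (plain matmul) form for brevity. The function name and the per-tensor scale/zero_point defaults (mirroring the 34.0:16 example above) are illustrative assumptions, not StableHLO APIs:

import numpy as np

# Weight-only sketch: dequantize the quantized weight, then compute
# entirely in floating point. scale/zero_point are the per-tensor
# quantization parameters of rhs_q (illustrative values).
def weight_only_matmul(lhs_f32, rhs_q, scale=34.0, zero_point=16):
    rhs_f32 = (rhs_q.astype(np.float32) - zero_point) * scale
    return lhs_f32 @ rhs_f32  # plain floating-point matmul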

Op unifying the operand types to qi8, also known as DRQ (dynamic-range quantization):

The op, as part of its semantics, calculates quantization parameters for the input activation, quantizes it, performs a (qi8, qi8) convolution, and dequantizes the resulting accumulation.
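For contrast, here is a minimal NumPy sketch of the DRQ reference semantics, under the assumption of symmetric per-tensor int8 quantization of the activation (again with illustrative names and defaults, not StableHLO APIs):

import numpy as np

# DRQ sketch: derive the activation's quantization parameters from its
# dynamic range at runtime, accumulate in integer arithmetic, then
# dequantize the accumulated result.
def drq_matmul(lhs_f32, rhs_q, rhs_scale=34.0, rhs_zero_point=16):
    lhs_scale = np.max(np.abs(lhs_f32)) / 127.0
    lhs_q = np.clip(np.round(lhs_f32 / lhs_scale), -127, 127).astype(np.int32)
    acc = lhs_q @ (rhs_q.astype(np.int32) - rhs_zero_point)  # (qi8, qi8) -> i32
    return acc * (lhs_scale * rhs_scale)  # dequantize the accumulation

The key difference from weight-only is that the activation scale is computed from the data at runtime, so the expensive inner product runs in integer arithmetic.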

Next, let us talk about some of the pros and cons of expressing such a hybrid op in StableHLO.

Pros

Cons

Current state with expressing hybrid ops in StableHLO

StableHLO, in its current form, does not support hybrid ops, but we are excited to gather community feedback to learn more about use cases for such ops and their associated trade-offs.

A few additional notes

Please let me know your comments and feedback.

sdasgup3 commented 11 months ago

https://github.com/openxla/stablehlo/pull/1792 proposes semantic changes in StableHLO to support weight-only quantization for the convolution and dot_general ops.

Remaining tasks:

  1. Figure out whether any ops other than dot_general and convolution need hybrid op support.
  2. Figure out the story for dynamic-range quantization.
sdasgup3 commented 7 months ago

With #1792 merged, let us close this issue. We will open separate issues for the remaining tasks in https://github.com/openxla/stablehlo/issues/1575#issuecomment-1819957802 once we have more information around them.