oneAPI Deep Neural Network Library (oneDNN)

[Quantization] Any way to simulate asymmetric quantization? #665

Closed masahi closed 4 years ago

masahi commented 4 years ago

Hi, from the doc https://intel.github.io/mkl-dnn/dev_guide_attributes_quantization.html it seems that DNNL quantized convolution supports only symmetric quantization. But I have a use case where I want to execute a quantized conv op that comes from PyTorch, and it has some non-zero zero points.

Is there a way to simulate quantized convolution with non-zero zero points using DNNL? Performance is not too important for me right now.

Is

Manual shift -> normal int32 conv -> requantize

a good approach?

emfomenk commented 4 years ago

Hi @masahi,

> Is manual shift a good approach?

No, not if you want to keep using int8 data types. The reason is that once the shift is applied the data may no longer be representable in int8 (say, the original value was 3 and the zero point was 200 --> once subtracted the value becomes -197, which is outside the s8 range [-128, 127]).

If you are fine with emulating int8 computations using floating point operations -- the approach you suggested should work. The only pitfall is that the result might differ slightly if rounding happens during the computation (f32 has only 24 bits of significand precision, so intermediate integer values above 2^24 cannot all be represented exactly in the f32 data type --> hence rounding).
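
For concreteness, here is a minimal NumPy sketch of that emulation (a matmul stands in for the convolution's inner product, and all names and values are just illustrative -- none of this is DNNL API):

```python
import numpy as np

# Illustrative values only -- nothing here is DNNL API.
src_q = np.array([[3, 130, 255]], dtype=np.uint8)   # quantized u8 activations
wei_q = np.array([[10], [-5], [7]], dtype=np.int8)  # quantized s8 weights
src_zp = 128                                        # activation zero point
src_scale, wei_scale, dst_scale = 0.02, 0.01, 0.05  # quantization scales
dst_zp = 128                                        # output (u8) zero point

# 1) Manual shift: subtract the zero point in a wider type (f32 here).
src_shifted = src_q.astype(np.float32) - src_zp

# 2) "Normal" convolution, emulated here by a matmul in f32.
acc = src_shifted @ wei_q.astype(np.float32)

# 3) Requantize: apply the combined scale and the output zero point, clamp to u8.
dst_q = np.clip(np.rint(acc * (src_scale * wei_scale / dst_scale)) + dst_zp,
                0, 255).astype(np.uint8)
print(dst_q)  # [[127]]
```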

The alternative approach would be to split the computation into two convolutions (assuming that the non-trivial zero point is applied to the source data tensor only):

  1. Compute int8 convolution with s32 output (w/o any scaling and post-ops) w/o taking zero points into account
  2. Compute int8 convolution with s32 output with a special input -- broadcasted zero-point.
  3. Subtract the second tensor from the first one.
  4. Apply all (re-)quantization scaling and post-ops

This is conceptually how the library would implement the convolution with non-trivial zero points (a rough NumPy sketch of the decomposition is shown below). However, this is a much more intrusive way and actually quite inefficient (the slowdown is >2x compared to the convolution w/o zero point). Given also the complex API the library has, I would suggest avoiding this route :)
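
A rough sketch of those four steps with NumPy (again, a matmul stands in for the convolution and all names are illustrative rather than DNNL API):

```python
import numpy as np

# Same illustrative values as in the sketch above.
src_q = np.array([[3, 130, 255]], dtype=np.uint8)
wei_q = np.array([[10], [-5], [7]], dtype=np.int8)
src_zp = 128
src_scale, wei_scale, dst_scale = 0.02, 0.01, 0.05
dst_zp = 128

# 1) int8 "convolution" with s32 accumulation, ignoring the zero point.
acc_main = src_q.astype(np.int32) @ wei_q.astype(np.int32)

# 2) The same "convolution" applied to the broadcasted zero point.
acc_zp = np.full_like(src_q, src_zp, dtype=np.int32) @ wei_q.astype(np.int32)

# 3) Subtract the second tensor from the first one.
acc = acc_main - acc_zp

# 4) Apply the (re-)quantization scaling and the output zero point.
dst_q = np.clip(np.rint(acc * (src_scale * wei_scale / dst_scale)) + dst_zp,
                0, 255).astype(np.uint8)
print(dst_q)  # [[127]], matches the f32 emulation above
```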

Summary:

  1. If the performance is not a concern at all, probably using the implementation from the framework is the way to go
  2. If the previous bullet doesn't work (say, for whatever reason, the performance there is awful), a manual shift could be used (but don't forget to change the data type). Whether to use the framework here or DNNL is up to you.
  3. As far as performance is concerned, the only way to go is a proper implementation in the library.

P.S. A nice explanation of how an implementation can handle zero points efficiently can be found in the gemmlowp docs.

masahi commented 4 years ago

@emfomenk Thanks very much for the detailed answer. My use case is to convert quantized PyTorch models to TVM and run them on more backends, so using the PyTorch implementation is not an option.

The TVM community is developing a mechanism to easily plug external libraries like TensorRT and DNNL into their compilation pipeline. See my PR https://github.com/apache/incubator-tvm/pull/4741 for example, where I demonstrate using DNNL's fused conv op from TVM. My next step is to do the same exercise for quantized ops, and for that I need to handle asymmetry. Since this is mostly for demo purposes and having a reasonably reliable "ground truth" is more important, I don't care about performance for now.

The gemmlowp approach of decomposing qconv into 4 terms is also how TVM handles asymmetry. See https://github.com/apache/incubator-tvm/blob/0755e4a58897c64d6a7ffc86bab3df45554bac7e/src/relay/qnn/op/convolution.cc#L512-L580
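
The four terms come from expanding the product of the zero-point-shifted operands over the reduction axis; here is a quick NumPy check of that identity (the names and shapes are mine, just for illustration):

```python
import numpy as np

# Check the 4-term expansion on random data:
# sum((a - z_a) * (w - z_w)) ==
#   sum(a * w) - z_w * sum(a) - z_a * sum(w) + K * z_a * z_w
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=16).astype(np.int64)     # quantized activations
w = rng.integers(-128, 128, size=16).astype(np.int64)  # quantized weights
z_a, z_w = 128, 3                                       # zero points
K = a.size                                              # reduction size

lhs = np.sum((a - z_a) * (w - z_w))
rhs = np.sum(a * w) - z_w * np.sum(a) - z_a * np.sum(w) + K * z_a * z_w
assert lhs == rhs
```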

Decomposing the op and executing the decomposed pieces with DNNL seems like a good plan, but it would be nicer if the library could handle it automatically. Since both PyTorch and TensorFlow generate non-zero zero points, I think there are good use cases. (Of course, users should be aware that it would be slower than symmetric quantization.)