huggingface / optimum-quanto

A pytorch quantization backend for optimum

Should quanto use int dtype in AffineQuantizer instead of uint? #134

Closed · shuokay closed this issue 6 months ago

shuokay commented 6 months ago

According to the code in https://github.com/huggingface/quanto/blob/main/quanto/tensor/qbitstensor.py#L34, quanto uses a uint dtype to store the quantized values in the affine quantizer, while the symmetric quantizer uses an int dtype (https://github.com/huggingface/quanto/blob/main/quanto/tensor/qtensor.py#L62).
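To make the distinction concrete, here is a minimal sketch of the two schemes (my own illustration, not quanto's actual implementation):

```python
import torch

def affine_quantize(x: torch.Tensor, bits: int = 8):
    # Affine/asymmetric: map [min, max] onto [0, 2**bits - 1].
    # The zero-point shifts everything non-negative, so unsigned storage fits.
    qmin, qmax = 0, 2**bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zeropoint = (qmin - x.min() / scale).round().clamp(qmin, qmax)
    q = (x / scale + zeropoint).round().clamp(qmin, qmax).to(torch.uint8)
    return q, scale, zeropoint

def symmetric_quantize(x: torch.Tensor, bits: int = 8):
    # Symmetric: map [-absmax, absmax] onto the signed range, zero-point is 0.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    q = (x / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale
```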

Taking hardware into consideration: if we quantize both weights and activations to signed int types, would that reduce cost on GPUs or NPUs, since it only requires integer-type MAC arrays?

shuokay commented 6 months ago

See https://oneapi-src.github.io/oneDNN/dev_guide_int8_computations.html#inputs-of-mixed-type-u8-and-s8:

> The Instruction Set Architecture (ISA) has special instructions that enable multiplying and adding vectors of u8 and s8 very efficiently. oneDNN enables int8 support using these particular instructions.

I think this is why many quantization frameworks quantize activations to uint types. However, many NPUs now directly support s8s8s32 computation (oneDNN also supports s8s8s32 now).
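To make the two accumulation patterns concrete, here is a small emulation of them (my own sketch; real kernels use the vector instructions, this just mimics the operand types with plain int32 matmuls):

```python
import torch

w_s8 = torch.randint(-128, 128, (8, 4), dtype=torch.int8)  # signed weights

# u8s8s32: unsigned activations times signed weights, int32 accumulation
# (the combination the x86 u8/s8 instructions accelerate natively)
a_u8 = torch.randint(0, 256, (4, 8), dtype=torch.uint8)
acc_u8s8 = a_u8.to(torch.int32) @ w_s8.to(torch.int32)

# s8s8s32: both operands signed, int32 accumulation
# (the combination cuBLAS and newer oneDNN support)
a_s8 = torch.randint(-128, 128, (4, 8), dtype=torch.int8)
acc_s8s8 = a_s8.to(torch.int32) @ w_s8.to(torch.int32)
```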

shuokay commented 6 months ago

cublasGemmEx() and cublasGemmBatchedEx() support s8s8s32 but not u8s8s32.
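For example, PyTorch exposes this signed path through the private op torch._int_mm, which to my understanding dispatches to cuBLAS's s8s8s32 kernels (a hedged sketch; the op is private and its shape constraints vary across versions):

```python
import torch

# torch._int_mm: both operands must be int8, the output is int32.
# Shape constraints may apply depending on the PyTorch/CUDA version.
if torch.cuda.is_available():
    a = torch.randint(-128, 128, (32, 64), dtype=torch.int8, device="cuda")
    b = torch.randint(-128, 128, (64, 32), dtype=torch.int8, device="cuda")
    c = torch._int_mm(a, b)  # s8 x s8 -> s32; there is no u8 variant
    assert c.dtype == torch.int32
```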

dacorvo commented 6 months ago

The reason why AffineQuantizer uses int8 is that it uses a zero-point, hence all quantized values are positive. The actual value that should be used in the mm is (d - zeropoint), which is an int8.
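A small numeric illustration of that shift, assuming for this example a mid-range zero-point of 128:

```python
import torch

q = torch.tensor([0, 100, 200, 255], dtype=torch.uint8)  # stored uint8 codes
z = 128                                                  # example zero-point
shifted = q.to(torch.int16) - z                          # [-128, -28, 72, 127]
print(shifted.to(torch.int8))  # fits int8 exactly when z sits mid-range
```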

shuokay commented 6 months ago

Hi @dacorvo, I don't follow the connection between using a zero-point and all quantized values being positive:

> The reason why AffineQuantizer uses int8 is that it uses a zero-point, hence all quantized values are positive.

I think you mean that AffineQuantizer uses uint8. According to Google's doc at https://github.com/google/gemmlowp/blob/master/doc/quantization.md, the equation is real_value = scale * (quantized_value - zeropoint). I see no constraint that the quantized value has to be uint8; the doc only says quantized values are typically uint8, not that they must be uint8.
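For instance, the same equation round-trips fine with signed storage (my own sketch, using a hypothetical signed [-128, 127] range):

```python
import torch

# Affine quantization with *signed* codes: nothing in
# real_value = scale * (quantized_value - zeropoint) forces uint8 storage,
# as long as the zero-point lives in the same signed range.
x = torch.tensor([-1.0, -0.25, 0.0, 0.5, 2.0])
qmin, qmax = -128, 127
scale = (x.max() - x.min()) / (qmax - qmin)
zeropoint = int((qmin - x.min() / scale).round())
q = (x / scale + zeropoint).round().clamp(qmin, qmax).to(torch.int8)
dequant = scale * (q.int() - zeropoint)  # recovers x up to rounding error
```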