Our quantization scheme (that paper) assumes that the min-max ranges of all arrays involved, including the output activation tensor, are given. I don't know how to perform quantized inference in a way that's both efficient and accurate without that information.
I am using the data provided by "mobilenet_v1_1.0_224_quant.tflite". The tflite file provides the following details for the first layer:

input: -1 ≤ 0.0078125 * (q - 128) ≤ 0.9921875
weights: -3.265998125076294 ≤ 0.02182667888700962 * (q - 151) ≤ 2.2779781818389893
bias: 0.00017052092880476266 * q
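For example, the input range follows directly from the affine mapping real = S1 * (q - Z1) with q ranging over [0, 255]: 0.0078125 * (0 - 128) = -1 and 0.0078125 * (255 - 128) = 0.9921875.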
So from the above data, I assume:

S1 = 0.0078125, Z1 = -128
S2 = 0.02182667888700962, Z2 = -151
S3 = 0.00017052092880476266, Z3 = 0

Thus the quantization multiplier is M = (S1*S2)/S3. After converting it into a fixed-point multiplier, this becomes:

int multiplier = 1992157696
bit shift = 7
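For reference, here is a minimal sketch of how a real multiplier in (0, 1) can be decomposed into an int32 fixed-point multiplier plus a right shift. It mirrors the QuantizeMultiplierSmallerThanOne helper shown in gemmlowp's quantization example, but the exact naming and rounding here are my own:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decompose a real multiplier M in (0, 1) into an int32 fixed-point multiplier
// and a right shift, so that M ~= quantized_multiplier * 2^-31 * 2^-right_shift.
void QuantizeMultiplierSmallerThanOne(double real_multiplier,
                                      std::int32_t* quantized_multiplier,
                                      int* right_shift) {
  assert(real_multiplier > 0.0 && real_multiplier < 1.0);
  int shift = 0;
  // Scale the multiplier into [0.5, 1) by repeated doubling.
  while (real_multiplier < 0.5) {
    real_multiplier *= 2.0;
    ++shift;
  }
  // Represent the scaled value as a Q0.31 fixed-point number.
  std::int64_t q =
      static_cast<std::int64_t>(std::round(real_multiplier * (1ll << 31)));
  assert(q <= (1ll << 31));
  // Handle the edge case where rounding pushed us up to exactly 2^31.
  if (q == (1ll << 31)) {
    q /= 2;
    --shift;
  }
  assert(shift >= 0);
  assert(q <= std::numeric_limits<std::int32_t>::max());
  *quantized_multiplier = static_cast<std::int32_t>(q);
  *right_shift = shift;
}
```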
final output = ((conv_output_int32 * multiplier) / 2^31) >> shift
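A minimal sketch of applying that formula to one accumulator value, assuming the bias has already been added in int32 and that output_zero_point is the output tensor's Z3 (0 in the example above). Note that gemmlowp/TFLite uses a saturating rounding-doubling high multiply and a rounding right shift, so the simplified rounding below can differ slightly from the reference; off-by-one or off-by-two differences per element are often just this kind of rounding mismatch:

```cpp
#include <algorithm>
#include <cstdint>

// Requantize one int32 accumulator value to uint8.
std::uint8_t RequantizeAccumulator(std::int32_t acc,
                                   std::int32_t quantized_multiplier,
                                   int right_shift,
                                   std::int32_t output_zero_point) {
  // Multiply in 64-bit to avoid overflow, then divide by 2^31 with rounding.
  std::int64_t prod = static_cast<std::int64_t>(acc) * quantized_multiplier;
  std::int64_t rounded = (prod + (1ll << 30)) >> 31;
  // Apply the remaining power-of-two right shift, again with rounding.
  if (right_shift > 0) {
    rounded = (rounded + (1ll << (right_shift - 1))) >> right_shift;
  }
  // Add the output zero point and clamp to the uint8 range.
  std::int64_t result = rounded + output_zero_point;
  result = std::max<std::int64_t>(0, std::min<std::int64_t>(255, result));
  return static_cast<std::uint8_t>(result);
}
```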
There is a slight difference between the output generated by the TensorFlow Lite code and the output generated by my code for the 1st layer, e.g. my output = 161, TensorFlow Lite = 163.
I think this difference amplifies through the network, and at the end I get the wrong classification.
Is the procedure right?
Thank you @bjacob
@kmhatre14 Have you fixed the accuracy problem? After the matrix multiplication, the error will grow.
The small difference in the output does not affect the classification. Our activation maps are about 99% similar to the activation maps generated by TFLite, and the confidence score also matches the TFLite output to about 99%.
The code below has been tested and verified: https://github.com/Ushma30/MobileNet-V1/tree/MobileNet-V1-Quantized
I am trying to implement the quantized version of MobileNet v1 in OpenCL, following the method described in https://arxiv.org/pdf/1712.05877.pdf. I am using pretrained MobileNet weights from the tflite file and have extracted all the required quantization parameters (e.g. S1, S2, and S3) from it. The only issue is converting the accumulator back from int32 to uint8. The gemmlowp kernel uses the min and max of the output tensor to quantize the accumulator from int32 to uint8, but since my implementation runs in OpenCL I cannot get the min and max values of the output tensor at runtime; I would have to write additional logic on the host side, which would add execution time.
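In the scheme from the paper, the output tensor's min/max range is not computed at runtime at all: it is determined ahead of time and stored in the .tflite model, and the output scale S3 and zero point Z3 are derived from it. A small sketch of that derivation, assuming a uint8 target; the function name is my own:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Derive the (scale, zero_point) pair for a uint8 tensor from its real-valued
// min/max range, following the affine mapping real = scale * (q - zero_point).
void ChooseQuantizationParams(float min, float max,
                              float* scale, std::int32_t* zero_point) {
  // The representable range must contain 0 so that zero is exactly representable.
  min = std::min(min, 0.0f);
  max = std::max(max, 0.0f);
  *scale = (max - min) / 255.0f;
  // The zero point is the uint8 value that maps to real 0, nudged into [0, 255].
  const float initial_zero_point = -min / *scale;
  std::int32_t nudged =
      static_cast<std::int32_t>(std::round(initial_zero_point));
  *zero_point = std::max(0, std::min(255, nudged));
}
```

With the first layer's stored input range [-1, 0.9921875], this gives scale = 1.9921875 / 255 = 0.0078125 and zero_point = 128, i.e. exactly the values read from the .tflite file; the same applies to each output tensor, so nothing needs to be computed on the device (or the host) at inference time.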
M = (S1*S2)/S3. To quantize the accumulator, I am currently using q = (int32_accumulator * M) + bias, but this output does not match the intermediate output obtained from the TensorFlow Lite API.
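One possible source of the mismatch is the order of operations. As described in the paper, the bias is quantized with scale S1*S2 and zero point 0 and added to the int32 accumulator before the down-scaling by M; only then is the output zero point Z3 added and the result clamped to [0, 255]. A sketch of that ordering, reusing the hypothetical RequantizeAccumulator helper from the earlier snippet:

```cpp
// Hypothetical per-element post-processing. 'raw_acc' is the int32 convolution
// accumulator and 'bias_q' the bias already quantized with scale S1*S2 and
// zero point 0, as read from the .tflite file.
std::uint8_t PostProcess(std::int32_t raw_acc, std::int32_t bias_q,
                         std::int32_t quantized_multiplier, int right_shift,
                         std::int32_t output_zero_point) {
  // 1. Add the bias while everything is still int32.
  const std::int32_t acc = raw_acc + bias_q;
  // 2. Scale down by M = (S1*S2)/S3, add Z3, and clamp to [0, 255]; the clamp
  //    also implements the fused activation (e.g. ReLU6) when the output range
  //    was chosen accordingly.
  return RequantizeAccumulator(acc, quantized_multiplier, right_shift,
                               output_zero_point);
}
```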