Our quantization scheme (that paper) assumes that the min-max ranges of all arrays involved, including the output activation tensor, are given. I don't know how to perform quantized inference in a way that's both efficient and accurate without that information.
I am using the data provided by "mobilenet_v1_1.0_224_quant.tflite". The tflite file provides the following details for the first layer:

input: -1 ≤ 0.0078125 * (q - 128) ≤ 0.9921875
weights: -3.265998125076294 ≤ 0.02182667888700962 * (q - 151) ≤ 2.2779781818389893
bias: 0.00017052092880476266 * q
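For example, the input range follows directly from the affine mapping real = S1 * (q - Z1) with q ranging over [0, 255]: 0.0078125 * (0 - 128) = -1 and 0.0078125 * (255 - 128) = 0.9921875.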
So from the above data, I assume:

S1 = 0.0078125, Z1 = -128
S2 = 0.02182667888700962, Z2 = -151
S3 = 0.00017052092880476266, Z3 = 0

Thus the quantization multiplier is M = (S1*S2)/S3. After converting it into a fixed-point multiplier, this becomes:

int multiplier = 1992157696
bit shift = 7
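For reference, here is a minimal sketch of how a real multiplier in (0, 1) can be decomposed into an int32 fixed-point multiplier plus a right shift. It mirrors the QuantizeMultiplierSmallerThanOne helper shown in gemmlowp's quantization example, but the exact naming and rounding here are my own:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decompose a real multiplier M in (0, 1) into an int32 fixed-point multiplier
// and a right shift, so that M ~= quantized_multiplier * 2^-31 * 2^-right_shift.
void QuantizeMultiplierSmallerThanOne(double real_multiplier,
                                      std::int32_t* quantized_multiplier,
                                      int* right_shift) {
  assert(real_multiplier > 0.0 && real_multiplier < 1.0);
  int shift = 0;
  // Scale the multiplier into [0.5, 1) by repeated doubling.
  while (real_multiplier < 0.5) {
    real_multiplier *= 2.0;
    ++shift;
  }
  // Represent the scaled value as a Q0.31 fixed-point number.
  std::int64_t q =
      static_cast<std::int64_t>(std::round(real_multiplier * (1ll << 31)));
  assert(q <= (1ll << 31));
  // Handle the edge case where rounding pushed us up to exactly 2^31.
  if (q == (1ll << 31)) {
    q /= 2;
    --shift;
  }
  assert(shift >= 0);
  assert(q <= std::numeric_limits<std::int32_t>::max());
  *quantized_multiplier = static_cast<std::int32_t>(q);
  *right_shift = shift;
}
```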
final output = ((conv_output_int32 * multiplier) / 2^31) >> shift
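A minimal sketch of applying that formula to one accumulator value, assuming the bias has already been added in int32 and that output_zero_point is the output tensor's Z3 (0 in the example above). Note that gemmlowp/TFLite uses a saturating rounding-doubling high multiply and a rounding right shift, so the simplified rounding below can differ slightly from the reference; off-by-one or off-by-two differences per element are often just this kind of rounding mismatch:

```cpp
#include <algorithm>
#include <cstdint>

// Requantize one int32 accumulator value to uint8.
std::uint8_t RequantizeAccumulator(std::int32_t acc,
                                   std::int32_t quantized_multiplier,
                                   int right_shift,
                                   std::int32_t output_zero_point) {
  // Multiply in 64-bit to avoid overflow, then divide by 2^31 with rounding.
  std::int64_t prod = static_cast<std::int64_t>(acc) * quantized_multiplier;
  std::int64_t rounded = (prod + (1ll << 30)) >> 31;
  // Apply the remaining power-of-two right shift, again with rounding.
  if (right_shift > 0) {
    rounded = (rounded + (1ll << (right_shift - 1))) >> right_shift;
  }
  // Add the output zero point and clamp to the uint8 range.
  std::int64_t result = rounded + output_zero_point;
  result = std::max<std::int64_t>(0, std::min<std::int64_t>(255, result));
  return static_cast<std::uint8_t>(result);
}
```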
There is a slight difference between the output generated by the TensorFlow Lite code and the output generated by my code for the 1st layer, e.g. my output = 161, TensorFlow Lite = 163.
I think this difference amplifies through the network, and at the end I get the wrong classification.
Is the procedure right?
Thank you @bjacob
@kmhatre14 Have you fixed the accuracy problem? After the matrix multiplication, the error will grow.
The small difference in the output does not affect the classification. Our activation maps are about 99% similar to the activation maps generated by TFLite, and the confidence score also matches the TFLite output to about 99%.
The code below has been tested and verified: https://github.com/Ushma30/MobileNet-V1/tree/MobileNet-V1-Quantized
I am trying to implement the quantized version of MobileNet v1 in OpenCL, following the method described in https://arxiv.org/pdf/1712.05877.pdf. I am using pretrained MobileNet weights from the tflite file and have extracted all the required quantization parameters (e.g. S1, S2, and S3) from it. The only issue is converting the accumulator back from int32 to uint8. The gemmlowp kernel uses the min and max of the output tensor to quantize the accumulator from int32 to uint8, but since my implementation runs in OpenCL I cannot get the min and max values of the output tensor at runtime; I would have to write additional logic on the host side, which would add execution time.
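In the scheme from the paper, the output tensor's min/max range is not computed at runtime at all: it is determined ahead of time and stored in the .tflite model, and the output scale S3 and zero point Z3 are derived from it. A small sketch of that derivation, assuming a uint8 target; the function name is my own:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Derive the (scale, zero_point) pair for a uint8 tensor from its real-valued
// min/max range, following the affine mapping real = scale * (q - zero_point).
void ChooseQuantizationParams(float min, float max,
                              float* scale, std::int32_t* zero_point) {
  // The representable range must contain 0 so that zero is exactly representable.
  min = std::min(min, 0.0f);
  max = std::max(max, 0.0f);
  *scale = (max - min) / 255.0f;
  // The zero point is the uint8 value that maps to real 0, nudged into [0, 255].
  const float initial_zero_point = -min / *scale;
  std::int32_t nudged =
      static_cast<std::int32_t>(std::round(initial_zero_point));
  *zero_point = std::max(0, std::min(255, nudged));
}
```

With the first layer's stored input range [-1, 0.9921875], this gives scale = 1.9921875 / 255 = 0.0078125 and zero_point = 128, i.e. exactly the values read from the .tflite file; the same applies to each output tensor, so nothing needs to be computed on the device (or the host) at inference time.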
M = (S1*S2)/S3. To quantize the accumulator, I am currently using q = (int32_accumulator * M) + bias, but this output does not match the intermediate output obtained from the TensorFlow Lite API.
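One possible source of the mismatch is the order of operations. As described in the paper, the bias is quantized with scale S1*S2 and zero point 0 and added to the int32 accumulator before the down-scaling by M; only then is the output zero point Z3 added and the result clamped to [0, 255]. A sketch of that ordering, reusing the hypothetical RequantizeAccumulator helper from the earlier snippet:

```cpp
// Hypothetical per-element post-processing. 'raw_acc' is the int32 convolution
// accumulator and 'bias_q' the bias already quantized with scale S1*S2 and
// zero point 0, as read from the .tflite file.
std::uint8_t PostProcess(std::int32_t raw_acc, std::int32_t bias_q,
                         std::int32_t quantized_multiplier, int right_shift,
                         std::int32_t output_zero_point) {
  // 1. Add the bias while everything is still int32.
  const std::int32_t acc = raw_acc + bias_q;
  // 2. Scale down by M = (S1*S2)/S3, add Z3, and clamp to [0, 255]; the clamp
  //    also implements the fused activation (e.g. ReLU6) when the output range
  //    was chosen accordingly.
  return RequantizeAccumulator(acc, quantized_multiplier, right_shift,
                               output_zero_point);
}
```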