Is this product range of int8*int8 in comment document expected?

google / gemmlowp

Low-precision matrix multiplication

Apache License 2.0

1.77k stars 451 forks source link

Is this product range of int8*int8 in comment document expected? #174

Closed zhenhuaw-me closed 5 years ago

zhenhuaw-me commented 5 years ago

Hi there,

In https://github.com/google/gemmlowp/blob/58825b1f969451fc0462148f38d016b253fb40e9/internal/kernel_neon.h#L708-L709 I guess the product range should be [ (-2^7)*(2^7 - 1), (-2^7)*(-2^7)] which is in (-2^14, 2^14] - closed for 2^14? If we are putting a int8*int8 + int8*int8 in int16, do we need the assumption that -128 is not included in int8 (from B. Appendix: ARM NEON details of the paper)? To me, if int8 takes -128, the int8*int8 + int8*int8 could be as large as 2^15 which cannot be hold in int16.

Thanks

zhenhuaw-me commented 5 years ago

Looked into the tflite/gemmlowp stack. So, for quantized conv https://github.com/tensorflow/tensorflow/blob/c865ec5621c013a7f8a4a26d380782e63117224f/tensorflow/lite/kernels/internal/optimized/optimized_ops.h#L2082-L2085 , which loads lhs (filter, value range $(0, 255]$) and rhs (input, value range $[0, 255]$), the int8*int8 + int8*int8 value size can hold in int16.

However, I am not sure how the input uint8 data https://github.com/google/gemmlowp/blob/58825b1f969451fc0462148f38d016b253fb40e9/internal/kernel_neon.h#L942-L945 which is loaded by https://github.com/google/gemmlowp/blob/58825b1f969451fc0462148f38d016b253fb40e9/internal/kernel_neon.h#L958 can be computed by signed instructions like and https://github.com/google/gemmlowp/blob/58825b1f969451fc0462148f38d016b253fb40e9/internal/kernel_neon.h#L988-L989

Would you please give a hint?

bjacob commented 5 years ago

You are right that the comment at kernel_neon.h:708 is incorrect. It fails to mention that in order to avoid overflow in int16 := int8*int8 + int8*int8, it is necessary to require the int8 values to avoid the value -128.

As you found, this has been amended in the paper and in the way that TFLite uses this. As you found, there is a signedness discrepancy between on the one hand, the 8bit buffers in TFlite and at the API surface of gemmlowp, where everything is unsigned uint8, and in the kernels internally in gemmlowp, where everything is signed int8. The switch from unsigned to signed is implemented in the 'packing' phase of gemmlowp.

For the pack/compute/unpack phases of gemmlowp, refer to this doc: https://github.com/google/gemmlowp/blob/master/doc/design.md The portable (not NEON) implementation of the packing phase is this file: https://github.com/google/gemmlowp/blob/master/internal/pack.h Inside it, here is where the unsigned->signed conversion occurs: https://github.com/google/gemmlowp/blob/58825b1f969451fc0462148f38d016b253fb40e9/internal/pack.h#L272-L273

zhenhuaw-me commented 5 years ago

Thank you @bjacob for the detailed knowledge sharing! That's really helpful! I didn't notice that there is a packing process in gemmlowp, should have read the docs carefully.