@ezhulenev, thanks for the report. Could you please clarify which version of the library you are using? Does it reproduce with the latest master?
I'm testing with https://github.com/intel/mkl-dnn/commit/7de7e5d02bf687f971e7668963649728356e0c20, will check with the latest revision tomorrow.
I've reproduced this issue with the latest version for the case when MKL-DNN uses MKL. We will look into this.
Hi @ezhulenev,
The problem relates to the vpmaddubsw instruction that is used in Intel MKL / Intel MKL-DNN igemms. Note that the instruction performs the following: u8 * s8 + u8 * s8 -> s16 (with saturation).
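For reference, here is a minimal scalar model of what a single 16-bit lane of vpmaddubsw computes (plain C illustration of the semantics, not library code; the helper name is made up):

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of vpmaddubsw: two u8*s8 products
 * are summed and the result is saturated to the s16 range. */
static int16_t vpmaddubsw_lane(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1; /* exact in s32 */
    if (sum > INT16_MAX) sum = INT16_MAX;              /* saturate high */
    if (sum < INT16_MIN) sum = INT16_MIN;              /* saturate low  */
    return (int16_t)sum;
}
```

For example, vpmaddubsw_lane(255, 127, 255, 127) returns 32767 even though the exact sum is 64770.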
Intel MKL / Intel MKL-DNN s8s8_igemm works as follows (a reference sketch in plain C follows the list):

1. Shift the matrix B_s8 by 128 to make it B_u8.
2. Compute C'_s32 <-- A_s8 * B_u8:
   i. Two elements of A_s8 and two elements of B_u8 are taken to compute an s16 value (using the vpmaddubsw instruction).
   ii. Pairs of s16 values are summed into the s32 matrix C'_s32 (using the vpmaddwd instruction).
3. Subtract the compensation 128 * A_s8 (summed along the k dimension) from matrix C'_s32 to get matrix C_s32.
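As a rough illustration of the steps above, here is a minimal scalar sketch for a single dot product, in plain C (the function name is made up, the real kernels are blocked and vectorized, and it assumes k is even and that the compensation is taken over A, as in step 3):

```c
#include <stdint.h>

/* Reference sketch of the s8s8 scheme above for one output element:
 * B is shifted by 128 to u8, each pair of products goes through a
 * saturating s16 step (mimicking vpmaddubsw, step 2.i), pairs are
 * accumulated into s32 (step 2.ii), and the 128 * rowsum(A)
 * compensation is subtracted (step 3). Assumes k is even for brevity. */
static int32_t s8s8_dot_skylake_like(const int8_t *a, const int8_t *b, int k) {
    int32_t acc = 0;       /* C'_s32 accumulator */
    int32_t a_rowsum = 0;  /* sum of A along k, used for compensation */
    for (int i = 0; i < k; i += 2) {
        uint8_t b0 = (uint8_t)(b[i] + 128);     /* B_u8 = B_s8 + 128 */
        uint8_t b1 = (uint8_t)(b[i + 1] + 128);
        int32_t s = (int32_t)a[i] * b0 + (int32_t)a[i + 1] * b1;
        if (s > INT16_MAX) s = INT16_MAX;       /* step 2.i saturation */
        if (s < INT16_MIN) s = INT16_MIN;
        acc += s;                               /* step 2.ii, vpmaddwd-like */
        a_rowsum += a[i] + a[i + 1];
    }
    return acc - 128 * a_rowsum;                /* step 3: compensation */
}
```

When no intermediate sum saturates, this returns the exact A_s8 * B_s8 dot product; when saturation hits in step 2.i, the result is wrong, which is exactly the issue being discussed.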
The problem happens at step 2.i due to saturation in vpmaddubsw: s8 * u8 + s8 * u8 might not fit into s16 (for example, 127 * 255 + 127 * 255 = 64770, which is clamped to the s16 maximum of 32767).
I've created a small reproducer with reference code that mimics the Skylake behavior. See this gist, specifically igemm_repro.c at line 23. I also attached the output of Intel MKL-DNN on Broadwell (avx2), Skylake (avx512), and Cascade Lake (avx512 VNNI).
For Skylake the library gives a result that corresponds to the shifted matrix B with the intermediate int16 saturation (compare the first and the last output). 🔴
For Broadwell Intel MKL-DNN gives the exact result, because we don't have an int8-optimized igemm that uses these instructions; reference code is used instead. ✔️
For Cascade Lake the library again gives the exact result, because Cascade Lake has the VNNI instruction vpdpbusd, which accumulates the intermediate products directly in int32 and doesn't have this issue. ✔️
Unfortunately, if we want to use the int8 instructions on Skylake (to get higher performance than the f32 analogues), we have no option other than to deal with the saturation somehow.
For int8 convolutions we perform the following trick: we sacrifice some accuracy (basically one bit of the source matrix) so that the intermediate computation cannot overflow (see the sketch below). This loses a little precision, but at least doesn't give completely incorrect results.
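A minimal sketch of that kind of trick, assuming it amounts to dropping one bit of one operand so the intermediate s16 sums can no longer saturate (this is an illustration under that assumption, not the exact library implementation; the function name is made up):

```c
#include <stdint.h>

/* Illustrative "lose one bit" workaround: one operand is pre-scaled by 1/2
 * so that u8 * s7 + u8 * s7 always fits into s16, and the result is scaled
 * back by 2 at the end. Not the exact library code. Assumes k is even. */
static int32_t s8s8_dot_one_bit_trick(const int8_t *a, const int8_t *b, int k) {
    int32_t acc = 0;
    int32_t a_rowsum = 0;
    for (int i = 0; i < k; i += 2) {
        uint8_t b0 = (uint8_t)(b[i] + 128);      /* B_u8 = B_s8 + 128 */
        uint8_t b1 = (uint8_t)(b[i + 1] + 128);
        int8_t a0 = a[i] / 2, a1 = a[i + 1] / 2; /* drop one bit of A */
        int32_t s = (int32_t)a0 * b0 + (int32_t)a1 * b1;
        /* |s| <= 64 * 255 * 2 = 32640, so no s16 saturation can occur */
        acc += s;
        a_rowsum += a0 + a1;
    }
    return 2 * (acc - 128 * a_rowsum);           /* undo the 1/2 scaling */
}
```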
An alternative is to compute the integer gemm by converting the int8 values to int16 and applying the vpmaddwd instruction directly. While this would give pretty accurate results, the performance would be on par with f32 sgemm.
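A scalar sketch of that alternative (illustrative only, not library code): up-convert both operands to s16 and accumulate with the vpmaddwd pattern, which cannot saturate for s8 inputs:

```c
#include <stdint.h>

/* Accurate-but-slower path: sign-extend s8 to s16 and use the vpmaddwd
 * pattern (s16*s16 + s16*s16 -> s32). No shift or compensation is needed
 * because both operands stay signed. Assumes k is even for brevity. */
static int32_t s8s8_dot_via_s16(const int8_t *a, const int8_t *b, int k) {
    int32_t acc = 0;
    for (int i = 0; i < k; i += 2) {
        int16_t a0 = a[i], a1 = a[i + 1];   /* sign-extend s8 -> s16 */
        int16_t b0 = b[i], b1 = b[i + 1];
        acc += (int32_t)a0 * b0 + (int32_t)a1 * b1;  /* exact in s32 */
    }
    return acc;
}
```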
Thanks for the detailed explanation! Currently we use the s8s8s32 gemm from an older version of MKL-DNN (from around July 2018, I believe), and it seems to work. Was it a completely different algorithm at that time?
Right now we have at least two tests depending on s8s8s32 in TensorFlow that consistently pass:
In the old MKL-DNN version you mentioned, we used a reference implementation of the s8s8s32 gemm, which didn't have the saturation issue described by Evarist. In 2bd1840177b0130102e50418a04fa954dfa3dfab we introduced an optimized version of the s8s8s32 gemm, which unfortunately does.
Hi @ezhulenev,
If I read the first test correctly, the problem doesn't appear there because the data tensor is filled with values from 0 to 127: the minimum f32 value is 0, but the expected minimum is set to -1.
The corresponding code from the test:
```python
conv_input, _, _ = gen_array_ops.quantize_v2(
    ..., minval=-0.0, maxval=1.0, dtype=dtypes.float32),
    -1.0, 1.0, dtypes.qint8)
```
In this case, instead of s8 * u8 + s8 * u8 -> s16, the test effectively has s8 * u7 + s8 * u7 -> s16, which cannot overflow (the worst case 127 * 127 + 127 * 127 = 32258 fits into s16).
The second test covers the whole range of data, hence catches the overflow issue.
Do you think any of the workarounds above would be applicable for TF, so that it could keep using s8{s,u}8_igemm?
An alternative to the workarounds would be an optimized igemm that doesn't have this overflow issue at all (by directly up-converting i8 to i16). The problem is that the theoretical ops peak of such an igemm would be the same as sgemm on pre-AVX512-VNNI systems, i.e. there would be no significant benefit from using quantization.
I think the only use case in TensorFlow for now is FusedConv2DBiasActivation, which was originally implemented only for GPUs, and because of similar cuDNN/TensorRT limitations it seems to always receive input in the [0, 128) range... though this is not part of the TensorFlow op signature.
If that's indeed the case, I'll move it to s8u8s32_gemm (effectively s8u7s32_gemm).
Thanks for the update, @ezhulenev!
Well, then we need to update the documentation to highlight the potential issue.