@ezhulenev, thanks for the report. Could you please clarify which version of the library you are using? Does it reproduce with the latest master?
I'm testing with https://github.com/intel/mkl-dnn/commit/7de7e5d02bf687f971e7668963649728356e0c20, will check with the latest revision tomorrow.
I've reproduced this issue with the latest version for the case when MKL-DNN uses MKL. We will look into this.
Hi @ezhulenev,
The problem relates to the vpmaddubsw instruction that is used in Intel MKL / Intel MKL-DNN igemms. Note that the instruction performs the following: u8 * s8 + u8 * s8 -> s16 (with saturation).
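For reference, here is a minimal scalar model of what a single 16-bit lane of vpmaddubsw computes (plain C illustration of the semantics, not library code; the helper name is made up):

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of vpmaddubsw: two u8*s8 products
 * are summed and the result is saturated to the s16 range. */
static int16_t vpmaddubsw_lane(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1; /* exact in s32 */
    if (sum > INT16_MAX) sum = INT16_MAX;              /* saturate high */
    if (sum < INT16_MIN) sum = INT16_MIN;              /* saturate low  */
    return (int16_t)sum;
}
```

For example, vpmaddubsw_lane(255, 127, 255, 127) returns 32767 even though the exact sum is 64770.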
Intel MKL / Intel MKL-DNN s8s8_igemm works as follows (a reference sketch in plain C follows the list):

1. Shift the matrix B_s8 by 128 to make it B_u8.
2. Compute C'_s32 <-- A_s8 * B_u8:
   i. Two elements of A_s8 and two elements of B_u8 are taken to compute an s16 value (using the vpmaddubsw instruction).
   ii. Pairs of s16 values are summed into the s32 matrix C'_s32 (using the vpmaddwd instruction).
3. Subtract the compensation 128 * A_s8 (summed along the k dimension) from matrix C'_s32 to get matrix C_s32.
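As a rough illustration of the steps above, here is a minimal scalar sketch for a single dot product, in plain C (the function name is made up, the real kernels are blocked and vectorized, and it assumes k is even and that the compensation is taken over A, as in step 3):

```c
#include <stdint.h>

/* Reference sketch of the s8s8 scheme above for one output element:
 * B is shifted by 128 to u8, each pair of products goes through a
 * saturating s16 step (mimicking vpmaddubsw, step 2.i), pairs are
 * accumulated into s32 (step 2.ii), and the 128 * rowsum(A)
 * compensation is subtracted (step 3). Assumes k is even for brevity. */
static int32_t s8s8_dot_skylake_like(const int8_t *a, const int8_t *b, int k) {
    int32_t acc = 0;       /* C'_s32 accumulator */
    int32_t a_rowsum = 0;  /* sum of A along k, used for compensation */
    for (int i = 0; i < k; i += 2) {
        uint8_t b0 = (uint8_t)(b[i] + 128);     /* B_u8 = B_s8 + 128 */
        uint8_t b1 = (uint8_t)(b[i + 1] + 128);
        int32_t s = (int32_t)a[i] * b0 + (int32_t)a[i + 1] * b1;
        if (s > INT16_MAX) s = INT16_MAX;       /* step 2.i saturation */
        if (s < INT16_MIN) s = INT16_MIN;
        acc += s;                               /* step 2.ii, vpmaddwd-like */
        a_rowsum += a[i] + a[i + 1];
    }
    return acc - 128 * a_rowsum;                /* step 3: compensation */
}
```

When no intermediate sum saturates, this returns the exact A_s8 * B_s8 dot product; when saturation hits in step 2.i, the result is wrong, which is exactly the issue being discussed.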
The problem happens at step 2.i due to saturation in vpmaddubsw: s8 * u8 + s8 * u8 might not fit into s16 (for example, 127 * 255 + 127 * 255 = 64770, which is clamped to the s16 maximum of 32767).
I've created a small reproducer with reference code that mimics the Skylake behavior. See this gist, specifically igemm_repro.c at line 23. I also attached the output of Intel MKL-DNN on Broadwell (avx2), Skylake (avx512), and Cascade Lake (avx512 VNNI).
For Skylake the library gives a result that corresponds to the shifted matrix B with the intermediate int16 saturation (compare the first and the last output). 🔴
For Broadwell Intel MKL-DNN gives the exact result, because we don't have an int8-optimized igemm that uses these instructions; reference code is used instead. ✔️
For Cascade Lake the library again gives the exact result, because Cascade Lake has the VNNI instruction vpdpbusd, which accumulates the intermediate products directly in int32 and doesn't have this issue. ✔️
Unfortunately, if we want to use the int8 instructions on Skylake (to get higher performance than the f32 analogues), we have no option other than to deal with the saturation somehow.
For int8 convolutions we perform the following trick: we sacrifice some accuracy (basically one bit of the source matrix) so that the intermediate computation cannot overflow (see the sketch below). This loses a little precision, but at least doesn't give completely incorrect results.
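A minimal sketch of that kind of trick, assuming it amounts to dropping one bit of one operand so the intermediate s16 sums can no longer saturate (this is an illustration under that assumption, not the exact library implementation; the function name is made up):

```c
#include <stdint.h>

/* Illustrative "lose one bit" workaround: one operand is pre-scaled by 1/2
 * so that u8 * s7 + u8 * s7 always fits into s16, and the result is scaled
 * back by 2 at the end. Not the exact library code. Assumes k is even. */
static int32_t s8s8_dot_one_bit_trick(const int8_t *a, const int8_t *b, int k) {
    int32_t acc = 0;
    int32_t a_rowsum = 0;
    for (int i = 0; i < k; i += 2) {
        uint8_t b0 = (uint8_t)(b[i] + 128);      /* B_u8 = B_s8 + 128 */
        uint8_t b1 = (uint8_t)(b[i + 1] + 128);
        int8_t a0 = a[i] / 2, a1 = a[i + 1] / 2; /* drop one bit of A */
        int32_t s = (int32_t)a0 * b0 + (int32_t)a1 * b1;
        /* |s| <= 64 * 255 * 2 = 32640, so no s16 saturation can occur */
        acc += s;
        a_rowsum += a0 + a1;
    }
    return 2 * (acc - 128 * a_rowsum);           /* undo the 1/2 scaling */
}
```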
An alternative is to compute the integer gemm by converting the int8 values to int16 and applying the vpmaddwd instruction directly. While this would give pretty accurate results, the performance would be on par with f32 sgemm.
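A scalar sketch of that alternative (illustrative only, not library code): up-convert both operands to s16 and accumulate with the vpmaddwd pattern, which cannot saturate for s8 inputs:

```c
#include <stdint.h>

/* Accurate-but-slower path: sign-extend s8 to s16 and use the vpmaddwd
 * pattern (s16*s16 + s16*s16 -> s32). No shift or compensation is needed
 * because both operands stay signed. Assumes k is even for brevity. */
static int32_t s8s8_dot_via_s16(const int8_t *a, const int8_t *b, int k) {
    int32_t acc = 0;
    for (int i = 0; i < k; i += 2) {
        int16_t a0 = a[i], a1 = a[i + 1];   /* sign-extend s8 -> s16 */
        int16_t b0 = b[i], b1 = b[i + 1];
        acc += (int32_t)a0 * b0 + (int32_t)a1 * b1;  /* exact in s32 */
    }
    return acc;
}
```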
Thanks for the detailed explanation! Currently we use the s8s8s32 gemm from an older version of MKL-DNN (from around July 2018, I believe), and it seems to work. Was it a completely different algorithm at that time?
Right now we have at least two tests depending on s8s8s32 in TensorFlow that consistently pass:
In the old MKL-DNN version you mentioned, we used a reference implementation of the s8s8s32 gemm, which didn't have the saturation issue described by Evarist. In 2bd1840177b0130102e50418a04fa954dfa3dfab we introduced an optimized version of the s8s8s32 gemm, which unfortunately does.
Hi @ezhulenev,
If I read the first test correctly, the problem doesn't appear there because the data tensor is filled with values from 0 to 127: the minimum f32 value is 0, but the expected minimum is set to -1.
The corresponding code from the test:
```python
conv_input, _, _ = gen_array_ops.quantize_v2(
    ..., minval=-0.0, maxval=1.0, dtype=dtypes.float32),
    -1.0, 1.0, dtypes.qint8)
```
In this case, instead of s8 * u8 + s8 * u8 -> s16, the test effectively has s8 * u7 + s8 * u7 -> s16, which cannot overflow (the worst case 127 * 127 + 127 * 127 = 32258 fits into s16).
The second test covers the whole range of data, hence catches the overflow issue.
Do you think any of the workarounds above would be applicable for TF, so that it could keep using s8{s,u}8_igemm?
An alternative to the workarounds would be an optimized igemm that doesn't have this overflow issue at all (by directly up-converting i8 to i16). The problem is that the theoretical ops peak of such an igemm would be the same as sgemm on pre-AVX512-VNNI systems, i.e. there would be no significant benefit from using quantization.
I think the only use case in TensorFlow for now is FusedConv2DBiasActivation, which was originally implemented only for GPUs, and because of similar cuDNN/TensorRT limitations it seems to always receive input in the [0, 128) range... though this is not part of the TensorFlow op signature.
If that's indeed the case, I'll move it to s8u8s32_gemm (effectively s8u7s32_gemm).
Thanks for the update, @ezhulenev!
Well, then we need to update the documentation to highlight the potential issue.