oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

sgemm accuracy drops with AVX2 #1486

Open penpornk opened 1 year ago

penpornk commented 1 year ago

Summary

For some matrix sizes, sgemm accuracy drops with AVX2. The simplest case being when all elements in the result matrix should be the same. This can affect the accuracy of downstream applications, e.g., constant folding in TF.

Version

oneDNN v2.7.1 (commit a96c94f39773df35760d037d1ebe580dcf79c137)

Environment

oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. The relevant CPU ISA and runtime details (OpenMP runtime, Intel AVX2) are visible in the onednn_verbose output below.

Steps to reproduce

Call wrong_result() in the following code:

#include <cmath>
#include <cstdio>
#include "oneapi/dnnl/dnnl.h"

// Run-length prints row i of the row-major m x n matrix `mat`: values that
// differ from their left neighbor by more than 3e-7 start a new run.
void print_row(float* mat, int m, int n, int i) {
  int count = 1;
  float* row_i = mat + (i * n);
  printf("Row %3d: ", i);
  for (int j = 1; j < n; ++j) {
    if (fabs(row_i[j] - row_i[j - 1]) > 3e-7) {
      printf("%f (%d times), ", row_i[j - 1], count);
      count = 1;
    } else {
      ++count;
    }
  }
  printf("%f (%d times)\n", row_i[n - 1], count);
}

// Multiplies a 20x512 matrix filled with 1.1f by a 512x18 matrix filled with
// 0.9f. Every element of the 20x18 result C should be the same value.
void wrong_result() {
  const int m = 20;
  const int n = 18;
  const int k = 512;

  // Initialize matrices.
  float* A = new float[m * k];
  float* B = new float[k * n];
  float* C = new float[m * n];
  for (int i = 0; i < m * k; ++i) A[i] = 1.1f;
  for (int i = 0; i < k * n; ++i) B[i] = 0.9f;
  for (int i = 0; i < m * n; ++i) C[i] = 0.0f;

  // Set up matrix multiplication parameters.
  const float alpha = 1.0f;
  const float beta = 0.0f;
  const char transa = 'N';
  const char transb = 'N';
  const int lda = k;  // Row major.
  const int ldb = n;  // Row major.
  const int ldc = n;  // Row major.

  // Call oneDNN sgemm
  dnnl_sgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

  // Print results
  for (int i = 0; i < m; ++i) print_row(C, m, n, i);

  float diff = fabs(C[0] - C[m * n - 1]);
  printf("Absolute diff: %g\n", diff);
  printf("Relative diff: %g\n", diff / C[0]);

  // Free matrices
  delete[] A;
  delete[] B;
  delete[] C;
}
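
For reference, a minimal driver and an example build/run line follow; both are assumptions about the local setup rather than part of the original report. Setting ONEDNN_VERBOSE=1 produces the log shown under Observed behavior.

// Hypothetical driver for the reproducer above.
int main() {
  wrong_result();
  return 0;
}

// Example build/run (paths and flags are assumptions; adjust to your setup):
//   g++ -O2 repro.cpp -I${DNNLROOT}/include -L${DNNLROOT}/lib -ldnnl -fopenmp -o repro
//   ONEDNN_VERBOSE=1 OMP_NUM_THREADS=1 ./repro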

Observed behavior

The resulting C matrix does not have the same value in every element. A different value appears in fringe rows and columns. (Which rows are fringe rows changes with m, the total number of rows of C, and with the number of OpenMP threads.)

Example output

m = 20, #threads = 1:

onednn_verbose,info,oneDNN v2.7.1 (commit a96c94f39773df35760d037d1ebe580dcf79c137)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX2
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,exec,cpu,gemm_api,,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 dst_f32::blocked:ab:f0,,20x512:512x18,21.814
Row   0: 506.881195 (16 times), 506.879639 (2 times)
Row   1: 506.881195 (16 times), 506.879639 (2 times)
Row   2: 506.881195 (16 times), 506.879639 (2 times)
Row   3: 506.881195 (16 times), 506.879639 (2 times)
Row   4: 506.881195 (16 times), 506.879639 (2 times)
Row   5: 506.881195 (16 times), 506.879639 (2 times)
Row   6: 506.881195 (16 times), 506.879639 (2 times)
Row   7: 506.881195 (16 times), 506.879639 (2 times)
Row   8: 506.881195 (16 times), 506.879639 (2 times)
Row   9: 506.881195 (16 times), 506.879639 (2 times)
Row  10: 506.881195 (16 times), 506.879639 (2 times)
Row  11: 506.881195 (16 times), 506.879639 (2 times)
Row  12: 506.881195 (16 times), 506.879639 (2 times)
Row  13: 506.881195 (16 times), 506.879639 (2 times)
Row  14: 506.881195 (16 times), 506.879639 (2 times)
Row  15: 506.881195 (16 times), 506.879639 (2 times)
Row  16: 506.881195 (16 times), 506.879639 (2 times)
Row  17: 506.881195 (16 times), 506.879639 (2 times)
Row  18: 506.879639 (18 times)
Row  19: 506.879639 (18 times)
Absolute diff: 0.0015564
Relative diff: 3.07054e-06

m=100, #threads = 16:

onednn_verbose,info,oneDNN v2.7.1 (commit a96c94f39773df35760d037d1ebe580dcf79c137)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:16
onednn_verbose,info,cpu,isa:Intel AVX2
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,exec,cpu,gemm_api,,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 dst_f32::blocked:ab:f0,,100x512:512x18,35.9971
Row   0: 506.881195 (16 times), 506.879639 (2 times)
Row   1: 506.881195 (16 times), 506.879639 (2 times)
Row   2: 506.881195 (16 times), 506.879639 (2 times)
Row   3: 506.881195 (16 times), 506.879639 (2 times)
Row   4: 506.881195 (16 times), 506.879639 (2 times)
Row   5: 506.881195 (16 times), 506.879639 (2 times)
Row   6: 506.879639 (18 times)
Row   7: 506.881195 (16 times), 506.879639 (2 times)
Row   8: 506.881195 (16 times), 506.879639 (2 times)
Row   9: 506.881195 (16 times), 506.879639 (2 times)
Row  10: 506.881195 (16 times), 506.879639 (2 times)
Row  11: 506.881195 (16 times), 506.879639 (2 times)
Row  12: 506.881195 (16 times), 506.879639 (2 times)
Row  13: 506.879639 (18 times)
...
Row  91: 506.881195 (16 times), 506.879639 (2 times)
Row  92: 506.881195 (16 times), 506.879639 (2 times)
Row  93: 506.881195 (16 times), 506.879639 (2 times)
Row  94: 506.881195 (16 times), 506.879639 (2 times)
Row  95: 506.881195 (16 times), 506.879639 (2 times)
Row  96: 506.881195 (16 times), 506.879639 (2 times)
Row  97: 506.879639 (18 times)
Row  98: 506.879639 (18 times)
Row  99: 506.879639 (18 times)
Absolute diff: 0.0015564
Relative diff: 3.07054e-06

Expected behavior

All elements in the matrix should have the same value.

vpirogov commented 1 year ago

Thank you for the report and the reproducer, @penpornk.

+@akharito, @aaraujom

aaraujom commented 1 year ago

@penpornk - Thanks for the reproducer; we can reproduce the same output on our side. We are taking a look.

Tagging @dzarukin to keep him in the loop.

mgouicem commented 1 year ago

Hi @penpornk and thanks for the report. In oneDNN, we typically have no requirement on the order in which values are accumulated. In GEMM/Matmul/Convolution this can manifest itself, for example, as a different accumulation order between full blocks and tails.

Adding requirements on the order of accumulation could have, and typically has, an impact on oneDNN performance. Could you clarify how this particular behavior with respect to the order of accumulation affects your application? Can you confirm that the accuracy drop in the end-user application is indeed due to the different order of accumulation between tails and blocks? Or is it a more general issue with the order of accumulation?

Edit: in general, if the end user application accuracy is impacted by the order of accumulation, I would be tempted to say that this application is not very numerically stable. Even if we fix the order of accumulation in oneDNN, we still might pick the "wrong" order for a given application to reach maximum accuracy.

rahulpalamuttam commented 1 year ago

Thanks @mgouicem.

This error prevents a constant compression optimization of a constant folded matrix-multiply whose LHS is a matrix with the rows broadcasted (so the rows are just repeats).

If the LHS of a matrix-multiply has rows that are all the same, then the output will also have rows that are all the same. From there, we can compress that constant in a further optimization, which this discrepancy prevents, since the last row of the constant-folded matrix-multiply is not bit-exact with the other rows.
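
To make that precondition concrete, a check along these lines (a hypothetical helper, not actual grappler code) is what fails today because of the tail rows:

#include <cstring>

// Returns true if every row of the row-major m x n matrix C is a bit-exact
// copy of row 0, i.e. the folded constant can be compressed to a single row.
bool rows_bit_identical(const float* C, int m, int n) {
  for (int i = 1; i < m; ++i)
    if (std::memcmp(C, C + i * n, n * sizeof(float)) != 0) return false;
  return true;
}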

This is problematic specifically because constant folding in grappler relies on the TF MatMul kernel for CPUs which invokes GEMM.

Outside of compiler optimizations around constants that depend on bit-exactness (which could warrant not using gemm for such purposes), I suspect there are researchers relying on broadcasting their arrays before feeding them to a matrix-multiply.

Even if we fix the order of accumulation in oneDNN, we still might pick the "wrong" order for a given application to reach maximum accuracy.

Sure, but so is a matrix-multiply whose LHS has all rows repeated producing an output whose rows are not all repeated.

I'm happy with a mode where we can guarantee determinism, but I'm also happy to hear more from folks working on Grappler, MLIR's TF dialect, other compilers leveraging gemm for constant folding, or researchers relying on the accuracy of operations.

mgouicem commented 1 year ago

This error prevents a constant compression optimization of a constant folded matrix-multiply whose LHS is a matrix with the rows broadcasted (so the rows are just repeats).

If the LHS of a matrix-multiply has rows that are all the same, then the output will also have rows that are all the same. From there, we can compress that constant in a further optimization, which this discrepancy prevents, since the last row of the constant-folded matrix-multiply is not bit-exact with the other rows.

For this particular case, it would seem better to swap the matrix-multiplication and broadcast operations. This would both save some flops and guarantee that all rows are identical. Is it something doable on your side?
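
A rough sketch of that reordering (a sketch only; the helper name is hypothetical, and it assumes the broadcast LHS is represented by a single row a_row of length k, with B stored row-major as k x n):

#include <cstring>
#include "oneapi/dnnl/dnnl.h"

// Compute one representative output row with sgemm, then replicate it m times.
// All rows of C are bit-exact copies by construction.
void matmul_then_broadcast(const float* a_row, const float* B, float* C,
                           int m, int n, int k) {
  float* c_row = new float[n]();
  // c_row (1 x n) = a_row (1 x k) * B (k x n), row major.
  dnnl_sgemm('N', 'N', 1, n, k, 1.0f, a_row, k, B, n, 0.0f, c_row, n);
  for (int i = 0; i < m; ++i) std::memcpy(C + i * n, c_row, n * sizeof(float));
  delete[] c_row;
}

This saves the redundant flops and sidesteps the tail-ordering question entirely.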

This is problematic specifically because constant folding in grappler relies on the TF MatMul kernel for CPUs which invokes GEMM.

Sure, but so is a matrix-multiply whose LHS has all rows repeated producing an output whose rows are not all repeated.

I'm happy with a mode where we can guarantee determinism

If swapping broadcast and Matmul is not an option on your side, we can investigate if preserving accumulation order in tails does not hurt performance too much. If it does, we would likely have to expose a mode as you mentioned, which would likely dispatch an unoptimized implementation. Is that a viable option for TF?

Tagging @aaraujom and @akharito for their perspective on this suggestion.

aaraujom commented 1 year ago

From the gemm side it would be non-trivial to guarantee the same order of computations, and it would have performance implications. Not only would tail handling have to be changed, but we would also have to be careful about how we perform parallel reductions.

On small sizes and with multithreading we might end up with a kernel executing one main block plus a tail. Low performance on the tail side would reduce overall performance, I think.

rahulpalamuttam commented 1 year ago

For this particular case, it would seem better to swap the matrix-multiplication and broadcast operations.

It's tricky because constant folding operates under the assumption that all inputs to an operation are constants, so it has to follow breadth-first order. We likely can't reorder this for all the ways a compiler can generate this IR fragment during compilation (for all such IR formats/dialects). It might be possible by adding a canonicalization rewrite pattern, but it's still cumbersome, since any ML compiler leveraging gemm for constant folding would have to buy into this at all stages of IR transforms that deal with explicit matmul ops.

I also want to be mindful of the user experience of researchers, since they would need to be aware of this if they want accurate LHS-broadcasted matmuls in practice.

we can investigate if preserving accumulation order in tails does not hurt performance too much. If it does, we would likely have to expose a mode as you mentioned, which would likely dispatch an unoptimized implementation.

Thanks, I completely appreciate the effort! I think an unoptimized implementation of the kernel can be used for constant folding, but I think someone on the TF team, or working on grappler, or the TF MLIR dialect might be a better authority.

From the gemm side it would be non-trivial to guarantee the same order of computations, and it would have performance implications. Not only would tail handling have to be changed, but we would also have to be careful about how we perform parallel reductions.

Right, I figured the order of accumulate is the tricky bit here.

On small sizes and with multithreading we might end up with kernel executing one main block plus a tail. Low performance on tail side would reduce overall performance I think.

Makes sense; the ordering affects the distribution of work and we get throttled on the tail.

mgouicem commented 1 year ago

Just to make double-sure we are trying to solve the right problem (@penpornk @rahulpalamuttam please confirm): Is the issue with sgemm(broadcast_rows(A), B) != broadcast_rows(sgemm(A, B)), in other words, if A rows are all equal, a user would expect all rows of the output to be equal too? Same with sgemm(A, broadcast_cols(B)) != broadcast_cols(sgemm(A, B)), where a user would expect that if B columns are all equal, all columns of the output should be equal too.

If this is correct, why is this property necessary? (Other than that it makes sense :-).) In particular, why is it required for constant folding, but not for non-constant computation? Just trying to understand when this property is required, to see what possible solutions we can offer and what performance tradeoffs are involved.

Makes sense; the ordering affects the distribution of work and we get throttled on the tail.

From the oneDNN perspective, the issue is not with parallelization over K: sgemm implementations always add partial sums in order, so that results are run-to-run reproducible (running the same program twice on the same system/environment with the same inputs returns the same result). The issue here is mostly that for M/N tails, we increase the number of accumulators in registers by splitting the reduction over K in two. Having two sets of accumulator registers (accumulation over even K indices, and accumulation over odd K indices) lets the sgemm tail kernel better interleave instructions and better use the CPU execution ports.
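
As a toy illustration of that register-splitting effect (plain scalar C++, not the actual kernel code; exact values depend on the compiler's floating-point contraction settings):

#include <cstdio>

int main() {
  const int k = 512;
  const float p = 1.1f * 0.9f;  // every partial product is the same value

  // Single accumulator: add the k products strictly in order.
  float seq = 0.0f;
  for (int i = 0; i < k; ++i) seq += p;

  // Two accumulators (even/odd K indices) combined at the end, mimicking the
  // interleaved schedule the tail kernel uses to fill the execution ports.
  float even = 0.0f, odd = 0.0f;
  for (int i = 0; i < k; i += 2) { even += p; odd += p; }
  float split = even + odd;

  // Both schemes are run-to-run reproducible, but they can round differently.
  printf("sequential: %.6f  split: %.6f  diff: %g\n", seq, split, seq - split);
  return 0;
}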

rahulpalamuttam commented 1 year ago

Is the issue with sgemm(broadcast_rows(A), B) != broadcast_rows(sgemm(A, B)), in other words, if A rows are all equal, a user would expect all rows of the output to be equal too?

Yes

Same with sgemm(A, broadcast_cols(B)) != broadcast_cols(sgemm(A, B))

I haven't explicitly tested it with cols, but I imagine it happens as well.

In particular, why is it required for constant folding, but not for non-constant computation?

It's required for both, which is why for the latter I want input from ML practitioners and not just compiler folk. Having been one at some point, I know that tiny deltas like these can balloon into bigger non-determinism in model training, which is not fun to deal with. Of course, outside of the world of ML and compilers, sgemm is important in a variety of other fields :)

Just trying to understand when this property is required, to see what possible solutions we can offer and what performance tradeoffs are involved.

At the risk of over-claiming, I think this property is always required/preferred, but I agree there are also instances where performance matters more than absolute, deterministic accuracy.

We absolutely want deterministic accuracy for constant folding, so I think a flag or a different implementation to turn on for the TF kernels used during constant folding should be OK in the short term.

FWIW, the original bug was in an ML research use case over a year ago. I just re-discovered it for constant folding. This is why I want to say it's always required/preferred, but folks with more authority in both areas can weigh in.

rahulpalamuttam commented 1 year ago

ping @mgouicem

we can investigate if preserving accumulation order in tails does not hurt performance too much

I'm curious if there's some initial investigation results.

mgouicem commented 1 year ago

Hi @rahulpalamuttam, sorry, no results yet. We will update this thread as we get some data.

rahulpalamuttam commented 1 year ago

Hi @mgouicem,

Happy new year!

I was wondering if you were able to gather any data for this issue.

Cheers,

Rahul P

aaraujom commented 1 year ago

Hi @rahulpalamuttam - I didn't look into this issue with regard to performance. However, we have internal plans to move the matmul and inner product implementations to brgemm kernels, which don't have the issue you mentioned.

rahulpalamuttam commented 1 year ago

However, we have internal plans to move the matmul and inner product implementations to brgemm kernels, which don't have the issue you mentioned.

Do you mind sharing any rough outline of timelines here?

I'd like a resolution to this bug, and if it means changing the matmul kernels used, it would be nice to know when TF should switch over.

@penpornk

aaraujom commented 1 year ago

Do you mind sharing any rough outline of timelines here?

From oneDNN side we have plans to have avx2 using brgemm kernels by the end of Q1 if everything goes well.

I'd like a resolution to this bug

Although the results are not bitwise identical for the scenario in the reproducer, the results computed are actually more accurate for the tail parts.

The exact value of each entry of the result matrix C is $506.88 = \frac{11}{10} \cdot \frac{9}{10} \cdot 512$. The error in the main parts is $506.881195 - 506.88 \approx 0.001195$ and the error in the tail parts is $506.879639 - 506.88 \approx 0.000361$. The tail parts are more accurate because we can accumulate in 2 sets of registers and then add them together, instead of accumulating one by one.

Changing to a kernel that always accumulates in the same order (along the k-dimension) wouldn't necessarily improve accuracy, but it would make all elements of the matrix the same.

mgouicem commented 1 year ago

Happy new year all! :-)

@aaraujom, I don't think this is an accuracy issue, but mostly a semantic one. @rahulpalamuttam wants to compute a broadcast operation followed by a matmul operation (see this comment). Hence, a deterministic dot product within an sgemm computation is expected here. The accumulation order / accuracy does not matter here, as long as we use the same order across rows/columns.

rahulpalamuttam commented 1 year ago

@rahulpalamuttam wants to compute a broadcast operation followed by a matmul operation (see https://github.com/oneapi-src/oneDNN/issues/1486#issuecomment-1306301809). Hence, a deterministic dot product within an sgemm computation is expected here. The accumulation order / accuracy does not matter here, as long as we use the same order across rows/columns.

+100

WilliamTambellini commented 1 year ago

FYI: I'm facing the same issue with oneDNN 3.0.0. Question: does it matter whether M, K, or N is a multiple of 8?

aaraujom commented 1 year ago

@WilliamTambellini - The sizes of M and N are relevant. If N % 16 > 8 and M % 6 > 3, that would avoid the issue, as far as I understand.

The problem comes from accumulating along the K dimension in different orders in kernel_16x[123] and kernel_8x* from jit_avx_gemm_f32.cpp. The accumulation order difference comes from here, if I remember correctly. (Note that the kernels are written in column-major ordering while dnnl_sgemm expects row-major ordering, so swap M and N when checking the kernels.)

I'm not sure whether I missed another part of the kernels that changes the accumulation order, but if the above doesn't work, avoiding tail cases completely (N % 16 == 0 and M % 6 == 0) should avoid any special tail handling that changes the accumulation order in the K dimension.
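
For anyone needing a caller-side stop-gap in the meantime, a sketch along those lines might look as follows. The helper and the blocking factors 16/6 are assumptions lifted from the explanation above and may change between oneDNN versions; it costs extra memory and copies and is not a supported knob.

#include <cstring>
#include <vector>
#include "oneapi/dnnl/dnnl.h"

// Pad N up to a multiple of 16 and M up to a multiple of 6 (row-major call)
// so the AVX2 sgemm runs only full blocks, then copy out the m x n result.
void sgemm_padded(const float* A, const float* B, float* C, int m, int n, int k) {
  const int mp = ((m + 5) / 6) * 6;     // round M up to a multiple of 6
  const int np = ((n + 15) / 16) * 16;  // round N up to a multiple of 16
  std::vector<float> Ap(mp * k, 0.0f), Bp(k * np, 0.0f), Cp(mp * np, 0.0f);
  for (int i = 0; i < m; ++i) std::memcpy(&Ap[i * k], A + i * k, k * sizeof(float));
  for (int i = 0; i < k; ++i) std::memcpy(&Bp[i * np], B + i * n, n * sizeof(float));
  dnnl_sgemm('N', 'N', mp, np, k, 1.0f, Ap.data(), k, Bp.data(), np, 0.0f,
             Cp.data(), np);
  for (int i = 0; i < m; ++i) std::memcpy(C + i * n, &Cp[i * np], n * sizeof(float));
}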

Anyways, that's not a fix. We are adding avx2 BRGEMM support for matmul, but it is not quite ready yet.

Shreyas-fuj commented 11 months ago

Hi @aaraujom - Is the fix ready for avx2 brgemm yet? Can you please also point out the area in brgemm where this accumulation issue can be handled?

aaraujom commented 10 months ago

Is the fix ready for avx2 brgemm yet?

Unfortunately, not yet for matmul. We added brgemm avx2 kernels, but we still need some work to enable them for matmul.

Can you please also point out the area in brgemm where this accumulation issue can be handled?

brgemm kernels are different from gemm kernels. brgemm kernels keep the same order of accumulation along the reduce dimension (k-dim), which avoids the issue reported above.