penpornk opened this issue 1 year ago (status: Open).
Thank you for the report and the reproducer, @penpornk.
+@akharito, @aaraujom
@penpornk - Thanks for the reproducer we can reproduce same output on our side. We are taking a look.
Tagging @dzarukin to keep him in the loop.
Hi @penpornk and thanks for the report. In oneDNN, we typically have no requirement on the order in which values are accumulated. In GEMM/Matmul/Convolution, this can manifest itself in the following ways:
Adding requirements on the order of accumulation can have, and typically has, an impact on oneDNN performance. Could you clarify how this particular behavior with respect to the order of accumulation affects your application? Can you confirm that the accuracy drop in the end-user application is indeed due to the different order of accumulation between tails and blocks, or is it a more general issue with the order of accumulation?
Edit: in general, if the end-user application's accuracy is impacted by the order of accumulation, I would be tempted to say that the application is not very numerically stable. Even if we fix the order of accumulation in oneDNN, we might still pick the "wrong" order for a given application to reach maximum accuracy.
Thanks @mgouicem.
This error prevents a constant-compression optimization of a constant-folded matrix-multiply whose LHS is a matrix with its rows broadcasted (so the rows are just repeats).
If the LHS of a matrix-multiply has rows that are all the same, then the output will also have rows that are all the same. From there, we can compress that constant in a further optimization. Due to this discrepancy we cannot, since the last row of the constant-folded matrix-multiply is not bit-exact with the other rows.
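To make the dependency on bit-exactness concrete, here is a minimal Python sketch of such a compression step (the function name and return convention are hypothetical, not Grappler's actual API): it only collapses a folded constant to a single row when every row is bitwise identical, which the tail discrepancy defeats.

```python
import struct

def compress_repeated_rows(matrix):
    """If all rows are bitwise identical, keep one row plus a repeat count.

    Hypothetical sketch of a constant-compression pass; returns None when
    any row differs in any bit, as happens with the sgemm tail discrepancy.
    """
    def row_bits(row):
        # Compare IEEE-754 bit patterns, not float equality, since the
        # optimization must preserve results exactly.
        return struct.pack(f"{len(row)}f", *row)

    first = row_bits(matrix[0])
    if all(row_bits(r) == first for r in matrix[1:]):
        return matrix[0], len(matrix)  # compressed: one row + broadcast count
    return None  # rows not bit-exact: compression is unsafe

# All rows identical -> compressible.
ok = compress_repeated_rows([[1.5, 2.5], [1.5, 2.5]])
# Last row differs slightly (as in the reported bug) -> not compressible.
bad = compress_repeated_rows([[506.881195, 1.0], [506.879639, 1.0]])
```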
This is problematic specifically because constant folding in Grappler relies on the TF MatMul kernel for CPUs, which invokes GEMM.
Outside of compiler optimizations around constants that depend on bit-exactness (which could warrant not using gemm for such purposes), I suspect there are researchers who rely on broadcasting their arrays before feeding them to a matrix-multiply.
Even if we fix the order of accumulation in oneDNN, we still might pick the "wrong" order for a given application to reach maximum accuracy.
Sure, but so is non-determinism in the output of a matrix-multiply where the LHS has all rows repeated but the output does not.
I'm happy with a mode where we can guarantee determinism, but I'm also happy to hear more from folks working on Grappler, MLIR's TF dialect, other compilers leveraging gemm for constant folding, or researchers relying on the accuracy of operations.
This error prevents a constant-compression optimization of a constant-folded matrix-multiply whose LHS is a matrix with its rows broadcasted (so the rows are just repeats).
If the LHS of a matrix-multiply has rows that are all the same, then the output will also have rows that are all the same. From there, we can compress that constant in a further optimization. Due to this discrepancy we cannot, since the last row of the constant-folded matrix-multiply is not bit-exact with the other rows.
For this particular case, it would seem better to swap the matrix-multiplication and broadcast operations. This would both save some flops and guarantee that all rows are identical. Is it something doable on your side?
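As a sanity check on the suggested reordering, here is a small Python sketch (plain lists, no external dependencies, helper names are illustrative) of the identity being relied on: multiplying one row and then broadcasting yields bitwise-identical rows by construction, and also saves the redundant row computations.

```python
def matmul(a, b):
    # Naive row-major matrix multiply, accumulating left to right over k.
    cols = len(b[0])
    return [[sum(a_row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]
            for a_row in a]

def broadcast_rows(row, m):
    # Repeat a single row m times.
    return [list(row) for _ in range(m)]

a_row = [1.0, 2.0, 3.0]
b = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]

# matmul-then-broadcast: every output row is identical by construction.
swapped = broadcast_rows(matmul([a_row], b)[0], 4)

# broadcast-then-matmul: rows are only identical if every dot product is
# computed in the same order, which the AVX2 tail kernels do not guarantee.
direct = matmul(broadcast_rows(a_row, 4), b)
```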
This is problematic specifically because constant folding in grappler relies on the TF MatMul kernel for CPUs which invokes GEMM.
Sure but so is non-determinism in the output of a matrix-multiply where the LHS has all rows repeated but the output does not.
I'm happy with a mode where we can guarantee determinism
If swapping broadcast and Matmul is not an option on your side, we can investigate if preserving accumulation order in tails does not hurt performance too much. If it does, we would likely have to expose a mode as you mentioned, which would likely dispatch an unoptimized implementation. Is that a viable option for TF?
@aaraujom @akharito on their perspective wrt to this suggestion.
From the gemm side, it would be non-trivial to guarantee the same order of computations, and it would have performance implications. Not only would tail handling have to change, but we would also have to be careful about how we perform parallel reductions.
On small sizes and with multithreading, we might end up with the kernel executing one main block plus a tail. Low performance on the tail side would reduce overall performance, I think.
For this particular case, it would seem better to swap the matrix-multiplication and broadcast operations.
It's tricky because constant folding operates under the assumption that all inputs to an operation are constants, so it has to follow breadth-first order. We likely can't re-order this for all the ways a compiler can generate this IR fragment during compilation (for all such IR formats/dialects). It might be possible by adding a canonicalization rewrite pattern, but it's still cumbersome, since any ML compiler leveraging gemm for constant folding would have to buy into this at every stage of IR transforms that deals with explicit matmul ops.
I also want to be mindful of the user experience of researchers, since they need to be aware of this if they want accurate LHS-broadcasted matmuls in practice.
we can investigate if preserving accumulation order in tails does not hurt performance too much. If it does, we would likely have to expose a mode as you mentioned, which would likely dispatch an unoptimized implementation.
Thanks, I completely appreciate the effort! I think an unoptimized implementation of the kernel can be used for constant folding, but someone on the TF team, or working on Grappler or the TF MLIR dialect, might be a better authority.
From gemm side it would be non-trivial to guarantee same order of computations and it would have performance implications. Not only tail handling would have to be changed, but we would have to be careful on how we perform parallel reductions as well.
Right, I figured the order of accumulate is the tricky bit here.
On small sizes and with multithreading we might end up with kernel executing one main block plus a tail. Low performance on tail side would reduce overall performance I think.
Makes sense: that ordering affects the distribution of work, and we get throttled on the tail.
Just to make double-sure we are trying to solve the right problem (@penpornk @rahulpalamuttam please confirm):
Is the issue with `sgemm(broadcast_rows(A), B) != broadcast_rows(sgemm(A, B))`, in other words, if A's rows are all equal, a user would expect all rows of the output to be equal too? Same with `sgemm(A, broadcast_cols(B)) != broadcast_cols(sgemm(A, B))`, where a user would expect that if B's columns are all equal, all columns of the output should be equal too.
If this is correct, why is this property necessary? (Other than that it makes sense :-) ). In particular, why is it required for constant folding, but not for non-constant computation? Just trying to understand when this property is required, so we know what possible solution we can offer and what performance tradeoffs are involved.
Makes sense, that ordering affects the distribution of work and we get throttled on the tail.
From the oneDNN perspective, the issue is not with parallelization over K: sgemm implementations always add partial sums in order, so results are run-to-run reproducible (running the same program twice on the same system/environment with the same inputs returns the same result). The issue here is mostly that for M/N tails, we increase the number of accumulators in registers by splitting the reduction over K in two. Having two sets of accumulator registers (one accumulating over even K indices, the other over odd K indices) allows the sgemm tail kernel to better interleave instructions and better use the CPU execution ports.
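The even/odd split described above can change the last bit of the result even though every individual addition is correctly rounded. A minimal Python sketch (double precision here stands in for the kernel's f32 registers; function names are illustrative):

```python
def sequential_sum(vals):
    # One accumulator, strictly in order: the main-block behavior.
    acc = 0.0
    for v in vals:
        acc += v
    return acc

def dual_accumulator_sum(vals):
    # Two accumulators over even/odd indices, added at the end: the
    # tail-kernel behavior described above.
    even = odd = 0.0
    for i, v in enumerate(vals):
        if i % 2 == 0:
            even += v
        else:
            odd += v
    return even + odd

vals = [0.1] * 9  # 0.1 is not exactly representable in binary floating point
a = sequential_sum(vals)        # 0.8999999999999999
b = dual_accumulator_sum(vals)  # 0.9
```

Both results are within one unit in the last place of the exact answer, yet they are not bitwise equal, which is exactly the tail-vs-block discrepancy.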
Is the issue with sgemm(broadcast_rows(A), B) != broadcast_rows(sgemm(A, B)), in other words, if A rows are all equal, a user would expect all rows of the output to be equal too?
Yes
Same with sgemm(A, broadcast_cols(B)) != broadcast_cols(sgemm(A, B))
I haven't explicitly tested it with cols, but I imagine it happens as well.
In particular, why is it required for constant folding, but not for non constant computation?
It's required for both, which is why for the latter I want input from ML practitioners and not just compiler folk. Having been one at some point, I know tiny deltas like these can balloon into bigger non-determinism in model training, which is not fun to deal with. Of course, outside the world of ML and compilers, sgemm is important in a variety of other fields :)
Just trying to understand when this property is required, to understand what possible solution we can offer, and what are the performance tradeoffs involved.
At the risk of over-claiming, I think this property is always required/preferred, but I agree there are also instances where performance matters more than absolute, deterministic accuracy.
We absolutely want deterministic accuracy for constant folding, so I think a flag or a different implementation to turn on for TF kernels used during constant folding should be OK in the short term.
Fwiw, the original bug was in an ML research use-case over a year ago; I just re-discovered it for constant folding. This is why I want to say it's always required/preferred, but folks with more authority in both areas can weigh in.
ping @mgouicem
we can investigate if preserving accumulation order in tails does not hurt performance too much
I'm curious if there's some initial investigation results.
Hi @rahulpalamuttam sorry no results yet. We will update this thread as we get some data.
Hi @mgouicem,
Happy new year!
I was wondering if you were able to gather any data for this issue.
Cheers,
Rahul P
Hi @rahulpalamuttam - I didn't look into the performance aspect of this issue. However, we have internal plans to move the matmul and inner product implementations to brgemm kernels, which don't have the issue you mentioned.
However, we have internal plans to move the matmul and inner product implementations to brgemm kernels, which don't have the issue you mentioned.
Do you mind sharing any rough outline of timelines here?
I'd like a resolution to this bug, and if it means changing the matmul kernels used, it would be nice to know when TF should switch over.
@penpornk
Do you mind sharing any rough outline of timelines here?
From oneDNN side we have plans to have avx2 using brgemm kernels by the end of Q1 if everything goes well.
I'd like a resolution to this bug
Although the results are not bitwise identical for the scenario in the reproducer, the results computed for the tail parts are actually more accurate.
The exact value of each entry of the result matrix C is $506.88 = \frac{11}{10} \cdot \frac{9}{10} \cdot 512$. The error at the main parts is $|506.881195 - 506.88| = 0.001195$ and the error at the tail parts is $|506.879639 - 506.88| = 0.000361$. It is more accurate for the tails because we can accumulate in 2 sets of registers and then add them together, instead of accumulating one by one.
Changing to a kernel that always accumulates in the same order (along the k-dimension) wouldn't necessarily improve accuracy, but it would make the elements of the matrix the same.
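For intuition, the accuracy comparison can be simulated. Below is a hedged Python sketch (stdlib only) that emulates f32 accumulation of the reproducer's 512 products of $1.1 \times 0.9$ with one accumulator versus two; the simulated values may not match the kernel bit-for-bit, but both land near the exact 506.88.

```python
import struct

def f32(x):
    # Round a Python float to the nearest IEEE-754 binary32 value,
    # emulating single-precision arithmetic step by step.
    return struct.unpack("f", struct.pack("f", x))[0]

K = 512
prod = f32(f32(1.1) * f32(0.9))  # one a*b term, rounded to f32

# Main-block style: one accumulator, strictly in order over k.
acc = 0.0
for _ in range(K):
    acc = f32(acc + prod)

# Tail style: two accumulators (even/odd k), added at the end.
even = odd = 0.0
for k in range(K):
    if k % 2 == 0:
        even = f32(even + prod)
    else:
        odd = f32(odd + prod)
tail = f32(even + odd)

# Both results sit close to the exact 506.88; the last bits generally
# differ between the two accumulation orders, mirroring the report above.
```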
Happy new year all! :-)
@aaraujom , I don't think this is an accuracy issue, but mostly a semantic issue. @rahulpalamuttam wants to compute a broadcast operation followed by a matmul operation (see this comment). Hence, deterministic dot product within an sgemm computation is expected here. The accumulation order / accuracy does not matter here, as long as we use the same order across rows/columns.
@rahulpalamuttam wants to compute a broadcast operation followed by a matmul operation (see https://github.com/oneapi-src/oneDNN/issues/1486#issuecomment-1306301809). Hence, deterministic dot product within an sgemm computation is expected here. The accumulation order / accuracy does not matter here, as long as we use the same order across rows/columns.
+100
FYI: Facing the same issue with oneDNN 3.0.0. Question: does it matter if M, K, or N is a multiple of 8?
@WilliamTambellini - The sizes of `M` and `N` are relevant. If `N % 16 > 8` and `M % 6 > 3`, that would avoid the issue as far as I understand.
The problem comes from accumulating along the `K` dimension in different orders in `kernel_16x[123]` and `kernel_8x*` from `jit_avx_gemm_f32.cpp`. The accumulation order difference comes from here, if I remember correctly. (Note that the kernels are written in column-major ordering and `dnnl_sgemm` expects row-major ordering, so swap `M` and `N` when checking the kernels.)
I'm not sure whether I missed another part of the kernels that changes the accumulation order, but if the above doesn't work, avoiding tail cases completely (`N % 16 == 0` and `M % 6 == 0`) should avoid any special tail handling that changes the accumulation order in the `K` dimension.
Anyways, that's not a fix. We are adding avx2 BRGEMM support for matmul, but it is not quite ready yet.
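The size conditions above can be captured in a small helper. This is a purely illustrative Python sketch based on the description in this thread (the helper name is hypothetical; it assumes `m`/`n` are the row-major `dnnl_sgemm` sizes, swapped relative to the column-major kernel conditions `N % 16 == 0` and `M % 6 == 0`):

```python
def avoids_avx2_tail(m, n):
    """True if a row-major sgemm of an m x n output would avoid the AVX2
    tail kernels entirely, per the conditions discussed above.

    Illustrative only: the column-major kernels tile by 16 and 6, so with
    the row-major M/N swap the tail-free condition becomes
    m % 16 == 0 and n % 6 == 0.
    """
    return m % 16 == 0 and n % 6 == 0

# 32 x 12 hits only full blocks; 20 x 12 leaves an M tail.
full_blocks = avoids_avx2_tail(32, 12)
has_tail = not avoids_avx2_tail(20, 12)
```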
Hi @aaraujom , Is the fix ready for avx2 brgemm yet? Can you please also point out the area in brgemm where this accumulation issue can be handled?
Is the fix ready for avx2 brgemm yet?
Unfortunately not yet for matmul. We added brgemm avx2 kernels, but some work is still needed to enable them for matmul.
Can you please also point out the area in brgemm where this accumulation issue can be handled?
brgemm kernels are different from gemm kernels. brgemm kernels keep the same order of accumulation along the reduce dimension (k-dim), which avoids the issue reported above.
Summary
For some matrix sizes, sgemm accuracy drops with AVX2, the simplest case being when all elements in the result matrix should be the same. This can affect the accuracy of downstream applications, e.g., constant folding in TF.
Version
oneDNN v2.7.1 (commit a96c94f39773df35760d037d1ebe580dcf79c137)
Environment
oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:
- CPU (`lscpu`; if your `lscpu` does not list CPU flags, try running `cat /proc/cpuinfo | grep flags | sort -u`)
- OS (`uname -a`): Debian 5.18
- Compiler (`gcc --version`): gcc 12.2.0
- CMake (`cmake --version`): 3.24.2
- git hash (`git log -1 --format=%H`): a96c94f39773df35760d037d1ebe580dcf79c137

Steps to reproduce
Call `wrong_results()` in the following code:

Observed behavior
The resulting C matrix doesn't have all the same value. A different value appears in fringe rows and columns. (Fringe rows change with `m` (the total number of rows of C) and the number of OpenMP threads.)

Example output
`m = 20`, #threads = 1:

`m = 100`, #threads = 16:

Expected behavior
All elements in the matrix should have the same value.