Open yzhang93 opened 5 months ago
@MaheshRavishankar @newling @erwei-xilinx The above is a summary of my findings after some experiments with the intermittent numerical problem. Let me know if you have any ideas or suggestions to debug this.
I've just seen a failure with pad-pack:
https://github.com/nod-ai/iree-amd-aie/actions/runs/9082298857/job/24958476227?pr=337
expected result (rows 55..63 out of 0..63, columns 24..39 out of 0..63)
-20 -30 0 10 -7 13 -13 14 20 -23 -25 14 -3 -40 13 -2
-1 7 5 33 14 0 -9 -2 -7 -20 -12 28 29 8 19 0
13 -21 -36 -19 7 -8 6 -11 24 11 -7 2 -19 -9 0 4
-27 4 2 23 8 4 6 1 -2 -9 -18 -21 22 -8 4 17
-9 3 -7 -22 -10 -11 1 -14 4 -18 -12 -4 15 -30 -11 -7
-25 -14 -2 2 -9 32 -4 -14 37 17 -14 -3 4 -28 -6 -8
-32 32 -20 3 37 -17 11 -42 -17 -1 7 -19 29 24 4 12
-3 -6 -8 7 -15 14 -1 -8 21 -1 -11 -24 -19 -14 12 22
13 -14 3 -3 2 -3 43 -17 16🦄 10🦄 29🦄 4🦄 6🦄 7🦄 -14🦄 -17🦄
actual result (rows 55..63 out of 0..63, columns 24..39 out of 0..63)
-20 -30 0 10 -7 13 -13 14 20 -23 -25 14 -3 -40 13 -2
-1 7 5 33 14 0 -9 -2 -7 -20 -12 28 29 8 19 0
13 -21 -36 -19 7 -8 6 -11 24 11 -7 2 -19 -9 0 4
-27 4 2 23 8 4 6 1 -2 -9 -18 -21 22 -8 4 17
-9 3 -7 -22 -10 -11 1 -14 4 -18 -12 -4 15 -30 -11 -7
-25 -14 -2 2 -9 32 -4 -14 37 17 -14 -3 4 -28 -6 -8
-32 32 -20 3 37 -17 11 -42 -17 -1 7 -19 29 24 4 12
-3 -6 -8 7 -15 14 -1 -8 21 -1 -11 -24 -19 -14 12 22
13 -14 3 -3 2 -3 43 -17 0🐞 0🐞 0🐞 0🐞 0🐞 0🐞 0🐞 0🐞
iree/tools/testing/e2e/iree-e2e-matmul-test.cc:532: ABORTED; while calling import; while invoking C++ function matmul_test.check_matmul_results;
[ 1] native matmul_test.check_matmul_results:0 -
[ 0] bytecode calls.matmul_64x64_64xi8__64_64_64_0:140 /home/github/actions-runner/_work/iree-amd-aie/iree-amd-aie/test1/mm_int8_i8_i32_m64_n64_k64_calls.mlir:10:1
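To make the failure pattern in the dump above explicit, here is a minimal sketch that compares the last displayed row (row 63, columns 24..39) of the expected and actual results. The helper name `find_mismatches` is made up for illustration and is not part of the IREE test tooling; the values are copied from the log.

```python
# Last displayed row (row 63, columns 24..39), copied from the dump above.
expected_row = [13, -14, 3, -3, 2, -3, 43, -17, 16, 10, 29, 4, 6, 7, -14, -17]
actual_row = [13, -14, 3, -3, 2, -3, 43, -17, 0, 0, 0, 0, 0, 0, 0, 0]


def find_mismatches(expected, actual, col_offset=0):
    """Return (column, expected, actual) for every differing element.

    Hypothetical helper for illustration only.
    """
    return [
        (col_offset + i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


mismatches = find_mismatches(expected_row, actual_row, col_offset=24)
cols = [c for c, _, _ in mismatches]
# True when every wrong element in the actual result is exactly zero.
all_zero = all(a == 0 for _, _, a in mismatches)
print(cols, all_zero)
```

In this dump the mismatches form one contiguous run of eight columns (32..39) in the last row, and every wrong value is zero. That would fit an output region that was never written, though it does not by itself distinguish between the two causes discussed here.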
If the vmfb is the same, there are only two possibilities: 1) some uninitialized data is being read, or 2) this is a driver issue.
The pattern of failure (rows of the tensor) looks suspiciously like the pattern I saw previously. See issue https://github.com/nod-ai/iree-amd-aie/issues/209 for a summary of findings in that case. In the end, @nirvedhmeshram, @daveliddell, and I decided to just update the driver on nuc50, because that appeared to make the intermittent issue go away.
We've been noticing an intermittent numerical error with the pack-peel pipeline and the bf16 data type. To reproduce the problem, run this test locally multiple times; the test results and error are as below:
Note:
The problem seems to happen only with the pack-peel pipeline and the bf16 data type. I haven't seen it with the pad-pack pipeline so far.
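Since the failure is intermittent, it can help to script the "run it multiple times" step and record which runs fail. The sketch below assumes only that the test can be invoked as a command whose nonzero exit code signals failure. The actual test command is not given in this thread, so `placeholder_cmd` is a stand-in that always succeeds, and `count_failures` is a hypothetical helper.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times; return the 0-based indices of runs that exited nonzero."""
    failed = []
    for i in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(i)
    return failed


# Placeholder (assumption): replace with the real test invocation.
placeholder_cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
failed = count_failures(placeholder_cmd, runs=5)
print(f"{len(failed)} of 5 runs failed: {failed}")
```

Recording the indices of the failing runs, rather than only a count, makes it easier to see whether failures cluster (for example, only on the first run after a reboot), which could help separate an uninitialized-data problem from a driver problem.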