nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Pack-peel pipeline bf16 intermittent numerical error #347

Open yzhang93 opened 5 months ago

yzhang93 commented 5 months ago

We've been noticing an intermittent numerical error with the pack-peel pipeline and the bf16 data type. To reproduce the problem, run this test locally multiple times; the test results and error are shown below:

error: the actual and expected result matrices disagree at row 48, column 16.

actual value: 0
expected value: -54

left-hand side (rows 40..55 out of 0..63, columns 0..15 out of 0..127)
 0    1    2   -2    1    2    1    2    2   -2    0   -1    2   -1    2   -1
-2    1    1   -1    1    1   -1    1   -2   -2   -2    2    0    2   -1    1
 0   -1    1    2    0    2    1    1   -1   -2    1    0    2    2   -1    2
 2   -1    1    2    0    2    0   -2    1    2    2    2    1   -2   -2    2
-1   -1   -1    1    0    2    0    1   -2   -1    2    2   -2    1    1    1
-1    0    1    2   -2    2   -2    2    1   -2   -1    0    1   -2    0    1
 2   -1   -1    1    0    2    0    1    1    0   -2   -1    2    1   -1   -2
 1    2    0   -1   -1   -2    0    2    0    2    2    1   -2    0   -1    2
-2    2   -2    0    0    1    1   -2    1    0   -2   -1    1   -1   -1    2
 2   -2   -1    2    1   -2    1   -2   -1    1    1    2   -2    1    2   -1
 2    0   -2    0    1    2   -1    0    1   -1    1    1   -2    1    0   -1
-1    0    1   -2   -2   -2    0   -2    0   -1    1    2   -1   -2   -1    2
 0   -2   -1   -2    2    2   -2   -1   -2    2    1   -1   -2    0    1    2
 2    2    2   -1    1    2    1   -2   -2   -1   -1   -1    1   -2    0    0
-2    2    0   -2    2   -1   -1   -1    2    2    2   -1    0   -2    1    1
 1   -2   -2    1    0   -1   -1    0    0    2    0   -2   -2    2    2    0

right-hand side (rows 0..15 out of 0..127, columns 8..23 out of 0..63)
-1    2   -2    1    0    2   -2   -1    0    2    1    0    0   -2   -1   -1
-1    0    2    0    1    2    2    1   -2   -2   -2    2   -1   -2    1    0
 1   -2   -2   -2    1    0    2    2    1   -1   -1    2    2   -2   -1    1
 2    0    0   -1   -1   -2    0    1   -2    1    2   -1   -1    0    0   -2
-1    0    2   -2    0    0   -1    0   -1    1   -2    1   -1    2    0   -1
-2   -2    0    0   -1    2   -2   -1   -2   -2   -1    2   -2    0   -2    1
 0   -1   -2   -2    0    2    0   -2   -2   -2   -2   -1    2    1    2    1
 2    2    1   -2    0   -1    2    0    2   -1   -1   -1    2    2   -1   -2
-1    1   -2    0   -1    1    0   -1   -2    1    0    1   -1   -2    1   -1
 1   -1    0    0   -1    2   -2    1   -1    1   -2   -2   -1    0    2   -2
-2   -2    2    1   -1    0    2    1    2   -1    1   -1    1    1    1    1
 0    0   -2    1   -2   -2    1   -1   -2    0    0    1    1    2    2   -2
-1   -2    2   -1    2   -2    1    0    0   -2   -2    0    0    0    0   -2
 1    1   -1    0    2   -2    0   -1   -2    0    0   -2    1   -2    0    1
 2   -1    0    0   -1   -2   -2    1    2    1    1    1   -1   -2    0    2
 1    2   -1    2    1    2   -2    2   -2    2    2    2    1    1    2    2

expected result (rows 40..55 out of 0..63, columns 8..23 out of 0..63)
-29   -32   -35   -41   -12    16    -8   -32    35   -29     0    18   -23   -37   -23    19
 13     9     8    -9    12   -26    -8   -13     8    -3    -4     6    48    26   -29    37
-35   -30    -2   -52    25     5    10   -39    25     3     9   -31    24   -26   -19    28
  0   -10   -27    22    11    21     8    15   -19     9   -10    15   -10    14    43    -7
 31    12    -3    -1   -25   -33   -19    -2    34    24    32     5   -11    -6   -21    49
 53   -36    -2    26   -37    16   -41    14    -4    11    -7     3     2   -11   -23     4
-22     4   -17   -33    -1    -4    55   -28   -14   -31   -39   -40    31    12   -20   -37
 15    -8     8    34   -14    40    25    23   -29     5   -10     2   -29    24   -12    -2
-33   -17   -33    38    28    -6    21     8   -54🦄 -18🦄 -46🦄 -10🦄 -28🦄 -35🦄  34🦄   8🦄
  4🦄 -10🦄  13🦄  21🦄  -5🦄 -17🦄  27🦄   1🦄 -34🦄  -6🦄 -12🦄 -11🦄 -31🦄  34🦄  35🦄   5🦄
 23🦄   7🦄 -16🦄  14🦄   9🦄 -25🦄 -26🦄 -32🦄  10     0    27     0   -36    -5   -28     8
 30   -25    -7    66    -5    -2    21    30    19   -21    31    19    27    -1     5    28
  0   -41     9    26    -8    32   -55    -9     8    35     5   -17    -1   -10    34    11
 -7   -22    32    -2    -8    26   -12   -13   -21    -8    35    43    14    10    -6   -20
-15🦄   2🦄   5🦄   2🦄  -2🦄  25🦄 -14🦄 -43🦄   9    12   -18    14   -15   -33     5    30
 39🦄  12🦄  25🦄  -4🦄   6🦄  13🦄  -1🦄  26🦄  -4🦄  13🦄  37🦄 -54🦄  22🦄  14🦄  19🦄  25🦄

actual result (rows 40..55 out of 0..63, columns 8..23 out of 0..63)
-29   -32   -35   -41   -12    16    -8   -32    35   -29     0    18   -23   -37   -23    19
 13     9     8    -9    12   -26    -8   -13     8    -3    -4     6    48    26   -29    37
-35   -30    -2   -52    25     5    10   -39    25     3     9   -31    24   -26   -19    28
  0   -10   -27    22    11    21     8    15   -19     9   -10    15   -10    14    43    -7
 31    12    -3    -1   -25   -33   -19    -2    34    24    32     5   -11    -6   -21    49
 53   -36    -2    26   -37    16   -41    14    -4    11    -7     3     2   -11   -23     4
-22     4   -17   -33    -1    -4    55   -28   -14   -31   -39   -40    31    12   -20   -37
 15    -8     8    34   -14    40    25    23   -29     5   -10     2   -29    24   -12    -2
-33   -17   -33    38    28    -6    21     8     0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞
  0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞
  0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞  10     0    27     0   -36    -5   -28     8
 30   -25    -7    66    -5    -2    21    30    19   -21    31    19    27    -1     5    28
  0   -41     9    26    -8    32   -55    -9     8    35     5   -17    -1   -10    34    11
 -7   -22    32    -2    -8    26   -12   -13   -21    -8    35    43    14    10    -6   -20
  0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   9    12   -18    14   -15   -33     5    30
  0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞

Note:

  1. The IR dumps from the successful and failed runs are the same.
  2. The intermittent wrong results occur even with the same input and vmfb files.
  3. It is not always the same rows that are wrongly zeroed.
  4. The problem seems to happen only with the pack-peel pipeline and the bf16 data type. I haven't seen it with the pad-pack pipeline so far.
  5. The issue could come from the compilation side, where there may be a race condition, or from the runtime side.
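
Since the failure is intermittent, it can help to measure how often it occurs before and after any change. The following is a minimal sketch, not part of the repository: `count_failures` is a hypothetical helper, and the command passed to it is a placeholder for whatever invocation runs the test locally. It assumes only that the test exits with a non-zero status when the matrices disagree.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return the 0-based indices of failing runs.

    `cmd` is a placeholder: substitute the real test invocation.
    A run counts as failed when the process exits with a non-zero status.
    """
    failed = []
    for i in range(runs):
        # Capture output so a long loop does not flood the terminal.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(i)
    return failed


# Stand-in command that always succeeds; replace with the real test command.
stand_in = [sys.executable, "-c", "raise SystemExit(0)"]
failed_runs = count_failures(stand_in, runs=5)
print(f"{len(failed_runs)} of 5 runs failed: {failed_runs}")
```

Recording the indices rather than just a count makes it easier to see whether failures cluster (for example, only on the first run after a reboot) or are spread evenly.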

yzhang93 commented 5 months ago

@MaheshRavishankar @newling @erwei-xilinx The above is a summary of my findings after some experiments with the intermittent numerical problem. Let me know if you have any ideas or suggestions to debug this.

newling commented 5 months ago

I've just seen a failure with pad-pack:

https://github.com/nod-ai/iree-amd-aie/actions/runs/9082298857/job/24958476227?pr=337

expected result (rows 55..63 out of 0..63, columns 24..39 out of 0..63)
-20   -30     0    10    -7    13   -13    14    20   -23   -25    14    -3   -40    13    -2   
 -1     7     5    33    14     0    -9    -2    -7   -20   -12    28    29     8    19     0   
 13   -21   -36   -19     7    -8     6   -11    24    11    -7     2   -19    -9     0     4   
-27     4     2    23     8     4     6     1    -2    -9   -18   -21    22    -8     4    17   
 -9     3    -7   -22   -10   -11     1   -14     4   -18   -12    -4    15   -30   -11    -7   
-25   -14    -2     2    -9    32    -4   -14    37    17   -14    -3     4   -28    -6    -8   
-32    32   -20     3    37   -17    11   -42   -17    -1     7   -19    29    24     4    12   
 -3    -6    -8     7   -15    14    -1    -8    21    -1   -11   -24   -19   -14    12    22   
 13   -14     3    -3     2    -3    43   -17    16🦄  10🦄  29🦄   4🦄   6🦄   7🦄 -14🦄 -17🦄 

actual result (rows 55..63 out of 0..63, columns 24..39 out of 0..63)
-20   -30     0    10    -7    13   -13    14    20   -23   -25    14    -3   -40    13    -2   
 -1     7     5    33    14     0    -9    -2    -7   -20   -12    28    29     8    19     0   
 13   -21   -36   -19     7    -8     6   -11    24    11    -7     2   -19    -9     0     4   
-27     4     2    23     8     4     6     1    -2    -9   -18   -21    22    -8     4    17   
 -9     3    -7   -22   -10   -11     1   -14     4   -18   -12    -4    15   -30   -11    -7   
-25   -14    -2     2    -9    32    -4   -14    37    17   -14    -3     4   -28    -6    -8   
-32    32   -20     3    37   -17    11   -42   -17    -1     7   -19    29    24     4    12   
 -3    -6    -8     7   -15    14    -1    -8    21    -1   -11   -24   -19   -14    12    22   
 13   -14     3    -3     2    -3    43   -17     0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞   0🐞 

iree/tools/testing/e2e/iree-e2e-matmul-test.cc:532: ABORTED; while calling import; while invoking C++ function matmul_test.check_matmul_results; 
[ 1]   native matmul_test.check_matmul_results:0 -
[ 0] bytecode calls.matmul_64x64_64xi8__64_64_64_0:140 /home/github/actions-runner/_work/iree-amd-aie/iree-amd-aie/test1/mm_int8_i8_i32_m64_n64_k64_calls.mlir:10:1

MaheshRavishankar commented 5 months ago

If the vmfb is the same, there are only two possibilities: 1) some uninitialized data is being read, or 2) this is a driver issue.

newling commented 5 months ago

The pattern of failure (rows of the tensor) looks suspiciously like the pattern I saw previously. See issue https://github.com/nod-ai/iree-amd-aie/issues/209 for a summary of findings in that case. In the end, @nirvedhmeshram and @daveliddell and I decided to just update the driver on nuc50, because that made the intermittent issue go away (we believed).
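
To compare failure patterns across runs more systematically, the zeroed regions can be extracted from the expected and actual matrices. The sketch below is an illustration only: `zeroed_runs` is a hypothetical helper, and the sample data is row 48 (columns 8..23) copied from the bf16 dump earlier in this issue. It reports each contiguous run where the actual value is 0 but the expected value is not.

```python
def zeroed_runs(expected, actual, row_offset=0, col_offset=0):
    """Return (row, start_col, length) for each run of wrongly zeroed entries.

    An entry is "wrongly zeroed" when actual == 0 and expected != 0.
    Note: a legitimately zero expected value splits a run in two, so
    adjacent runs may belong to the same zeroed block.
    """
    runs = []
    for r, (exp_row, act_row) in enumerate(zip(expected, actual)):
        width = min(len(exp_row), len(act_row))
        start = None
        for c in range(width):
            bad = act_row[c] == 0 and exp_row[c] != 0
            if bad and start is None:
                start = c
            elif not bad and start is not None:
                runs.append((row_offset + r, col_offset + start, c - start))
                start = None
        if start is not None:
            # The run extends to the end of the row.
            runs.append((row_offset + r, col_offset + start, width - start))
    return runs


# Row 48, columns 8..23, from the dump above.
expected = [[-33, -17, -33, 38, 28, -6, 21, 8, -54, -18, -46, -10, -28, -35, 34, 8]]
actual = [[-33, -17, -33, 38, 28, -6, 21, 8, 0, 0, 0, 0, 0, 0, 0, 0]]
print(zeroed_runs(expected, actual, row_offset=48, col_offset=8))
# → [(48, 16, 8)]
```

The result matches the reported first mismatch at row 48, column 16, and shows that the zeroed span is 8 elements wide within this window.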