tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

sum/reduce op does not work for random inputs #9908

Open · johanna-rock-tt opened this issue 3 days ago

johanna-rock-tt commented 3 days ago

Describe the bug
The sum op (reduce) has a large error for non-integer inputs. Currently, test_sum.py only tests the summation of integers; this is not sufficient, and the op fails for random fp32 and bfloat16 inputs.

I added a test that uses random inputs and different data formats and compares the results using comp_allclose. This test should pass.
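For reference, a rough sketch of what such a random-input test could look like (this is not the exact test on the jrock/reduce-issue branch: tt_sum is a hypothetical stand-in for the device-side call, the tolerances are illustrative, and torch.allclose is used instead of comp_allclose to keep the sketch self-contained):

```python
import pytest
import torch


@pytest.mark.parametrize("shape", [(1, 1, 32, 32), (1, 1, 32, 1024)])
@pytest.mark.parametrize("torch_dtype", [torch.bfloat16, torch.float32])
def test_sum_random(shape, torch_dtype):
    torch.manual_seed(0)
    # Random, non-integer inputs in roughly [-8, 8); integer-only inputs hide the error.
    x = torch.rand(shape, dtype=torch_dtype) * 16.0 - 8.0

    # Golden reference: torch sum over the last dimension.
    ref = torch.sum(x, dim=-1)

    # tt_sum is a hypothetical helper standing in for the device path:
    # copy x to device, run the sum (reduce) op over dim=-1, read the result back.
    got = tt_sum(x, dim=-1)

    # The branch's test uses comp_allclose; plain torch.allclose shown here.
    assert torch.allclose(ref.float(), got.float(), rtol=1e-2, atol=1e-2)
```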

What I see:

BF16 shape = (1, 1, 32, 32) ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.068359375, PCC: 0.9999956409523759, Allclose check failed

FP32 shape = (1, 1, 32, 32) ref vs tt = Max ATOL Delta: 0.05738258361816406, Max RTOL Delta: 0.02300325594842434, PCC: 0.9999959183719673, Allclose check failed

BF16 shape = (1, 1, 32, 1024) ref vs tt = Max ATOL Delta: 2.75, Max RTOL Delta: 0.279296875, PCC: 0.9998845309940628, Allclose check failed

FP32 shape = (1, 1, 32, 1024) ref vs tt = Max ATOL Delta: 1.9558944702148438, Max RTOL Delta: 0.3209826350212097, PCC: 0.9998569160517634, Allclose check failed

Even adding two numbers fails:

Two input numbers: 1.9297, 1.4844 (all other numbers in the tensor are zero)

Sum over that dim:
Torch reference (bf16): 3.4062
Torch reference (fp32): 3.4141
FP32: 3.4219
BF16: 3.4219
BFP8_B: 3.4375 
BFP4_B: 3.0
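For context, the two torch reference values above can be reproduced with plain torch; a minimal check (the commented outputs match the values quoted above):

```python
import torch

# The two non-zero inputs from the example above, stored as bfloat16.
a = torch.tensor([1.9297, 1.4844], dtype=torch.bfloat16)

print(a.sum())                    # tensor(3.4062, dtype=torch.bfloat16)
print(a.to(torch.float32).sum())  # tensor(3.4141)
```

Note that the TT result of 3.4219 (3.421875) is exactly one bfloat16 step above the bf16 reference 3.4062 (3.40625): the spacing between representable bfloat16 values in [2, 4) is 0.015625.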

For larger shapes, e.g. [32, 1024] or [2048, 1024], this error accumulates to very large numbers.

To Reproduce
Steps to reproduce the behavior:

  1. Check out the jrock/reduce-issue branch
  2. Run pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sum.py::test_sum
johanna-rock-tt commented 3 days ago

The issue also exists with different reduction dims:

Two input numbers: 1.9297, 1.4844 (dtype=bfloat16)

sum_global torch: 3.4062 tt: 3.4219

grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_w/4495494718154647535/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_ROW
built/2052/kernels/reduce_h/15874065893775173914/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_COL

Sum in H torch: 3.4062 tt: 3.4219

grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_h/15874065893775173914/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_COL

Sum in W torch: 3.4062 tt: 3.4219

grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_w/4495494718154647535/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_ROW

johanna-rock-tt commented 2 days ago

fyi: @uaydonat

pavlejosipovic commented 20 hours ago

From a quick look at the implementation, it seems that reduce in Metal doesn't support 32-bit accumulation, no matter which tensor dtypes are passed in.
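To illustrate why accumulation precision matters at these sizes, here is a plain-torch sketch (not tt-metal code): summing 1024 random values while the running sum is rounded back to bfloat16 after every add drifts away from an fp32 accumulation of the same inputs.

```python
# Plain-torch illustration of bf16 vs fp32 accumulation error (not tt-metal code).
import torch

torch.manual_seed(0)
x = (torch.rand(1024) * 16.0 - 8.0).to(torch.bfloat16)

# fp32 accumulation of the same bf16 inputs.
ref = x.to(torch.float32).sum()

# bf16 accumulation: the running sum is rounded to bfloat16 after every add.
acc = torch.tensor(0.0, dtype=torch.bfloat16)
for v in x:
    acc = acc + v

print("fp32 acc:", ref.item())
print("bf16 acc:", acc.float().item())
print("abs diff:", abs(ref.item() - acc.float().item()))
```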

cglagovichTT commented 16 hours ago

Repro

ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.068359375, PCC: 0.9999956409523759, Allclose check failed
expect:  tensor([  6.9688,  -4.8750,   2.3125,   6.0000,   5.6562,  -7.2188,   4.8438,
          2.4688,  -0.8047,  -2.2500,   2.7656,   3.6875,  -1.5156,  -0.0605,
          0.7969,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
          6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.6094, -12.6250,
          6.2188,  -2.2188,  -9.8125,   7.0938], dtype=torch.bfloat16)
got:  tensor([  7.0000,  -4.8750,   2.3281,   6.0000,   5.6562,  -7.2188,   4.8750,
          2.4688,  -0.7812,  -2.2500,   2.7656,   3.7031,  -1.5234,  -0.0564,
          0.7891,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
          6.2188,   6.9062,   4.6562,  -6.0000,  -4.7500,   3.6094, -12.5625,
          6.2500,  -2.2344,  -9.8125,   7.0938], dtype=torch.bfloat16)

Now I modified reduce_w.cpp so that the unpacker performs reduce_sum with fp32 accumulation. These are the results:

ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.0069580078125, PCC: 0.9999964480552922, Allclose check failed
expect:  tensor([  6.9688,  -4.8750,   2.3125,   6.0000,   5.6562,  -7.2188,   4.8438,
          2.4688,  -0.8047,  -2.2500,   2.7656,   3.6875,  -1.5156,  -0.0605,
          0.7969,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
          6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.6094, -12.6250,
          6.2188,  -2.2188,  -9.8125,   7.0938], dtype=torch.bfloat16)
got:  tensor([  6.9688,  -4.8438,   2.3125,   5.9688,   5.6250,  -7.2188,   4.8125,
          2.4688,  -0.8008,  -2.2344,   2.7656,   3.6875,  -1.5156,  -0.0605,
          0.7969,  -4.5938,  -6.1562,   5.7812,  -3.6406,   6.2500,  -6.7812,
          6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.5938, -12.5625,
          6.2188,  -2.2188,  -9.7500,   7.0625], dtype=torch.bfloat16)

All I can say from this result is that the reduce_tile implementation without fp32 accumulation does not show greater error than reduce_tile with fp32 accumulation. I would have to check whether this holds for larger inputs.