tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

sum/reduce op does not work for random inputs #9908

Open · johanna-rock-tt opened this issue 3 days ago

johanna-rock-tt commented 3 days ago

Describe the bug
The sum op (reduce) has a large error for non-integer inputs. Currently, test_sum.py only tests the summation of integers; this is not sufficient, and the op fails for random fp32 and bfloat16 inputs.

I added a test that uses random inputs and different data formats and compares the results using comp_allclose. This test should pass.
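For reference, a rough sketch of what such a random-input test could look like (this is not the exact test on the jrock/reduce-issue branch: tt_sum is a hypothetical stand-in for the device-side call, the tolerances are illustrative, and torch.allclose is used instead of comp_allclose to keep the sketch self-contained):

```python
import pytest
import torch


@pytest.mark.parametrize("shape", [(1, 1, 32, 32), (1, 1, 32, 1024)])
@pytest.mark.parametrize("torch_dtype", [torch.bfloat16, torch.float32])
def test_sum_random(shape, torch_dtype):
    torch.manual_seed(0)
    # Random, non-integer inputs in roughly [-8, 8); integer-only inputs hide the error.
    x = torch.rand(shape, dtype=torch_dtype) * 16.0 - 8.0

    # Golden reference: torch sum over the last dimension.
    ref = torch.sum(x, dim=-1)

    # tt_sum is a hypothetical helper standing in for the device path:
    # copy x to device, run the sum (reduce) op over dim=-1, read the result back.
    got = tt_sum(x, dim=-1)

    # The branch's test uses comp_allclose; plain torch.allclose shown here.
    assert torch.allclose(ref.float(), got.float(), rtol=1e-2, atol=1e-2)
```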

What I see:

BF16 shape = (1, 1, 32, 32) ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.068359375, PCC: 0.9999956409523759, Allclose check failed

FP32 shape = (1, 1, 32, 32) ref vs tt = Max ATOL Delta: 0.05738258361816406, Max RTOL Delta: 0.02300325594842434, PCC: 0.9999959183719673, Allclose check failed

BF16 shape = (1, 1, 32, 1024) ref vs tt = Max ATOL Delta: 2.75, Max RTOL Delta: 0.279296875, PCC: 0.9998845309940628, Allclose check failed

FP32 shape = (1, 1, 32, 1024) ref vs tt = Max ATOL Delta: 1.9558944702148438, Max RTOL Delta: 0.3209826350212097, PCC: 0.9998569160517634, Allclose check failed

Even adding two numbers fails:

Two input numbers: 1.9297, 1.4844 (all other numbers in the tensor are zero)

Sum over that dim:
Torch reference (bf16): 3.4062
Torch reference (fp32): 3.4141
FP32: 3.4219
BF16: 3.4219
BFP8_B: 3.4375 
BFP4_B: 3.0
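For context, the two torch reference values above can be reproduced with plain torch; a minimal check (the commented outputs match the values quoted above):

```python
import torch

# The two non-zero inputs from the example above, stored as bfloat16.
a = torch.tensor([1.9297, 1.4844], dtype=torch.bfloat16)

print(a.sum())                    # tensor(3.4062, dtype=torch.bfloat16)
print(a.to(torch.float32).sum())  # tensor(3.4141)
```

Note that the TT result of 3.4219 (3.421875) is exactly one bfloat16 step above the bf16 reference 3.4062 (3.40625): the spacing between representable bfloat16 values in [2, 4) is 0.015625.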

For larger shapes, e.g. [32, 1024] or [2048, 1024], this error accumulates to very large numbers.

To Reproduce
Steps to reproduce the behavior:

  1. Check out the jrock/reduce-issue branch
  2. Run pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sum.py::test_sum
johanna-rock-tt commented 3 days ago

The issue also exists with different reduction dims:

Two input numbers: 1.9297, 1.4844 (dtype=bfloat16)

sum_global torch: 3.4062 tt: 3.4219

grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_w/4495494718154647535/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_ROW
built/2052/kernels/reduce_h/15874065893775173914/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_COL

Sum in H torch: 3.4062 tt: 3.4219

grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_h/15874065893775173914/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_COL

Sum in W torch: 3.4062 tt: 3.4219

grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_w/4495494718154647535/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_ROW

johanna-rock-tt commented 2 days ago

fyi: @uaydonat

pavlejosipovic commented 20 hours ago

From a quick look at the implementation, it seems that reduce in Metal doesn't support 32-bit accumulation, no matter which tensor dtypes are passed in.
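To illustrate why accumulation precision matters at these sizes, here is a plain-torch sketch (not tt-metal code): summing 1024 random values while the running sum is rounded back to bfloat16 after every add drifts away from an fp32 accumulation of the same inputs.

```python
# Plain-torch illustration of bf16 vs fp32 accumulation error (not tt-metal code).
import torch

torch.manual_seed(0)
x = (torch.rand(1024) * 16.0 - 8.0).to(torch.bfloat16)

# fp32 accumulation of the same bf16 inputs.
ref = x.to(torch.float32).sum()

# bf16 accumulation: the running sum is rounded to bfloat16 after every add.
acc = torch.tensor(0.0, dtype=torch.bfloat16)
for v in x:
    acc = acc + v

print("fp32 acc:", ref.item())
print("bf16 acc:", acc.float().item())
print("abs diff:", abs(ref.item() - acc.float().item()))
```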

cglagovichTT commented 16 hours ago

Repro

ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.068359375, PCC: 0.9999956409523759, Allclose check failed
expect:  tensor([  6.9688,  -4.8750,   2.3125,   6.0000,   5.6562,  -7.2188,   4.8438,
          2.4688,  -0.8047,  -2.2500,   2.7656,   3.6875,  -1.5156,  -0.0605,
          0.7969,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
          6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.6094, -12.6250,
          6.2188,  -2.2188,  -9.8125,   7.0938], dtype=torch.bfloat16)
got:  tensor([  7.0000,  -4.8750,   2.3281,   6.0000,   5.6562,  -7.2188,   4.8750,
          2.4688,  -0.7812,  -2.2500,   2.7656,   3.7031,  -1.5234,  -0.0564,
          0.7891,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
          6.2188,   6.9062,   4.6562,  -6.0000,  -4.7500,   3.6094, -12.5625,
          6.2500,  -2.2344,  -9.8125,   7.0938], dtype=torch.bfloat16)

Now I modified reduce_w.cpp so that the unpacker performs reduce_sum with fp32 accumulation. These are the results:

ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.0069580078125, PCC: 0.9999964480552922, Allclose check failed
expect:  tensor([  6.9688,  -4.8750,   2.3125,   6.0000,   5.6562,  -7.2188,   4.8438,
          2.4688,  -0.8047,  -2.2500,   2.7656,   3.6875,  -1.5156,  -0.0605,
          0.7969,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
          6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.6094, -12.6250,
          6.2188,  -2.2188,  -9.8125,   7.0938], dtype=torch.bfloat16)
got:  tensor([  6.9688,  -4.8438,   2.3125,   5.9688,   5.6250,  -7.2188,   4.8125,
          2.4688,  -0.8008,  -2.2344,   2.7656,   3.6875,  -1.5156,  -0.0605,
          0.7969,  -4.5938,  -6.1562,   5.7812,  -3.6406,   6.2500,  -6.7812,
          6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.5938, -12.5625,
          6.2188,  -2.2188,  -9.7500,   7.0625], dtype=torch.bfloat16)

All I can say from this result is that the reduce_tile implementation without fp32 accumulation does not show greater error than reduce_tile with fp32 accumulation. I would have to check whether this holds for larger inputs.