johanna-rock-tt opened this issue 3 days ago
The issue also exists with different reduction dims.

Two input numbers: `1.9297`, `1.4844`, `dtype=bfloat16`

sum_global: torch `3.4062`, tt `3.4219`

```
$ grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_w/4495494718154647535/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_ROW
built/2052/kernels/reduce_h/15874065893775173914/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_COL
```

Sum in H: torch `3.4062`, tt `3.4219`

```
$ grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_h/15874065893775173914/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_COL
```

Sum in W: torch `3.4062`, tt `3.4219`

```
$ grep -Rnw 'built' -e 'REDUCE_DIM'
built/2052/kernels/reduce_w/4495494718154647535/defines_generated.h:2:#define REDUCE_DIM ReduceDim::REDUCE_ROW
```
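Incidentally, torch's `3.4062` is the correctly rounded result here: `1.9297` and `1.4844` round to the bfloat16 values `1.9296875` and `1.484375`, whose exact sum `3.4140625` lies exactly halfway between the neighbouring bfloat16 values `3.40625` and `3.421875`. Round-to-nearest-even picks `3.40625` (torch's `3.4062`), while tt returns the upper neighbour `3.421875`. A small pure-Python sketch of bfloat16 rounding (truncate a float32 to its top 16 bits with round-to-nearest-even) reproduces this:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even)
    by keeping only the top 16 bits of its float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

a = to_bfloat16(1.9297)    # -> 1.9296875
b = to_bfloat16(1.4844)    # -> 1.484375
print(a + b)               # 3.4140625, a tie between two bfloat16 neighbours
print(to_bfloat16(a + b))  # -> 3.40625, matching torch's 3.4062
```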
fyi: @uaydonat
Taking a quick look at the impl, it seems that reduce in metal doesn't support 32-bit accumulation (no matter which tensors are passed in).
Branch: `cglagovich/fp32_slow_reduce`
```
pytest -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_sum.py::test_sum[shape_dim0-BFLOAT16]
```

```
ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.068359375, PCC: 0.9999956409523759, Allclose check failed
expect: tensor([  6.9688,  -4.8750,   2.3125,   6.0000,   5.6562,  -7.2188,   4.8438,
                  2.4688,  -0.8047,  -2.2500,   2.7656,   3.6875,  -1.5156,  -0.0605,
                  0.7969,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
                  6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.6094, -12.6250,
                  6.2188,  -2.2188,  -9.8125,   7.0938], dtype=torch.bfloat16)
got: tensor([  7.0000,  -4.8750,   2.3281,   6.0000,   5.6562,  -7.2188,   4.8750,
               2.4688,  -0.7812,  -2.2500,   2.7656,   3.7031,  -1.5234,  -0.0564,
               0.7891,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
               6.2188,   6.9062,   4.6562,  -6.0000,  -4.7500,   3.6094, -12.5625,
               6.2500,  -2.2344,  -9.8125,   7.0938], dtype=torch.bfloat16)
```
Now I modify the `reduce_w.cpp` file to make the unpacker perform reduce_sum with fp32 accumulation. These are the results with fp32 accumulation:
```
ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.0069580078125, PCC: 0.9999964480552922, Allclose check failed
expect: tensor([  6.9688,  -4.8750,   2.3125,   6.0000,   5.6562,  -7.2188,   4.8438,
                  2.4688,  -0.8047,  -2.2500,   2.7656,   3.6875,  -1.5156,  -0.0605,
                  0.7969,  -4.5938,  -6.1562,   5.8125,  -3.6562,   6.2812,  -6.8125,
                  6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.6094, -12.6250,
                  6.2188,  -2.2188,  -9.8125,   7.0938], dtype=torch.bfloat16)
got: tensor([  6.9688,  -4.8438,   2.3125,   5.9688,   5.6250,  -7.2188,   4.8125,
               2.4688,  -0.8008,  -2.2344,   2.7656,   3.6875,  -1.5156,  -0.0605,
               0.7969,  -4.5938,  -6.1562,   5.7812,  -3.6406,   6.2500,  -6.7812,
               6.2188,   6.9062,   4.6562,  -6.0000,  -4.7188,   3.5938, -12.5625,
               6.2188,  -2.2188,  -9.7500,   7.0625], dtype=torch.bfloat16)
```
All I can say from this result is that the `reduce_tile` impl without fp32 accumulation does not have a greater max ATOL error than `reduce_tile` with fp32 accumulation (the max RTOL error does improve with fp32). I would have to see whether this holds for larger inputs.
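To get a feel for whether accumulator precision should matter at larger sizes, the two strategies can be simulated in plain Python (a sketch only, not the actual kernel: `to_bfloat16` below is a hand-rolled helper, and the fp32 path is idealized as rounding once at the end):

```python
import random
import struct

def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even),
    keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

random.seed(0)
vals = [to_bfloat16(random.uniform(-1.0, 1.0)) for _ in range(1024)]
exact = sum(vals)  # reference: exact sum of the already-rounded inputs

# bfloat16 accumulator: rounding error is injected after every single add
acc_low = 0.0
for v in vals:
    acc_low = to_bfloat16(acc_low + v)

# high-precision accumulator: a single rounding at the very end
acc_high = to_bfloat16(exact)

print(abs(acc_low - exact), abs(acc_high - exact))
```

With a high-precision accumulator the final error is bounded by half a bfloat16 ULP of the result, while the low-precision accumulator picks up a rounding error on every one of the 1024 adds, which is consistent with the ATOL growth seen on the (1, 1, 32, 1024) shapes.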
**Describe the bug**

The sum op (reduce) has a large error for non-integer inputs. Currently in `test_sum.py` only the summation of integers is tested; this is not sufficient, and the op fails for random fp32 and bfloat16 inputs. I added a test that uses random inputs and different data formats, and compares using `comp_allclose`. This test should pass.

What I see:
```
BF16 shape = (1, 1, 32, 32)    ref vs tt = Max ATOL Delta: 0.0625, Max RTOL Delta: 0.068359375, PCC: 0.9999956409523759, Allclose check failed
FP32 shape = (1, 1, 32, 32)    ref vs tt = Max ATOL Delta: 0.05738258361816406, Max RTOL Delta: 0.02300325594842434, PCC: 0.9999959183719673, Allclose check failed
BF16 shape = (1, 1, 32, 1024)  ref vs tt = Max ATOL Delta: 2.75, Max RTOL Delta: 0.279296875, PCC: 0.9998845309940628, Allclose check failed
FP32 shape = (1, 1, 32, 1024)  ref vs tt = Max ATOL Delta: 1.9558944702148438, Max RTOL Delta: 0.3209826350212097, PCC: 0.9998569160517634, Allclose check failed
```
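For reference, metrics of the kind `comp_allclose` reports can be sketched in a few lines (illustrative only: the names and exact definitions here are assumptions, the real helper lives in the repo's test utilities, and the allclose criterion mirrors `torch.allclose`):

```python
def max_atol_delta(ref, got):
    """Largest absolute element-wise difference."""
    return max(abs(r - g) for r, g in zip(ref, got))

def max_rtol_delta(ref, got, eps=1e-12):
    """Largest relative element-wise difference, guarded against division by zero."""
    return max(abs(r - g) / max(abs(r), eps) for r, g in zip(ref, got))

def allclose(ref, got, rtol=1e-2, atol=1e-2):
    """torch.allclose-style check: |r - g| <= atol + rtol * |g| everywhere."""
    return all(abs(r - g) <= atol + rtol * abs(g) for r, g in zip(ref, got))

# first few elements of the bfloat16 run above
ref = [6.9688, -4.8750, 2.3125]
got = [7.0000, -4.8750, 2.3281]
print(max_atol_delta(ref, got))                   # ~0.0312
print(allclose(ref, got, rtol=0.0, atol=0.01))    # False: 0.0312 exceeds atol
```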
Even adding two numbers fails (see the two-input bfloat16 example earlier in this thread). For larger shapes, e.g. [32, 1024] or [2048, 1024], this error accumulates to very large numbers.
**To Reproduce**

Steps to reproduce the behavior:

Branch: `jrock/reduce-issue`

```
pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sum.py::test_sum
```