Open bbradelTT opened 1 month ago
This suite passes when fp32_dest_acc_en is disabled in misc/test_layernorm.py.
@ncvetkovicTT could you help take a look at this?
Hey @abhullar-tt, @bbradelTT. Please take a look at ncvetkovic/14352_layernorm_fail so that we can continue the investigation. What I've done there is run only test_id=0 and in_dtype=float32 for simplicity. I then played around with input shapes, because I suspected that neither the input width nor the height can be larger than 4 fp32 tiles: when we have DstSync::SyncHalf, that's how many fp32 tiles fit in half of the DEST.
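For reference, the 4-tile figure follows from simple capacity arithmetic. A minimal sketch, assuming DEST holds 16 fp16 tiles in full-sync mode (i.e. 32 KiB total; check the actual hardware spec, this size is an assumption, not from the thread):

```python
# Back-of-the-envelope DEST capacity check.
# Assumption: DEST holds 16 fp16 tiles in full-sync mode, i.e. 32 KiB total.
TILE_ELEMS = 32 * 32          # one tile is 32x32 elements
DEST_BYTES = 32 * 1024        # assumed total DEST size

def tiles_in_dest(dtype_bytes: int, sync_half: bool) -> int:
    """Number of tiles of the given element size that fit in DEST."""
    usable = DEST_BYTES // 2 if sync_half else DEST_BYTES
    return usable // (TILE_ELEMS * dtype_bytes)

print(tiles_in_dest(dtype_bytes=4, sync_half=True))   # fp32 + DstSync::SyncHalf -> 4
print(tiles_in_dest(dtype_bytes=2, sync_half=False))  # fp16, full sync -> 16
```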
Also, if the width is kept at 4 tiles, the height can't be larger than 130 tiles. I assume this is a core grid constraint, because I remember seeing that number somewhere, but I am not sure.
Now there are a couple of things I don't understand here:
=========================== short test summary info ===========================
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7051563259499515
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7510356020184914
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.8523632833423422
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7096639868818564
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7260290469183038
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.8573792455041984
================== 6 failed, 30 passed, 36 warnings in 57.88s ==================
Please notice again my comments in the test_dims array. Look closely at the PCC difference as we increase the height of the block from 130 to 131 tiles and beyond: the PCC gradually decreases, which suggests that the first 130x4 tiles are transferred and processed correctly, but the rest are not (they are either transferred incorrectly or not processed properly by the compute kernel).
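To illustrate why a partially corrupted output shows up as a gradual PCC drop rather than an outright failure, here is a small sketch using a plain NumPy Pearson correlation (the test's actual PCC helper may be implemented differently):

```python
import numpy as np

def pcc(golden: np.ndarray, actual: np.ndarray) -> float:
    """Pearson correlation coefficient between two flattened tensors."""
    return float(np.corrcoef(golden.ravel(), actual.ravel())[0, 1])

rng = np.random.default_rng(0)
# 131 tiles high, 4 tiles wide, 32x32 elements per tile
golden = rng.standard_normal((131 * 32, 4 * 32)).astype(np.float32)

# If only the rows past the first 130 tiles are garbage, PCC drops a little;
# the more of the tail that is corrupted, the lower PCC goes.
actual = golden.copy()
actual[130 * 32:, :] = rng.standard_normal(actual[130 * 32:, :].shape)
print(pcc(golden, actual))  # slightly below 1.0
```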
I can continue the hunt once we bring the test to the point where it fails for a single tile or at least some small number of them (like 4 tiles), just let me know.
I also tried on WH; all input shapes and test variants pass there.
Hey @bbradelTT, you can also take a look at #14594. As you can see, there the W dimension doesn't matter, or at least it is limited by the L1/CB size and not by the DEST size.
I think the 130 was mentioned in https://github.com/tenstorrent/tt-metal/issues/14609#issuecomment-2491915367. I tried the fix in https://github.com/tenstorrent/tt-llk-bh/pull/50 as well as https://github.com/tenstorrent/tt-llk-bh/compare/main...nvelickovic/fix_pack_tilize. Neither of them made the tests pass.
The PCC changes non-deterministically and is different for each run:
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7797403340938521
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7788630498131559
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7796628078216301
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7783906940655255
The main factor in the rejected fix seems to be https://github.com/tenstorrent/tt-llk-bh/commit/464f00bdf3972d09cae21393820664c8e74d22c5. With that change in, I got the following:
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7800424651589892
@bbradelTT please wait an hour or so, I think I might have a solution for you.
@bbradelTT Unfortunately, what I thought was the fix for #14609 was not for your issue, will continue the investigation.
Command to run: pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision
All the errors are float32: