tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

misc/test_layernorm.py::test_layernorm_mix_precision mismatch #14352

Open bbradelTT opened 1 month ago

bbradelTT commented 1 month ago

Command to run: pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision

All of the failures are for float32 inputs:

=============================================================== short test summary info ===============================================================
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7800511474028167
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.696830280445704
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.9028278880387816
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.728569147644891
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.953830626775725
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.8799854050396232
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7825994085803905
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7017235847606049
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.8987713880620999
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7415875788499321
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.9044729047582791
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.8868642977685586
=============================================== 12 failed, 24 passed, 36 warnings in 264.05s (0:04:24) ================================================
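The grouping of failures by input dtype can be checked mechanically rather than by eye. The following is a minimal sketch that tallies a pytest short summary by the dtype field of the test id. The helper name `tally_by_in_dtype` is hypothetical, the paths are shortened, and the assumed id layout (variant first, input dtype second) is inferred from the ids in the log, not taken from the test source.

```python
import re
from collections import Counter

# A few summary lines copied from the log above (paths shortened for brevity).
SUMMARY = """\
PASSED test_layernorm.py::test_layernorm_mix_precision[add_LN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED test_layernorm.py::test_layernorm_mix_precision[add_LN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
FAILED test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7800511474028167
FAILED test_layernorm.py::test_layernorm_mix_precision[LN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.8987713880620999
"""

# Status, then the node id, then the parametrization id in square brackets.
LINE_RE = re.compile(r"^(PASSED|FAILED) \S+\[([^\]]+)\]")


def tally_by_in_dtype(summary: str) -> Counter:
    """Count (in_dtype, status) pairs in a pytest short test summary."""
    counts = Counter()
    for line in summary.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # separator lines and the final totals line
        status, test_id = match.groups()
        # Assumed id layout: <variant>-<in_dtype>-<other fields...>
        in_dtype = test_id.split("-")[1]
        counts[(in_dtype, status)] += 1
    return counts


counts = tally_by_in_dtype(SUMMARY)
for (dtype, status), n in sorted(counts.items()):
    print(f"{dtype:10s} {status}: {n}")
```

Pasting the full summary into `SUMMARY` should show every FAILED entry under FLOAT32 if the observation above holds.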

abhullar-tt commented 4 weeks ago

This suite passes when fp32_dest_acc_en is disabled in misc/test_layernorm.py. @ncvetkovicTT, could you help take a look at this?

ncvetkovicTT commented 1 week ago

Hey @abhullar-tt, @bbradelTT. Please take a look at ncvetkovic/14352_layernorm_fail so that we can continue the investigation. For simplicity, I run only test_id=0 and in_dtype=float32 there. I then experimented with input shapes, because I suspected that neither the input width nor the input height can be larger than 4 fp32 tiles: when we have DstSync::SyncHalf, that is how many tiles fit in half of the DEST.

Also, if the width is kept at 4 tiles, the height can't be larger than 130 tiles. I assume this is a core grid constraint, because I remember seeing that number somewhere, but I am not sure.
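The tile arithmetic behind these two paragraphs can be sketched as follows. This is an illustration only: the constant `DEST_TILES_16BIT = 16` is an assumption chosen so that the result matches the figure quoted above (4 fp32 tiles in half of DEST), and the helper names `dest_tiles_available` and `tile_counts` are hypothetical, not part of tt-metal.

```python
TILE_DIM = 32  # a tile is 32x32 elements, so shape [1, 1, 32, 32] is one tile

# Assumption for illustration: the full DEST register holds 16 tiles of
# 16-bit data, so 32-bit (fp32) accumulation halves that capacity.
DEST_TILES_16BIT = 16


def dest_tiles_available(fp32_dest_acc_en: bool, sync_half: bool = True) -> int:
    """Tiles that fit in the usable part of DEST under the assumptions above."""
    tiles = DEST_TILES_16BIT // (2 if fp32_dest_acc_en else 1)
    # With half-sync, only half of DEST is available to the math thread at a time.
    return tiles // 2 if sync_half else tiles


def tile_counts(shape):
    """Return (Ht, Wt) for an [N, C, H, W] shape with tile-aligned H and W."""
    _, _, h, w = shape
    if h % TILE_DIM or w % TILE_DIM:
        raise ValueError("H and W must be multiples of the tile dimension")
    return h // TILE_DIM, w // TILE_DIM


print(dest_tiles_available(fp32_dest_acc_en=True))   # fp32, half-sync
print(dest_tiles_available(fp32_dest_acc_en=False))  # 16-bit, half-sync
print(tile_counts([1, 1, 130 * 32, 4 * 32]))         # largest passing shape at Wt=4
```

Under these assumptions the fp32 half-sync case yields 4 tiles, which is why Wt=4 was the first suspect.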

Now there are a couple of things I don't understand here:

  1. If I keep Wt at 2 tiles, I can do whatever I want with N, C and Ht. If I set Wt=3, the same constraint applies as with Wt=4.
  2. If I run the tests for all in_dtypes and all of the shapes that are uncommented in my branch, the last six examples fail like this (a drawback is that I don't know which of the uncommented shapes this was for, but the test passed for [1,1,32,32] when I ran it again later):
=================================================================================================================================================================================================== short test summary info ====================================================================================================================================================================================================
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_RMSN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-BFLOAT16-BFLOAT16-in0_L1-in0_L1]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-BFLOAT8_B-BFLOAT16-in0_L1-in0_L1]
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7051563259499515
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7510356020184914
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[LN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.8523632833423422
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7096639868818564
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_G-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7260290469183038
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[RMSN_GB-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.8573792455041984
========================================================================================================================================================================================== 6 failed, 30 passed, 36 warnings in 57.88s ==========================================================================================================================================================================================

Please note again my comments in the test_dims array. Look closely at how the PCC changes as we increase the height of the block from 130 to 131 tiles and beyond: the PCC gradually decreases, meaning that the first 130x4 tiles are transferred and processed correctly, but the rest are not (they are either transferred incorrectly or not processed properly by the compute kernel).

I can continue the hunt once we bring the test to the point where it fails for a single tile, or at least for a small number of them (like 4 tiles); just let me know.
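To illustrate why a gradual PCC drop points at a correct prefix followed by a bad tail, here is a toy model in plain Python. It is not the device behaviour: it assumes the first 130 tile rows of the output equal the reference exactly and that every later row is replaced by unrelated random data. The function names and the `row_len` parameter are hypothetical.

```python
import math
import random


def pcc(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def pcc_with_bad_tail(total_rows, good_rows, row_len=64, seed=0):
    """PCC when only the first `good_rows` rows of the output are correct."""
    rng = random.Random(seed)
    ref = [rng.gauss(0.0, 1.0) for _ in range(total_rows * row_len)]
    out = list(ref)
    # Overwrite everything past the good prefix with unrelated values.
    for i in range(good_rows * row_len, len(out)):
        out[i] = rng.gauss(0.0, 1.0)
    return pcc(ref, out)


for ht in (130, 131, 140, 160, 200):
    print(ht, round(pcc_with_bad_tail(ht, good_rows=130), 4))
```

In this model the PCC stays roughly equal to the fraction of correct rows, so it slides down smoothly as Ht grows past 130 instead of collapsing at once.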

ncvetkovicTT commented 1 week ago

I also tried on WH; all input shapes and test variants pass there.

ncvetkovicTT commented 1 week ago

Hey @bbradelTT, you can also take a look at #14594. As you can see there, the W dimension doesn't matter, or at least it is limited by L1/CB size and not by DEST size.

bbradelTT commented 5 days ago

I think the 130 was mentioned in https://github.com/tenstorrent/tt-metal/issues/14609#issuecomment-2491915367. I tried the fix in https://github.com/tenstorrent/tt-llk-bh/pull/50 as well as https://github.com/tenstorrent/tt-llk-bh/compare/main...nvelickovic/fix_pack_tilize.

Neither of them made the tests pass.

The PCC changes non-deterministically and is different for each run:

FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7797403340938521
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7788630498131559
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7796628078216301
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7783906940655255
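The run-to-run variation can be quantified directly from the four lines above. This is a minimal sketch using only the values quoted in this comment; the test ids are shortened and nothing here comes from the test suite itself.

```python
import re
import statistics

# The four repeated runs quoted above (test ids shortened).
RUNS = """\
FAILED ...[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7797403340938521
FAILED ...[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7788630498131559
FAILED ...[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7796628078216301
FAILED ...[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7783906940655255
"""

# The assertion message carries the measured PCC.
values = [float(v) for v in re.findall(r"AssertionError: ([0-9.]+)", RUNS)]
spread = max(values) - min(values)

print(f"runs   = {len(values)}")
print(f"min    = {min(values):.6f}")
print(f"max    = {max(values):.6f}")
print(f"spread = {spread:.6f}")
print(f"stdev  = {statistics.stdev(values):.6f}")
```

The spread is on the order of 1e-3, so a candidate fix has to move the PCC by clearly more than that before the change can be attributed to the fix rather than to run-to-run noise.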

The rejected fix main factor seems to be https://github.com/tenstorrent/tt-llk-bh/commit/464f00bdf3972d09cae21393820664c8e74d22c5 made the PCC When I had that in I got the following:

FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_layernorm.py::test_layernorm_mix_precision[add_LN-FLOAT32-BFLOAT16-in0_L1-in0_L1] - AssertionError: 0.7800424651589892

ncvetkovicTT commented 5 days ago

@bbradelTT please wait an hour or so, I think I might have a solution for you.

ncvetkovicTT commented 4 days ago

@bbradelTT Unfortunately, what I thought was the fix turned out to address #14609 and not your issue. I will continue the investigation.