tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Some TTNN operations fail with low PCC error [Grayskull and Wormhole] #5893

Open banekg opened 7 months ago

banekg commented 7 months ago

Describe the bug: Some TTNN operations fail with a low PCC error in some test cases on both Grayskull and Wormhole chips.

TTNN operations which failed are sin, tan, sqrt and rsqrt (the operations covered by the unit tests and sweep tests listed below).

To Reproduce: Steps to reproduce the behavior on both chips, GS and WH:

  1. Check out the barsic/ttnn-ops branch (to be merged into main soon)
  2. Run unit test test_eltwise_sin.py using this command: pytest tests/ttnn/python_api_testing/non_working_unit_tests/grayskull/test_eltwise_sin.py

Expected behavior: The test cases in the unit test tests/ttnn/python_api_testing/non_working_unit_tests/grayskull/test_eltwise_sin.py are all expected to fail with a low PCC error.
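For readers unfamiliar with the metric, here is a minimal sketch of a PCC check, assuming PCC means the Pearson correlation coefficient between a golden reference and the device output. The function names and the 0.99 threshold are hypothetical and are not taken from the test suite.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(expected) != len(actual):
        raise ValueError("sequences must have the same length")
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        # Correlation is undefined for constant data; fall back to exact equality.
        return 1.0 if list(expected) == list(actual) else 0.0
    return cov / math.sqrt(var_e * var_a)


def passes_pcc(expected, actual, threshold=0.99):
    """Hypothetical pass/fail rule: PCC must reach the threshold."""
    return pcc(expected, actual) >= threshold


# Golden values from the host math library, plus two stand-in "device" outputs.
golden = [math.sin(0.1 * i) for i in range(100)]
slightly_off = [g + 1e-4 for g in golden]   # small constant error
inverted = [-g for g in golden]             # badly wrong output

print(passes_pcc(golden, slightly_off))
print(passes_pcc(golden, inverted))
```

A low PCC therefore indicates that the output does not track the reference, not merely that individual values are slightly off.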

Other unit tests (for Grayskull, and Wormhole):

tests/ttnn/python_api_testing/non_working_unit_tests/grayskull/test_eltwise_tan.py
tests/ttnn/python_api_testing/non_working_unit_tests/grayskull/test_eltwise_sqrt.py
tests/ttnn/python_api_testing/non_working_unit_tests/grayskull/test_eltwise_rsqrt.py

Getting additional info for the operation under test and its behavior: To get additional information and results for the different combinations of input shapes, types, layouts and memory configs with which this operation was tested, you can also run the sweeps for tt_lib.tensor.complex_abs locally and check the results. To do this:

  1. Follow the Getting Started page to set up the repo, environment variables and python-env
  2. Activate the environment: source build/python_env/bin/activate
  3. Run sweeps by using python tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py -i tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttnn_eltwise_sin_test.yaml -o /home/ubuntu/tt-metal/ttnn-test-sweeps/sin
  4. After the run completes, all sweep results should be available inside the specified output directory (in this case /home/ubuntu/tt-metal/ttnn-test-sweeps/sin). There you will find eltwise_sin_sweep.csv, which holds all executed sweeps, including the ones that failed and were recreated by the unit test. You can find those by searching for their unique data_seed field.
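The lookup by data_seed can be sketched as follows. This is only an illustration: apart from data_seed, the column names and the sample rows are hypothetical, and the real eltwise_sin_sweep.csv may be laid out differently.

```python
import csv
import io


def rows_for_seed(csv_file, data_seed, seed_column="data_seed"):
    """Return every row whose seed column matches the given data_seed."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row.get(seed_column) == str(data_seed)]


# Stand-in for the sweep results file; columns other than data_seed are made up.
sample = io.StringIO(
    'data_seed,input_shape,status\n'
    '101,"[1, 1, 32, 32]",pass\n'
    '202,"[1, 1, 64, 64]",fail\n'
)

matches = rows_for_seed(sample, 202)
for row in matches:
    print(row)
```

With a real results file, replace the StringIO object with open(path, newline="") on the CSV in the output directory.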

Other sweep tests:

tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttnn_eltwise_tan_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttnn_eltwise_sqrt_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/ttnn_eltwise_rsqrt_test.yaml

eyonland commented 7 months ago

None of these tests actually fail the pipeline, so it's not easy to know whether this is a new problem or an existing one. I see changes from as far back as three weeks ago, in commits referencing #5137, that could have impacted the PCC.

@banekg, can you clarify whether this is a new issue where the PCC dropped? If so, do you know what the prior PCC was on an example test?

My thought is that we might be able to use this information to do a binary search through the list of commits to isolate which one actually introduced the new failures.
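A minimal sketch of that binary search, assuming the commits are ordered oldest to newest, the newest one is known to be bad, and every commit after the first bad one is also bad. The commit names and the is_bad callback are hypothetical; in practice the callback would check out the commit and run the failing test, which is what git bisect automates.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search an oldest-to-newest commit list for the first bad commit.

    Assumes commits[-1] is bad and that badness is monotonic:
    once a commit is bad, all later commits are bad too.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


# Hypothetical history in which the PCC regression landed in "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
checked = []


def is_bad(commit):
    checked.append(commit)    # record how many commits had to be tested
    return history.index(commit) >= 5


culprit = first_bad_commit(history, is_bad)
print(culprit, "found after testing", len(checked), "commits")
```

The number of test runs grows with the logarithm of the history length, which is why a known-good prior PCC value is useful as a starting point.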

banekg commented 7 months ago

Hi @eyonland,

Thanks for the comment. We developed new test sweeps which cover a large number of different input shapes for each of the operations listed above. These tests have not been run before. It is also important to mention that the operations do not fail for all use cases, only for some specific ones.

ruthreshx commented 1 month ago

Hi @banekg , Please find the doc link from the latest main. https://docs.google.com/document/d/1kgvpwHNC3uBAyq0ampPGPSt0NhJQoUjB9B6aj1EZdxA/edit

nemanjagrujic commented 1 month ago

Hello @ruthreshx, @banekg is no longer working for TT. Can you please share the doc with me (ngrujic@tenstorrent.com)?

ruthreshx commented 1 month ago

Hi @nemanjagrujic , Please verify https://docs.google.com/document/d/1kgvpwHNC3uBAyq0ampPGPSt0NhJQoUjB9B6aj1EZdxA/edit

ruthreshx commented 1 month ago

Hi @nemanjagrujic, @eyonland, can we remove Sinh and polygamma? Both passed on GS and WH. Tan supports only the range from -1.45 to 1.45, and Sqrt and RSqrt support only a specific range; this will be handled during the doc update.

I am not sure about the Sin op. I have verified the LLK and Ckernel implementations as well, and I will discuss the Sin issue with the TT team. WH passed, GS failed.
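To illustrate the Tan range mentioned above, here is a small sketch that draws test inputs only from [-1.45, 1.45]. The helper name and the sampling approach are hypothetical and are not how the sweep tests generate data; the reference values come from the host math library, not from the device.

```python
import math
import random

# Range quoted in the comment above for the Tan op.
TAN_LOW, TAN_HIGH = -1.45, 1.45


def sample_tan_inputs(count, seed=0):
    """Draw `count` inputs uniformly from the supported Tan range."""
    rng = random.Random(seed)
    return [rng.uniform(TAN_LOW, TAN_HIGH) for _ in range(count)]


inputs = sample_tan_inputs(1000)
largest = max(abs(math.tan(x)) for x in inputs)

# Inside the range the reference stays bounded; close to pi/2 it blows up.
print("largest |tan(x)| inside the range:", largest)
print("|tan(x)| just below pi/2:", abs(math.tan(math.pi / 2 - 1e-6)))
```

Since tan has a pole at pi/2, inputs beyond the quoted range produce very large reference values, which makes a low PCC there unsurprising.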

nemanjagrujic commented 1 month ago

@ruthreshx Removed sinh and polygamma.

ruthreshx commented 1 month ago

> @ruthreshx Removed sinh and polygamma.

Thanks @nemanjagrujic