tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
372 stars 43 forks source link

tt_lib.tensor.reduce operation failed with low PCC [Grayskull] #7007

Open banekg opened 4 months ago

banekg commented 4 months ago

Describe the bug tt_lib.tensor.reduce operation, with math_op='max' and dim='H', breaks with low PCC value error in some test cases. tt_lib.tensor.reduce operation, with math_op='sum' and dim='H', breaks with low PCC value error in some test cases.

To Reproduce Steps to reproduce the behavior:

  1. Checkout main branch
  2. Run unit tests test_reduce_max_h.py and test_reduce_sum_h.py using these commands: pytest tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_reduce_max_h.py pytest tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_reduce_sum_h.py

Expected behavior There is a test case presented in the unit test tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_reduce_max_h.py and it is are expected to fail with low PCC value.

Getting Additional info for the operation under test and its behavior To get additional information and results for different combinations of input shapes, types, layouts and memory configs for which this operation was tested you can also run locally sweeps for tt_lib.tensor.reduce and check the results. To do this you should:

  1. Follow the Getting Started page to setup the repo, environment variables and python-env
  2. Run sweeps by using: pytest tests/tt_eager/python_api_testing/sweep_tests/run_sweep_test.py --input-path tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/pytorch_reduce_max_h_test.yaml --input-method cli --cli-input results_reduce_max_h

or

pytest tests/tt_eager/python_api_testing/sweep_tests/run_sweep_test.py --input-path tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/pytorch_reduce_sum_h_test.yaml --input-method cli --cli-input results_reduce_sum_h

  1. After the run is completed all test sweeps results should be available inside specified output directory.
umadevimcw commented 4 months ago
image

@banekg Are you still facing this issue? When I test this test today it is passed. Please find the above image For sweep test I didn't find such yaml file

umadevimcw commented 4 months ago

@jliangTT @banekg The error is not consistent because on repeated runs of the above test, sometimes the test passes and sometimes the test fails (a few values were zero and instead of storing max )

@jliangTT Need help from TT to debug on this

umadevimcw commented 4 months ago

@jliangTT and @banekg This issue is not occurring in WHB0. Even after repetitive runs, the test is passing consistently with no mismatch. Since it is happening only in GS, I suspect the problem might be with data synchronization (as some times tests are passing as well) and not with the logic involved in the computation.

@jliangTT Let me know how to proceed further.

The same applicable to the sum test.

In the image, you can find that in the TT tensor output, the first few values are zero instead of values (same scenario for both tests max and sum in GS)

Red color -- TT data Green color -- Torch data

image
jliangTT commented 4 months ago

hey this is good progress. I am going to mark this as blocked. But as this is p2, let's steer our effort to other high priority issue.

banekg commented 4 months ago

Hi @umadevimcw, I'm still getting failed unit tests, for both operations.

umadevimcw commented 4 months ago

@jliangTT @banekg Keeping the input memconfig as SYSTEM_MEMORY is creating an issue. Updated the config and created the PR. With these changes tests are passing

umadevimcw commented 4 months ago

@banekg Please find the PR #7929

umadevimcw commented 3 months ago

@tt-aho Please share your inputs on this. When the input memconfig is system memory we are facing mismatch issue as shown in the image here

https://github.com/tenstorrent/tt-metal/issues/7007#issuecomment-2058772702

tt-aho commented 3 months ago

I will need to take a look. To confirm the details, this only happens on GS?

And input memconfig is system memory means we are passing TT tensors on host to the op?

nemanjagrujic commented 3 months ago

I will need to take a look. To confirm the details, this only happens on GS?

And input memconfig is system memory means we are passing TT tensors on host to the op?

This only happens in Grayskull / ROW_MAJOR / SYSTEM_MEMORY combination.

@tt-aho @umadevimcw