banekg opened this issue 4 months ago
@banekg Are you still facing this issue? When I ran this test today it passed; please see the image above. For the sweep test I didn't find such a yaml file.
@jliangTT @banekg The error is not consistent: on repeated runs of the above test it sometimes passes and sometimes fails (a few output values were zero instead of the stored max).
@jliangTT Need help from TT to debug on this
@jliangTT and @banekg This issue is not occurring on WH B0. Even after repeated runs, the test passes consistently with no mismatch. Since it happens only on GS, I suspect the problem is data synchronization (since the tests sometimes pass as well) rather than the logic involved in the computation.
@jliangTT Let me know how to proceed further.
The same applies to the sum test.
In the image, you can see that the first few values of the TT tensor output are zero instead of the expected values (the same scenario for both the max and sum tests on GS).
Red color -- TT data
Green color -- Torch data
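A minimal NumPy sketch (not using tt_lib; the mismatch pattern is simulated, not reproduced on device) of why a handful of zeroed leading values is enough to drag the PCC below a pass threshold:

```python
import numpy as np

def pcc(a, b):
    # Pearson correlation coefficient between two flattened arrays
    return np.corrcoef(np.ravel(a), np.ravel(b))[0, 1]

rng = np.random.default_rng(0)
golden = rng.standard_normal((1, 1, 32, 32)).astype(np.float32)

device_out = golden.copy()
device_out.ravel()[:8] = 0.0  # simulate the zeroed first values seen on GS

print(pcc(golden, golden))      # exact match: 1.0
print(pcc(golden, device_out))  # noticeably below 1.0
```

Even 8 zeroed values out of 1024 measurably lower the correlation, which matches the intermittent low-PCC failures described above.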
Hey, this is good progress. I am going to mark this as blocked. But as this is P2, let's steer our effort to other high-priority issues.
Hi @umadevimcw, I'm still getting failed unit tests, for both operations.
@jliangTT @banekg Keeping the input memconfig as SYSTEM_MEMORY is causing the issue. I updated the config and created a PR; with these changes the tests pass.
@banekg Please find the PR #7929
@tt-aho Please share your input on this. When the input memconfig is system memory, we see the mismatch shown in the image here:
https://github.com/tenstorrent/tt-metal/issues/7007#issuecomment-2058772702
I will need to take a look. To confirm the details, this only happens on GS?
And does "input memconfig is system memory" mean we are passing host-resident TT tensors to the op?
This only happens with the Grayskull / ROW_MAJOR / SYSTEM_MEMORY combination.
@tt-aho @umadevimcw
Describe the bug
The tt_lib.tensor.reduce operation with math_op='max' and dim='H' breaks with a low PCC value error in some test cases. The same operation with math_op='sum' and dim='H' also breaks with a low PCC value error in some test cases.
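For reference, a hedged sketch of the golden computation the tests compare against: reduce with math_op='max' or math_op='sum' over dim='H' should agree with a plain max/sum over the H axis. The real tests compute the golden with torch; NumPy is used here to keep the sketch self-contained.

```python
import numpy as np

# Random input in NCHW layout; H is axis 2
x = np.random.default_rng(0).standard_normal((1, 1, 64, 32)).astype(np.float32)

# keepdims preserves the rank so the result can be compared
# element-wise against the device output
golden_max = x.max(axis=2, keepdims=True)  # shape (1, 1, 1, 32)
golden_sum = x.sum(axis=2, keepdims=True)  # shape (1, 1, 1, 32)
```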
To Reproduce
Steps to reproduce the behavior:
On the main branch, run test_reduce_max_h.py and test_reduce_sum_h.py using these commands:
pytest tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_reduce_max_h.py
pytest tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_reduce_sum_h.py
Expected behavior
There is a test case presented in the unit test tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_reduce_max_h.py, and it is expected to fail with a low PCC value.

Getting additional info for the operation under test and its behavior
To get additional information and results for different combinations of input shapes, types, layouts, and memory configs for which this operation was tested, you can also run the sweeps for tt_lib.tensor.reduce locally and check the results. To do this you should:
Follow the Getting Started page to set up the repo, environment variables, and python-env, then run:
pytest tests/tt_eager/python_api_testing/sweep_tests/run_sweep_test.py --input-path tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/pytorch_reduce_max_h_test.yaml --input-method cli --cli-input results_reduce_max_h
or
pytest tests/tt_eager/python_api_testing/sweep_tests/run_sweep_test.py --input-path tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/pytorch_reduce_sum_h_test.yaml --input-method cli --cli-input results_reduce_sum_h
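The pass/fail criterion the sweep and unit tests apply can be sketched as a Pearson-correlation check. The function name and the 0.99 threshold below are assumptions for illustration, not the repo's actual helper:

```python
import numpy as np

def comp_pcc(golden, actual, threshold=0.99):
    # Hypothetical helper: returns (passed, pcc) for two tensors,
    # mirroring the low-PCC failure mode described in this issue.
    p = np.corrcoef(np.ravel(golden), np.ravel(actual))[0, 1]
    return p >= threshold, p
```

A sweep run like the ones above is reported as broken when the op output's PCC against the torch golden drops below the threshold.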