tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

ttl.tensor.fill_bw operation fails with low PCC #5145

Open banekg opened 9 months ago

banekg commented 9 months ago

The ttl.tensor.fill_bw operation fails with low PCC in some test cases on both Wormhole and Grayskull cards.

To Reproduce

Steps to reproduce the behavior:

1. Check out the main branch (soon to be merged into main).
2. Run the unit test tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_backward_fill.py using this command:

pytest tests/tt_eager/python_api_testing/non_working_unit_tests/grayskull/test_backward_fill.py

Expected behavior

The unit test contains a test case that fails with a low PCC error. For example, the output can be:

 Max ATOL Delta: nan, Max RTOL Delta: nan, PCC: 0.0 , PCC check failed
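For readers unfamiliar with the metric, PCC here refers to the Pearson correlation coefficient between the golden (PyTorch) output and the device output. The sketch below is not the tt-metal comparison code; it is a minimal stdlib-only illustration showing that a single NaN in the compared output makes the correlation undefined. Mapping a non-finite result to 0.0 in `report_pcc` is an assumption made to mirror the `PCC: 0.0` line above, and the function names are hypothetical.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    denom = math.sqrt(var_e * var_a)
    # A NaN anywhere in the inputs propagates into cov and denom.
    if denom == 0 or math.isnan(denom) or math.isnan(cov):
        return float("nan")
    return cov / denom


def report_pcc(expected, actual):
    # Assumption: a non-finite PCC is reported as 0.0 (hypothetical harness behavior).
    value = pcc(expected, actual)
    return value if math.isfinite(value) else 0.0


golden = [1.0, 2.0, 3.0, 4.0]
matching = [1.0, 2.0, 3.0, 4.0]
with_nan = [1.0, float("nan"), 3.0, 4.0]

print(report_pcc(golden, matching))   # identical data correlates perfectly
print(report_pcc(golden, with_nan))   # a NaN makes the raw PCC undefined
```

Under that assumption, NaN deltas together with a PCC of 0.0 would point to non-finite values in one of the compared tensors rather than to a small numerical mismatch.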

Getting Details

You can test more combinations using the test sweep configs:

tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/pytorch_backward_fill_test.yaml

or

tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/pytorch_backward_fill_test.yaml

with a command such as:

python tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py -i tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/grayskull/pytorch_backward_fill_test.yaml -o ./result-sweeps

The results will be saved in the ./result-sweeps folder.

muthutt commented 9 months ago

Can you check if this is the same problem? https://github.com/tenstorrent-metal/tt-metal/pull/5113

On Tue, Feb 6, 2024 at 3:20 AM banekg wrote:

Assigned #5145 (https://github.com/tenstorrent-metal/tt-metal/issues/5145) to @muthutt.

banekg commented 9 months ago

Hi @muthutt ,

I checked the #5113 solution, and it is not the same issue. I used your approach, x = torch.where(x.abs() > 1e-3, x, 1e-3), and the test still failed with a low PCC error.
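To make the quoted expression concrete, here is a minimal pure-Python sketch of what torch.where(x.abs() > 1e-3, x, 1e-3) does elementwise. The helper name `clamp_small` is hypothetical and plain lists stand in for tensors; it only illustrates the input preprocessing and says nothing about why fill_bw itself fails.

```python
def clamp_small(values, eps=1e-3):
    """Elementwise analogue of torch.where(x.abs() > eps, x, eps).

    Entries whose magnitude is at most eps are replaced by +eps;
    all other entries are kept unchanged.
    """
    return [v if abs(v) > eps else eps for v in values]


x = [0.0, 5e-4, -5e-4, 0.25, -2.0]
print(clamp_small(x))  # [0.001, 0.001, 0.001, 0.25, -2.0]
```

Note that small negative entries become positive eps, so the replacement moves inputs away from zero but does not preserve their sign.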

ntarafdar commented 8 months ago

@banekg what's the status on this?

banekg commented 8 months ago

@tarafdarTT For the four available test cases, I'm still getting PCC: 0.0.