tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

isfinite, isinf, isposinf, isneginf and isnan fail with low PCC on Wormhole cards #4409

Open JelenaTosicJtosic opened 10 months ago

JelenaTosicJtosic commented 10 months ago

Multiple operations that check whether a tensor contains inf or NaN values fail on Wormhole cards with 0 PCC in all combinations (any memory layout, type or buffer layout). The problem happens for both the tt_lib.tensor and ttnn variants.

  1. tt_lib.tensor.isfinite and ttnn.isfinite operations break with low PCC error in all test cases.
  2. tt_lib.tensor.isinf and ttnn.isinf operations break with low PCC error in all test cases.
  3. tt_lib.tensor.isposinf and ttnn.isposinf operations break with low PCC error in all test cases.
  4. tt_lib.tensor.isneginf and ttnn.isneginf operations break with low PCC error in all test cases.
  5. tt_lib.tensor.isnan and ttnn.isnan operations break with low PCC error in all test cases.
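For reference, the element-wise semantics these five operations are expected to match can be sketched in pure Python with the standard `math` module. This is only an illustration of the expected results; `reference_checks` is a hypothetical helper name (not part of tt_lib or ttnn), and encoding the result as 1.0/0.0 is an assumption.

```python
import math

def reference_checks(x: float) -> dict:
    """Return the expected 1.0/0.0 flag of each predicate for one float."""
    return {
        "isfinite": float(math.isfinite(x)),   # neither inf nor NaN
        "isinf": float(math.isinf(x)),         # +inf or -inf
        "isposinf": float(x == math.inf),      # +inf only
        "isneginf": float(x == -math.inf),     # -inf only
        "isnan": float(math.isnan(x)),         # NaN only
    }

# Show the flags for a finite value, both infinities, and NaN.
for value in (1.5, math.inf, -math.inf, math.nan):
    print(value, reference_checks(value))
```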

To Reproduce

Steps to reproduce the behavior:

  1. Check out the main branch
  2. Run the unit tests for the different ops using the commands:
pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_isfinite.py
pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_isinf.py
pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_isposinf.py
pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_isneginf.py
pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_isnan.py

for the tt_lib variant, or:

pytest tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_isfinite.py
pytest tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_isinf.py
pytest tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_isnan.py
pytest tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_isneginf.py
pytest tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_isposinf.py

for the ttnn variant.

Expected behavior

Each of the unit tests mentioned above contains two test cases. On GS only isnan is expected to fail, while on WH all of the units are expected to fail with a low PCC error. For example, one of the tests is expected to fail with this result:

Max ATOL Delta: 1.0, Max RTOL Delta: nan, PCC: 0.0, Equal check failed. 

Also, every unit test prints the following warning when run: WARNING | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:37 - One tensor is all zero
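The reported numbers (Max ATOL Delta of 1.0, PCC of 0.0, and the "One tensor is all zero" warning) are consistent with a device output that never sets the flag. A minimal pure-Python sketch of why: the Pearson correlation is undefined when one input has zero variance. The `pcc` function below is a hypothetical stand-in, and returning 0.0 in the zero-variance case is an assumption; the actual `get_pcc` implementation is not shown in this issue.

```python
import math

def pcc(expected, actual):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either input has zero variance (assumed convention)."""
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0.0 or var_a == 0.0:
        return 0.0  # correlation is undefined for a constant tensor
    return cov / math.sqrt(var_e * var_a)

expected = [1.0, 0.0, 0.0, 1.0]  # reference flags: elements 0 and 3 are inf
actual = [0.0, 0.0, 0.0, 0.0]    # device output that never sets the flag

print(pcc(expected, actual))                               # 0.0
print(max(abs(e - a) for e, a in zip(expected, actual)))   # 1.0
```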

Note: The unit test test_isnan.py for Grayskull has the same code as the unit test for WH, so to avoid code duplication we use the same unit test to reproduce the errors on GS.

Getting additional info for the operation under test and its behavior

To get additional information and results for the different combinations of input shapes, types, layouts and memory configs for which this operation was tested, you can also run the sweeps locally:

tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests/broken_wormhole/pytorch_eltwise_isfinite_test.yaml
tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests/broken_wormhole/pytorch_eltwise_isinf_test.yaml
tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests/broken_wormhole/pytorch_eltwise_isnan_test.yaml
tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests/broken_wormhole/pytorch_eltwise_isneginf_test.yaml
tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests/broken_wormhole/pytorch_eltwise_isposinf_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/ttnn_eltwise_isfinite_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/ttnn_eltwise_isinf_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/ttnn_eltwise_isnan_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/ttnn_eltwise_isneginf_test.yaml
tests/ttnn/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/ttnn_eltwise_isposinf_test.yaml

To do this you should:

  1. Follow the Getting Started page to set up the repo, environment variables and python-env
  2. Activate the Python environment: source build/python_env/bin/activate
  3. Run sweeps by using python tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py -i tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests/broken_wormhole/pytorch_eltwise_isinf_test.yaml -o ./result-sweeps
  4. After the run is completed, all sweep test results should be available inside the specified output directory (in this case ./result-sweeps). There you will find eltwise_isinf_sweep.csv, which holds all executed sweeps.

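Once the sweep has finished, the results file can be inspected with a few lines of Python. The sketch below makes no assumption about the column names in eltwise_isinf_sweep.csv (they are not listed in this issue); it only reports the header and the number of rows. `summarize_sweep_csv` is a hypothetical helper name.

```python
import csv
import os

def summarize_sweep_csv(path):
    """Return (column names, number of data rows) for a sweep results CSV."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)  # reads the header and every data row
        return list(reader.fieldnames or []), len(rows)

# Path taken from the steps above; skipped if the sweep has not been run.
path = "./result-sweeps/eltwise_isinf_sweep.csv"
if os.path.exists(path):
    columns, count = summarize_sweep_csv(path)
    print("columns:", columns)
    print("executed sweeps:", count)
```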
muthutt commented 10 months ago

We cannot indicate NaN in our hardware SFPU for all 3 generations: GS, WH and BH.

umadevimcw commented 7 months ago

@jliangTT As mentioned in this comment https://github.com/tenstorrent-metal/tt-metal/issues/4409#issuecomment-1863973145 the issue is specific to hardware. How do we proceed further?

eyonland commented 6 months ago

@umadevimcw, my understanding is that our hardware does not support NaN as part of a design decision and that the values should return the Max/Min value for the respective type. Given this, I believe the correct thing here is to assume -Inf is the Min and +Inf is the Max for the dtype and have the test assert on this.

@davorchap, would you be able to confirm my understanding here?
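The proposal above (treat -Inf as the dtype Min and +Inf as the dtype Max, and have the test assert on that) could be sketched as a preprocessing step on the golden data. This is only an illustration under stated assumptions: bfloat16 is picked as an example dtype, `saturate_reference` is a hypothetical helper, and NaN is left untouched because its handling is not settled in this thread.

```python
import math

# Largest finite bfloat16 value (8 exponent bits, 7 mantissa bits).
BF16_MAX = (2.0 - 2.0 ** -7) * 2.0 ** 127

def saturate_reference(values, dtype_max=BF16_MAX):
    """Replace +inf/-inf in the golden data with +dtype_max/-dtype_max.

    NaN is passed through unchanged (assumption; not decided in the thread)."""
    out = []
    for v in values:
        if math.isinf(v):
            out.append(math.copysign(dtype_max, v))  # keep the sign of the infinity
        else:
            out.append(v)
    return out

golden = [1.0, math.inf, -math.inf]
print(saturate_reference(golden))
```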

eyonland commented 6 months ago

@tt-aho , would you be able to provide insight on the above questions here?

tt-aho commented 6 months ago

Hey Eyon, I don't really have any further insight into the above. What you proposed sounds reasonable to me if this is a limitation of our hw. Have we talked to llk team about this limitation to see if there are any workarounds?

umadevimcw commented 4 months ago

@hschoi4448 @razorback3 https://github.com/tenstorrent/tt-metal/issues/8944, https://github.com/tenstorrent/tt-metal/issues/8945#issuecomment-2146247945. Please look at these comments for this issue.

VirdhatchaniKN commented 1 week ago

This issue is now being handled here: https://github.com/tenstorrent/tt-metal/issues/14077. It will be assigned to TT, hence moving this to blocked.