Open JelenaTosicJtosic opened 10 months ago
We cannot indicate NaN in our hardware SFPU on any of the 3 generations: GS, WH and BH.
@jliangTT As mentioned in this comment https://github.com/tenstorrent-metal/tt-metal/issues/4409#issuecomment-1863973145 the issue is specific to hardware. How do we proceed further?
@umadevimcw, my understanding is that our hardware does not support NaN as part of a design decision and that the values should return the Max/Min value for the respective type. Given this, I believe the correct thing here is to assume -Inf is the Min and +Inf is the Max for the dtype and have the test assert on this.
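To make the proposal concrete, here is a minimal sketch of the comparison the test could use. It assumes the hardware saturates +Inf to the largest finite value of the dtype and -Inf to the smallest. The helper names (`saturate`, `matches_saturated`) and the `DTYPE_MAX` table are hypothetical and not part of tt-metal, and NaN handling is deliberately left out because that behavior is still unconfirmed in this thread.

```python
import math

# Largest finite magnitude per format, derived from the format definitions:
# both use 8 exponent bits; float32 has 23 mantissa bits, bfloat16 has 7.
DTYPE_MAX = {
    "float32": (2 - 2**-23) * 2.0**127,
    "bfloat16": (2 - 2**-7) * 2.0**127,
}


def saturate(value, dtype):
    """Map +/-Inf to the dtype's max/min finite value; leave other values as-is."""
    if math.isinf(value):
        return math.copysign(DTYPE_MAX[dtype], value)
    return value


def matches_saturated(expected, actual, dtype):
    """Hypothetical test check: compare after saturating both sides.

    Assumption: the device returns the dtype max/min where the reference
    (for example PyTorch) would hold +/-Inf.
    """
    return saturate(expected, dtype) == saturate(actual, dtype)
```

With this, a reference value of `float("inf")` compares equal to a device value of `DTYPE_MAX["bfloat16"]`, while ordinary finite values still have to match exactly.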
@davorchap, would you be able to confirm my understanding here?
@tt-aho , would you be able to provide insight on the above questions here?
Hey Eyon, I don't really have any further insight into the above. What you proposed sounds reasonable to me if this is a limitation of our hw. Have we talked to the LLK team about this limitation to see if there are any workarounds?
@hschoi4448 @razorback3 please look at the comments in https://github.com/tenstorrent/tt-metal/issues/8944 and https://github.com/tenstorrent/tt-metal/issues/8945#issuecomment-2146247945 for this issue.
This issue is now handled here: https://github.com/tenstorrent/tt-metal/issues/14077. It will be assigned to TT, hence moving this to blocked.
Multiple operations that check whether a tensor is inf or NaN fail on Wormhole cards with 0 PCC in all combinations (any memory layout, type or buffer layout). The problem happens for both the tt_lib.tensor and ttnn variants.
To Reproduce Steps to reproduce the behavior:
main branch
for tt_lib variant. Or:
for ttnn variant.
Expected behavior There are two test cases presented in each of the previously mentioned unit tests. For GS only isnan is expected to fail and for WH all other units are expected to fail with low PCC error. For example, one of the tests is expected to fail with this result:
Also, all of the units print the following warning when run:
WARNING | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:37 - One tensor is all zero
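The warning and the 0 PCC are likely two views of the same thing: the Pearson correlation is undefined when one input has zero variance, which is the case when the device output is all zeros. The sketch below illustrates this in plain Python. `pearson` and `pcc_or_zero` are hypothetical helpers written for this illustration, and the fallback to 0 is an assumption about how such a helper might report the undefined case, not a copy of `get_pcc`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant input (e.g. all zeros) has no variance to correlate with.
        return None
    return cov / math.sqrt(var_x * var_y)


def pcc_or_zero(expected, actual):
    """Hypothetical stand-in for a PCC helper that reports 0 when undefined."""
    result = pearson(expected, actual)
    return 0.0 if result is None else result


# Reference isinf output flags two infinities; the device flags none.
reference_flags = [0.0, 1.0, 0.0, 1.0]
device_flags = [0.0, 0.0, 0.0, 0.0]
pcc = pcc_or_zero(reference_flags, device_flags)
```

Under these assumptions `pcc` is 0.0 no matter where the infinities sit in the reference, which matches a failure in every layout and dtype combination.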
Note The unit test test_isnan.py for Grayskull has the same code as the unit test for WH, so to avoid code duplication we use the same unit test to reproduce the errors on GS.
Getting Additional info for the operation under test and its behavior To get additional information and results for the different combinations of input shapes, types, layouts and memory configs for which this operation was tested, you can also run the sweeps locally.
To do this you should:
Follow the Getting Started page to set up the repo, environment variables and python-env
source build/python_env/bin/activate
python tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py -i tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests/broken_wormhole/pytorch_eltwise_isinf_test.yaml -o ./result-sweeps
The results are written to eltwise_isinf_sweep.csv, which holds all executed sweeps.
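For a quick look at the sweep results, a small sketch like the one below could list the columns and count the rows. The column layout of the CSV is not described in this issue, so the code reads whatever header is present. `summarize_sweep_csv` is a hypothetical helper, and the location `./result-sweeps/eltwise_isinf_sweep.csv` is an assumption based on the `-o ./result-sweeps` argument in the command above.

```python
import csv


def summarize_sweep_csv(path):
    """Return (column names, number of data rows) for a sweep results CSV.

    No particular schema is assumed; the header row is taken as-is.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, len(rows)
```

Calling `summarize_sweep_csv("./result-sweeps/eltwise_isinf_sweep.csv")` after the run would show which columns are available before filtering for failing combinations.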