tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low-level kernel programming model.
Apache License 2.0

[Bug Report] Invalid recip result #6720

Open hschoi4448 opened 6 months ago

hschoi4448 commented 6 months ago

Describe the bug

The recip function returns an invalid value.

To Reproduce

Steps to reproduce the behavior:

1. Copy and paste the code below:

```python
# SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.

# SPDX-License-Identifier: Apache-2.0

import torch
import pytest
import tt_lib
from tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs import data_gen_pt_tt, compare_results

import ttnn
from tests.tt_eager.python_api_testing.sweep_tests import pytorch_ops


def data_gen_pt_tt(input_shapes, device, required_grad=False, val=1):
    pt_tensor = (torch.ones(input_shapes, requires_grad=required_grad) * val).bfloat16()
    tt_tensor = (
        tt_lib.tensor.Tensor(pt_tensor, tt_lib.tensor.DataType.BFLOAT16).to(tt_lib.tensor.Layout.TILE).to(device)
    )
    return pt_tensor, tt_tensor


@pytest.mark.parametrize(
    "input_shapes",
    ((torch.Size([1, 1, 32, 32])),),
)
def test1(input_shapes, device):
    print("==============================")
    print("recip")
    val = 0
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=val)

    print("input_tensor", input_tensor)

    golden_tensor = pytorch_ops.recip(in_data)
    tt_output_tensor_on_device = tt_lib.tensor.recip(input_tensor)

    print("tt_output_tensor_on_device", tt_output_tensor_on_device)
    print("golden_tensor", golden_tensor)

    print("==============================")
    print("div_unary")
    val = 1
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=val)
    print("input_tensor", input_tensor)

    golden_tensor = pytorch_ops.div_unary(in_data, scalar=0)
    tt_output_tensor_on_device = tt_lib.tensor.div_unary(input_tensor, scalar=0)
    print("tt_output_tensor_on_device", tt_output_tensor_on_device)
    print("golden_tensor", golden_tensor)
```

2. Run with pytest. Observed output:

```
==============================
recip
input_tensor ttnn.Tensor([[[[ 0.00000,  0.00000,  ...,  0.00000,  0.00000],
               [ 0.00000,  0.00000,  ...,  0.00000,  0.00000],
               ...,
               [ 0.00000,  0.00000,  ...,  0.00000,  0.00000],
               [ 0.00000,  0.00000,  ...,  0.00000,  0.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
tt_output_tensor_on_device ttnn.Tensor([[[[169476569462576773795235400185743933440.00000, 169476569462576773795235400185743933440.00000,  ..., 169476569462576773795235400185743933440.00000, 169476569462576773795235400185743933440.00000],
               [169476569462576773795235400185743933440.00000, 169476569462576773795235400185743933440.00000,  ..., 169476569462576773795235400185743933440.00000, 169476569462576773795235400185743933440.00000],
               ...,
               [169476569462576773795235400185743933440.00000, 169476569462576773795235400185743933440.00000,  ..., 169476569462576773795235400185743933440.00000, 169476569462576773795235400185743933440.00000],
               [169476569462576773795235400185743933440.00000, 169476569462576773795235400185743933440.00000,  ..., 169476569462576773795235400185743933440.00000, 169476569462576773795235400185743933440.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
golden_tensor tensor([[[[inf, inf, inf,  ..., inf, inf, inf],
          [inf, inf, inf,  ..., inf, inf, inf],
          [inf, inf, inf,  ..., inf, inf, inf],
          ...,
          [inf, inf, inf,  ..., inf, inf, inf],
          [inf, inf, inf,  ..., inf, inf, inf],
          [inf, inf, inf,  ..., inf, inf, inf]]]], dtype=torch.bfloat16,
       grad_fn=<ReciprocalBackward0>)
==============================
div_unary
input_tensor ttnn.Tensor([[[[ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
tt_output_tensor_on_device ttnn.Tensor([[[[inf     , inf     ,  ..., inf     , inf     ],
               [inf     , inf     ,  ..., inf     , inf     ],
               ...,
               [inf     , inf     ,  ..., inf     , inf     ],
               [inf     , inf     ,  ..., inf     , inf     ]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
golden_tensor tensor([[[[inf, inf, inf,  ..., inf, inf, inf],
          [inf, inf, inf,  ..., inf, inf, inf],
          [inf, inf, inf,  ..., inf, inf, inf],
          ...,
          [inf, inf, inf,  ..., inf, inf, inf],
          [inf, inf, inf,  ..., inf, inf, inf],
          [inf, inf, inf,  ..., inf, inf, inf]]]], dtype=torch.bfloat16,
       grad_fn=<DivBackward0>)
```

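For reference, the device output above can be sanity-checked in plain PyTorch: the printed value (~1.6948e38, copied from the output) is a large but finite bfloat16 number below the type's maximum (~3.3895e38), i.e. the device returns a finite value where inf is expected.

```python
import torch

# Value printed by the device for 1/0 (copied from the output above)
device_val = 1.6947656946257677e+38
bf16_max = torch.finfo(torch.bfloat16).max  # ~3.3895e38

# The device result is a large finite bfloat16 value, not inf
print(device_val < bf16_max)                               # True
print(torch.tensor(device_val).bfloat16().isinf().item())  # False
```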
Expected behavior

Computing 1/0 with recip should return the same result as div_unary with scalar=0 (inf, matching the PyTorch golden tensor).
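The expected semantics can be sketched with plain PyTorch alone (no tt_lib/device needed): reciprocal of zero and division by zero both produce inf in bfloat16, and the two paths agree elementwise.

```python
import torch

# Reference semantics in plain PyTorch (bfloat16), matching the golden tensors above
zeros = torch.zeros(1, 1, 32, 32, dtype=torch.bfloat16)
ones = torch.ones(1, 1, 32, 32, dtype=torch.bfloat16)

recip_result = torch.reciprocal(zeros)  # elementwise 1/0
div_result = ones / 0                   # analogue of div_unary with scalar=0

# Both paths agree: every element is inf
print(torch.isinf(recip_result).all().item())  # True
print(torch.equal(recip_result, div_result))   # True
```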


Aswinmcw commented 5 months ago

Hi @hschoi4448, @eyonland, @tt-aho: PR #7000 will fix this issue.

umadevimcw commented 1 month ago

@eyonland We need to add a reference to the docs/guidelines (as discussed on the call) to complete this issue, since it is related to storing an inf.

umadevimcw commented 2 weeks ago

@hschoi4448 The fix for this issue is available in this PR: https://github.com/tenstorrent/tt-metal/pull/11243 (due to hardware limitations, nan/inf are replaced with finite numbers). Kindly review it.
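For intuition, a minimal sketch of what "replacing nan/inf with finite numbers" could look like in plain PyTorch. This is a hypothetical illustration, not the actual PR's kernel code; the choice of bfloat16 max as the replacement value is an assumption.

```python
import torch

def replace_nonfinite_bf16(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: map inf/nan to finite bfloat16 values,
    mirroring the hardware limitation described in the PR."""
    max_val = torch.finfo(torch.bfloat16).max  # assumed replacement value
    # nan -> 0, +inf -> +max, -inf -> -max, computed in float32 then cast back
    clamped = torch.nan_to_num(t.float(), nan=0.0, posinf=max_val, neginf=-max_val)
    return clamped.bfloat16()

out = replace_nonfinite_bf16(torch.tensor([float("inf"), float("-inf"), float("nan")]))
print(out)  # every element is now finite
```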

razorback3 commented 2 weeks ago

@umadevimcw We will review it. Please give us 1-2 days :)